In .NET, what is the best way to scrape HTML web pages?

Is there something open source that runs on .NET Framework 2 and puts all the HTML into objects? I have read about "HTML Agility Pack", but is there anything else?

asked Jul 17, 2012 at 11:13
  • Why did you tag this with c# and vb.net? Commented Jul 17, 2012 at 11:16
  • Are you trying to scrape pages, or process pages? Do you need to look at the contextual information from the DOM, or just spider to duplicate? Commented Jul 17, 2012 at 11:17
  • I want the VB.NET code to open the page, look at the HTML, then take what it needs. I thought of .NET because it has more power than JavaScript and the sites won't be on my server. Commented Jul 17, 2012 at 11:20
  • What are you looking to extract from the HTML you get back? Commented Jul 17, 2012 at 11:21
  • developer.mindtouch.com/SgmlReader Commented Jul 17, 2012 at 11:21

2 Answers

I think HtmlAgilityPack is the best option, but you can also use:

  1. Fizzler: a CSS selector engine for C#
  2. SgmlReader: converts HTML to valid XML
  3. SharpQuery: an alternative to Fizzler
answered Jul 17, 2012 at 11:31

You might use Tidy.NET, a C# wrapper for the Tidy library that converts HTML to XHTML, available here: http://sourceforge.net/projects/tidynet/ That way you get valid XML and can process it as such.

I'd do it this way:

 // Requires: using System.IO; using System.Text; using System.Xml.Linq; using TidyNet;
 var t = new Tidy();
 var messages = new TidyMessageCollection();
 t.Options.Xhtml = true;
 // Extra options if you plan to edit the result by hand
 t.Options.IndentContent = true;
 t.Options.SmartIndent = true;
 t.Options.DropEmptyParas = true;
 t.Options.DropFontTags = true;
 t.Options.BreakBeforeBR = true;
 string sInput = "your html code goes here";
 byte[] bytes = Encoding.UTF8.GetBytes(sInput);
 string sOut;
 // Dispose both streams once Tidy has written its output
 using (var msIn = new MemoryStream(bytes))
 using (var msOut = new MemoryStream())
 {
     t.Parse(msIn, msOut, messages);
     sOut = Encoding.UTF8.GetString(msOut.ToArray());
 }
 // XDocument needs .NET 3.5 or later; on .NET 2.0, load sOut into an XmlDocument instead
 XDocument doc = XDocument.Parse(sOut);
 // Process the XML as you like
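
To illustrate the final "process XML" step, here is a minimal sketch that pulls every link out of an XHTML string with LINQ to XML. The class name `XhtmlLinks`, the method `Extract`, and the sample markup are made up for illustration, and the sketch assumes the converted document carries the standard XHTML namespace, as XHTML documents normally do. It uses only `System.Xml.Linq`, so it needs .NET 3.5 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper, not part of Tidy.NET or any other library.
static class XhtmlLinks
{
    // Elements in an XHTML document live in this namespace, so
    // queries must qualify element names with it.
    static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Returns one "href -> text" entry per <a> element that has an href.
    public static List<string> Extract(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        var links = new List<string>();
        foreach (XElement a in doc.Descendants(Xhtml + "a"))
        {
            XAttribute href = a.Attribute("href");
            if (href != null)
            {
                links.Add(href.Value + " -> " + a.Value);
            }
        }
        return links;
    }
}

static class Program
{
    static void Main()
    {
        // Sample input standing in for the converted XHTML (an assumption).
        string sample =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<a href=\"/a\">First</a><a name=\"x\">No link</a>" +
            "<a href=\"/b\">Second</a></body></html>";

        foreach (string link in XhtmlLinks.Extract(sample))
        {
            Console.WriteLine(link);
        }
    }
}
```

If a query like this returns nothing on real output, the usual cause is a missing or different namespace on the root element; in that case, query with the unqualified name `"a"` instead.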

Otherwise, HTML Agility Pack is OK.

answered Jul 17, 2012 at 13:05