In .NET, what is the best way to scrape HTML web pages?
Is there something open source that runs on .NET Framework 2 and puts all the HTML into objects? I have read about "HTML Agility Pack", but is there anything else?
asked Jul 17, 2012 at 11:13
Hello-World
- Why did you tag this with c# and vb.net? – ThiefMaster, Jul 17, 2012 at 11:16
- Are you trying to scrape pages, or process pages? Do you need to look at the contextual information from the DOM, or just spider to duplicate? – nyxthulhu, Jul 17, 2012 at 11:17
- I want the VB.NET code to open the page, look at the HTML, then take what it needs. I thought of .NET as it has more power than JavaScript, and the sites won't be on my server. – Hello-World, Jul 17, 2012 at 11:20
- What are you looking to extract from the HTML you get back? – dtsg, Jul 17, 2012 at 11:21
- developer.mindtouch.com/SgmlReader – Govind Malviya, Jul 17, 2012 at 11:21
2 Answers
I think HtmlAgilityPack is the best, but you can also use:
- Fizzler: a CSS selector engine for C#
- SgmlReader: converts HTML to valid XML
- SharpQuery: an alternative to Fizzler
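For comparison with the alternatives listed above, here is a minimal sketch of what "putting the HTML into objects" looks like with HtmlAgilityPack. It assumes the HtmlAgilityPack assembly is referenced; the `LinkScraper` class, the `ExtractLinks` method, and the sample markup are hypothetical names made up for illustration. It avoids `var` and LINQ so that it should also compile against older framework versions.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkScraper
{
    // Returns the href value of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: does not require well-formed markup

        List<string> links = new List<string>();

        // SelectNodes takes an XPath expression and returns null (not an
        // empty collection) when nothing matches, so guard against that.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }

    public static void Main()
    {
        // Hypothetical sample input; note the unclosed <p> tag.
        string html = "<html><body><a href='/one'>One</a><p>text<a href='/two'>Two</a></body></html>";
        foreach (string href in ExtractLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

To scrape a live page, download the markup first (for example with `System.Net.WebClient.DownloadString`) and pass the resulting string to `ExtractLinks`.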
answered Jul 17, 2012 at 11:31
Govind Malviya
You might use Tidy.NET (http://sourceforge.net/projects/tidynet/), a C# wrapper for the Tidy library that converts HTML to XHTML, so you get valid XML and can process it as such.
I'd do it this way:
// Requires a reference to Tidy.NET. XDocument needs .NET 3.5 or later;
// on .NET 2.0, load the output with System.Xml.XmlDocument.LoadXml instead.
using System.IO;
using System.Text;
using System.Xml.Linq;
using TidyNet;

var t = new Tidy();
var messages = new TidyMessageCollection(); // collects parser warnings and errors
t.Options.Xhtml = true; // emit XHTML so the result can be parsed as XML
// extra options if you plan to edit the result by hand
t.Options.IndentContent = true;
t.Options.SmartIndent = true;
t.Options.DropEmptyParas = true;
t.Options.DropFontTags = true;
t.Options.BreakBeforeBR = true;
string sInput = "your html code goes here";
byte[] bytes = Encoding.UTF8.GetBytes(sInput);
string sOut;
// dispose both streams once the cleaned markup has been read back
using (var msIn = new MemoryStream(bytes))
using (var msOut = new MemoryStream())
{
    t.Parse(msIn, msOut, messages);
    sOut = Encoding.UTF8.GetString(msOut.ToArray());
}
XDocument doc = XDocument.Parse(sOut);
// process XML as you like
Otherwise, HTML Agility Pack is OK.
answered Jul 17, 2012 at 13:05
Max Lambertini