How do I scrape a website for information?

Asked 12 years, 10 months ago

Viewed 226 times

I want my program to automatically download only certain information from a website. After finding out that this is nearly impossible, I figured it would be best for the program to download the entire web page and then find the information I need inside the resulting string.

How can I find certain words or numbers that follow specific words? The word before the number I want is always the same. The number varies, and it is the value I need in my program.

edited Mar 5, 2013 at 10:23

J. Steen

asked Mar 5, 2013 at 10:21

platypusq

Could you please post some example text?
– BergListe
Commented Mar 5, 2013 at 10:23

First, make sure the word is unique; then you can use msdn.microsoft.com/de-de/library/…
– Vogel612
Commented Mar 5, 2013 at 10:24

I have edited your question for clarity and used phrases more known to the community. If any of my changes were incorrect, please make edits yourself to clarify your question.
– J. Steen
Commented Mar 5, 2013 at 10:24

Your question's a bit too vague. Provide more context and some example code if you can. 'Downloading certain information off a website' is not necessarily impossible depending on the details of it. Look into screen scraping.
– user1017882
Commented Mar 5, 2013 at 10:24

Any update on this issue?
– Joel Peltonen
Commented Sep 26, 2014 at 7:52

2 Answers

Sounds like screen scraping. I recommend using CsQuery https://github.com/jamietre/CsQuery (or HtmlAgilityPack if you prefer). Get the page source, parse it into a DOM object, loop over all text nodes, and do your string comparison there. The actual way of doing this depends a lot on how the source HTML is structured.

Maybe something like this untested example, written from memory (CsQuery):

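using CsQuery;

var dom = CQ.Create(stringWithHtml);
// "*" selects elements; Contents() should also return their child text nodes
dom["*"].Contents().Each((i, e) =>
{
    // handle only text nodes
    if (e.NodeType == NodeType.TEXT_NODE)
    {
        string text = e.NodeValue;
        // do your check on text here
    }
});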
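
For the check itself, one option is a regular expression that looks for the fixed word and captures the number right after it. The sketch below is only an illustration under assumptions: the class name `NumberAfterWord`, the method `Find`, and the sample text are all hypothetical, and it assumes the number is written with a `.` as the decimal separator and is separated from the word by at most whitespace and an optional `:` or `=`.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NumberAfterWord
{
    // Returns the first number that follows `keyword` in `text`,
    // or null when the keyword is not followed by a number.
    public static decimal? Find(string text, string keyword)
    {
        // Regex.Escape keeps characters such as '.' or '$' in the keyword literal.
        // Between keyword and number we allow whitespace and an optional ':' or '='.
        string pattern = Regex.Escape(keyword) + @"\s*[:=]?\s*(-?\d+(?:\.\d+)?)";
        Match match = Regex.Match(text, pattern);
        if (!match.Success)
        {
            return null;
        }
        // Parse with the invariant culture so "19.99" is read the same on every machine.
        return decimal.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        // Hypothetical text; the real page content is not shown in the question.
        string pageText = "Stock left: 42 units. Price: 19.99 EUR";
        Console.WriteLine(Find(pageText, "Price"));
        Console.WriteLine(Find(pageText, "Stock left"));
        Console.WriteLine(Find(pageText, "Weight") == null);
    }
}
```

Running the pattern on the text of individual nodes rather than on the raw HTML string avoids accidental matches inside tags or attributes. If the page uses thousands separators or a decimal comma, the pattern and the parsing culture would need to be adjusted.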

answered Mar 5, 2013 at 10:25

Joel Peltonen

I've used HTML Agility Pack for multiple applications and it works well. It has lots of options too.

It's a lovely HTML parser that is commonly recommended for this. It will take malformed HTML and massage it into XHTML and then a traversable DOM, like the XML classes. So it is very useful for the HTML you find in the wild.
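
As a rough illustration of what this can look like, here is a minimal sketch that collects the visible text of every text node with HTML Agility Pack. It assumes the `HtmlAgilityPack` NuGet package is referenced; the class name `TextNodeScan` and the method `TextNodes` are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class TextNodeScan
{
    // Yields the trimmed, entity-decoded text of each non-empty text node in `html`.
    public static IEnumerable<string> TextNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of malformed markup

        // The XPath "//text()" selects all text nodes, including those
        // inside <script> and <style>; SelectNodes can return null when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//text()");
        if (nodes == null)
        {
            yield break;
        }

        foreach (HtmlNode node in nodes)
        {
            // Turn entities such as &amp; back into plain characters.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                yield return text;
            }
        }
    }
}
```

Each returned string can then be checked for the fixed word and the number that follows it. When the value always sits in the same element, a narrower XPath expression than `//text()` is usually more robust than scanning every node.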

edited Nov 28, 2017 at 3:46

wp78de

answered Mar 5, 2013 at 10:29

jordanhill123
