Commit 2c317db

authored

Add files via upload

1 parent 93c9ef7 commit 2c317db

Lines changed: 23 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,23 @@`
	`1`	`+import requests`
	`2`	`+from bs4 import BeautifulSoup`
	`3`	`+`
	`4`	`+def spider(max_pages):  # maximum number of pages to crawl`
	`5`	`+    page = 1`
	`6`	`+    while page <= max_pages:`
	`7`	`+        url = 'https://en.wikipedia.org/wiki/Hello'  # URL to crawl`
	`8`	`+        source_code = requests.get(url)  # fetch the page; returns a Response object`
	`9`	`+        plain_text = source_code.text  # the response body decoded to a string (the page's HTML),`
	`10`	`+        # without the HTTP status line and headers`
	`11`	`+        soup = BeautifulSoup(plain_text, 'html.parser')  # parse the HTML so it can be searched by tag and class`
	`12`	`+`
	`13`	`+        for link in soup.find_all('a', {'class': 'mw-disambig'}):  # find every <a> tag with the class of the titles we want`
	`14`	`+            # "mw-disambig" is the class name on the title links in the page`
	`15`	`+            href = link.get('href')  # read the 'href' (URL) of that title link only`
	`16`	`+            print(href)`
	`17`	`+        page += 1`
	`18`	`+`
	`19`	`+spider(1)`
	`20`	`+`
	`21`	`+`
	`22`	`+`
	`23`	`+`
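
Review note: the `href` values that `spider` prints are taken verbatim from the page markup, and on a wiki page these are often site-relative paths rather than full URLs. The sketch below shows one way to turn them into absolute URLs with the standard library's `urllib.parse.urljoin`. The helper name `absolute_links`, the `BASE_URL` constant, and the sample inputs are assumptions made for illustration; none of them is part of this commit.

```python
from urllib.parse import urljoin

# Assumption: the page URL hard-coded in spider() is the base for resolution.
BASE_URL = 'https://en.wikipedia.org/wiki/Hello'


def absolute_links(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url, skipping missing (None) values."""
    resolved = []
    for href in hrefs:
        if href is None:  # link.get('href') returns None when the attribute is absent
            continue
        resolved.append(urljoin(base_url, href))
    return resolved


if __name__ == '__main__':
    # Illustrative inputs only, not output captured from a real crawl.
    sample = ['/wiki/Hello_(disambiguation)', None, 'https://example.org/page']
    for url in absolute_links(sample):
        print(url)
```

Inside `spider`, the loop could collect each `href` into a list and pass that list to a helper like this before printing, which would also make the output usable as input for a later crawl step.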
