This repository was archived by the owner on May 25, 2022. It is now read-only.

Commit 12a74a5

added news scraper code and readme
1 parent 28f7a0f commit 12a74a5

File tree

6 files changed: +68 -0 lines changed

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
# Financial-news-scraper

A scraper built with Beautiful Soup 4 in Python, tailor-made for extracting news from moneycontrol.com. Open a pull request to contribute scrapers for other sites.

__The main page to start scraping from: https://www.moneycontrol.com/news/technical-call-221.html__

![](images/home.JPG)

__The program scrapes news from the next pages too, by extracting the website links behind these buttons__

![](images/nextpage.JPG)

__The resulting JSON file includes the heading, date, and image link of each story, indexed by page number__

![](images/result.JPG)
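For reference, a sketch of the dump's shape, inferred from how scrap() appends to the submission dict in the scraper code below; every title, date, and URL here is made up for illustration:

    {
      "0": [
        {"title": ["Buy Infosys; target of Rs 1500: Example Broker"]},
        {"date": ["May 24, 2022 09:15 AM IST"]},
        {"img_src": ["https://images.moneycontrol.com/example-170x102.jpg"]}
      ],
      "1": [
        {"title": ["..."]},
        {"date": ["..."]},
        {"img_src": ["..."]}
      ]
    }

Each page index maps to a list of three single-key dicts rather than one merged dict, because the code appends the title, date, and img_src collections separately.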
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
home.jpg - main news page

nextpage.jpg - links to next pages

result.jpg - snapshot of the resulting JSON file
Binary files added (no text diff): images/home.JPG (241 KB), images/nextpage.JPG (74.6 KB), images/result.JPG (287 KB)
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
import re
import json
import datetime
from collections import defaultdict

import requests
from tqdm import tqdm
from bs4 import BeautifulSoup

# scraped results, indexed by page number
submission = defaultdict(list)

# main url
src_url = 'https://www.moneycontrol.com/news/technical-call-221.html'


# get the next-page links and call scrap() on each one
def setup(url):
    src_page = requests.get(url).text
    src = BeautifulSoup(src_page, 'lxml')

    # ignore <a> tags whose href is a javascript void placeholder
    anchors = src.find("div", attrs={"class": "pagenation"}).findAll(
        'a', {'href': re.compile('^((?!void).)*$')})
    nextlinks = [i.attrs['href'] for i in anchors]
    for idx, link in enumerate(tqdm(nextlinks)):
        scrap('https://www.moneycontrol.com' + link, idx)


# scrape the passed page url
def scrap(url, idx):
    src_page = requests.get(url).text
    src = BeautifulSoup(src_page, 'lxml')

    span = src.find("ul", {"id": "cagetory"}).findAll('span')
    img = src.find("ul", {"id": "cagetory"}).findAll('img')

    # each <img> has its alt text set to the news heading, so the image
    # link and the heading come from the same tag
    imgs = [i.attrs['src'] for i in img]
    titles = [i.attrs['alt'] for i in img]
    date = [i.get_text() for i in span]

    # lists of dicts as values, indexed by page number
    submission[str(idx)].append({'title': titles})
    submission[str(idx)].append({'date': date})
    submission[str(idx)].append({'img_src': imgs})


# save the data as a JSON file named by the current date
def json_dump(data):
    # "%B %d, %Y" yields names like 'moneycontrol_May 25, 2022.json'
    date = datetime.date.today().strftime("%B %d, %Y")
    with open('moneycontrol_' + str(date) + '.json', 'w') as outfile:
        json.dump(data, outfile)


setup(src_url)
json_dump(submission)
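The pagination filter in setup() keeps a link only if its href contains no "void", via a tempered negative lookahead. A minimal standalone check of that pattern, with made-up sample hrefs:

    import re

    # matches only strings that contain no occurrence of "void"
    no_void = re.compile('^((?!void).)*$')

    print(bool(no_void.match('/news/technical-call-221/page-2/')))  # True: real link, kept
    print(bool(no_void.match('javascript:void(0);')))               # False: placeholder, skipped

Before consuming each character, (?!void) asserts that "void" does not start at that position, so the pattern can only reach $ on a void-free string.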

0 commit comments

