This repository was archived by the owner on May 25, 2022. It is now read-only.

Commit 12a74a5

added news scraper code and readme
1 parent 28f7a0f commit 12a74a5

File tree

6 files changed: +68 -0 lines changed

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
# Financial-news-scraper

A scraper built with Beautiful Soup 4 in Python, tailor-made for extracting news from moneycontrol.com. Open a pull request to contribute scrapers for other sites.

__The main page to start scraping from: https://www.moneycontrol.com/news/technical-call-221.html__

![](images/home.JPG)

__The program scrapes news from the next pages too, by extracting the website links behind these buttons__

![](images/nextpage.JPG)

__The resulting JSON file includes the heading, date, and image link of each story, indexed by page number__

![](images/result.JPG)
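For reference, a sketch of the dump's shape, inferred from how scrap() appends to the submission dict in the scraper code below; every title, date, and URL here is made up for illustration:

    {
      "0": [
        {"title": ["Buy Infosys; target of Rs 1500: Example Broker"]},
        {"date": ["May 24, 2022 09:15 AM IST"]},
        {"img_src": ["https://images.moneycontrol.com/example-170x102.jpg"]}
      ],
      "1": [
        {"title": ["..."]},
        {"date": ["..."]},
        {"img_src": ["..."]}
      ]
    }

Each page index maps to a list of three single-key dicts rather than one merged dict, because the code appends the title, date, and img_src collections separately.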
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
home.jpg - main news page

nextpage.jpg - links to next pages

result.jpg - snapshot of the resulting JSON file
Binary files added (no text diff): images/home.JPG (241 KB), images/nextpage.JPG (74.6 KB), images/result.JPG (287 KB)
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
import re
import json
import datetime
from collections import defaultdict

import requests
from tqdm import tqdm
from bs4 import BeautifulSoup

# scraped results, indexed by page number
submission = defaultdict(list)

# main url
src_url = 'https://www.moneycontrol.com/news/technical-call-221.html'


# get the next-page links and call scrap() on each one
def setup(url):
    src_page = requests.get(url).text
    src = BeautifulSoup(src_page, 'lxml')

    # ignore <a> tags whose href is a javascript void placeholder
    anchors = src.find("div", attrs={"class": "pagenation"}).findAll(
        'a', {'href': re.compile('^((?!void).)*$')})
    nextlinks = [i.attrs['href'] for i in anchors]
    for idx, link in enumerate(tqdm(nextlinks)):
        scrap('https://www.moneycontrol.com' + link, idx)


# scrape the passed page url
def scrap(url, idx):
    src_page = requests.get(url).text
    src = BeautifulSoup(src_page, 'lxml')

    span = src.find("ul", {"id": "cagetory"}).findAll('span')
    img = src.find("ul", {"id": "cagetory"}).findAll('img')

    # each <img> has its alt text set to the news heading, so the image
    # link and the heading come from the same tag
    imgs = [i.attrs['src'] for i in img]
    titles = [i.attrs['alt'] for i in img]
    date = [i.get_text() for i in span]

    # lists of dicts as values, indexed by page number
    submission[str(idx)].append({'title': titles})
    submission[str(idx)].append({'date': date})
    submission[str(idx)].append({'img_src': imgs})


# save the data as a JSON file named by the current date
def json_dump(data):
    # "%B %d, %Y" yields names like 'moneycontrol_May 25, 2022.json'
    date = datetime.date.today().strftime("%B %d, %Y")
    with open('moneycontrol_' + str(date) + '.json', 'w') as outfile:
        json.dump(data, outfile)


setup(src_url)
json_dump(submission)
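The pagination filter in setup() keeps a link only if its href contains no "void", via a tempered negative lookahead. A minimal standalone check of that pattern, with made-up sample hrefs:

    import re

    # matches only strings that contain no occurrence of "void"
    no_void = re.compile('^((?!void).)*$')

    print(bool(no_void.match('/news/technical-call-221/page-2/')))  # True: real link, kept
    print(bool(no_void.match('javascript:void(0);')))               # False: placeholder, skipped

Before consuming each character, (?!void) asserts that "void" does not start at that position, so the pattern can only reach $ on a void-free string.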

0 commit comments

