from bs4 import BeautifulSoup
from lxml import etree
import requests
import json
import datetime
import sys

# Util: parse a YYYY-MM-DD string into a datetime.date
def datestr_to_date(datestr):
    [year, month, day] = datestr.split('-')
    return datetime.date(
        year=int(year),
        month=int(month),
        day=int(day)
    )

# Reference point: the ET archive addresses each day by a numeric "starttime" id;
# 2001-01-01 corresponds to id 36892, and the id increases by one per day.
reference_date = datetime.date(2001, 1, 1)
reference_date_id = 36892

if len(sys.argv) < 3:
    print('Usage: economictimes_scraper.py START_DATE END_DATE\nDate format: YYYY-MM-DD')
    sys.exit(1)

start_date = datestr_to_date(sys.argv[1])
end_date = datestr_to_date(sys.argv[2])
start_dateid = reference_date_id + (start_date - reference_date).days
end_dateid = reference_date_id + (end_date - reference_date).days

if (start_date - reference_date).days < 0:
    print('Error: Start date must not be earlier than 2001-01-01')
    sys.exit(1)
if (end_date - start_date).days < 0:
    print('Error: End date must not be earlier than Start date')
    sys.exit(1)


# Fetch the JSON-LD metadata embedded in a news article page, given its URL
def fetch_news_article(url):
    html = requests.get(url).content
    root = etree.HTML(html)
    x = root.xpath("/html/body//script[@type='application/ld+json']")
    metadata = None  # stays None when the article does not exist (404) or has no JSON-LD
    if len(x) >= 2:
        metadata = x[1].text  # the second ld+json block carries the article metadata
    return metadata

et_host = 'https://economictimes.indiatimes.com'
et_date_url = 'https://economictimes.indiatimes.com/archivelist/starttime-'
et_date_extension = '.cms'

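# For illustration: with the constants above, the archive page for the reference
# date 2001-01-01 (dateid 36892) would be requested from
#   https://economictimes.indiatimes.com/archivelist/starttime-36892.cms
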
fetched_data = {}

# Walk each day in the requested range: fetch that day's archive listing page,
# then fetch metadata for every article linked from it.
for dateid in range(start_dateid, end_dateid + 1):
    date = str(reference_date + datetime.timedelta(days=dateid - reference_date_id))
    html = requests.get('{}{}{}'.format(et_date_url, dateid, et_date_extension)).content
    soup = BeautifulSoup(html, 'html.parser')
    fetched_data[date] = []
    for x in soup.select('#pageContent table li a'):
        print(x.text)
        article_metadata = fetch_news_article(et_host + x['href'])
        fetched_data[date].append({
            "metadata": article_metadata,
            "title": x.text,
            "url": et_host + x['href']
        })

# Dump everything collected into a single JSON file named after the date range.
out_filename = 'ET_NewsData_{}_{}.json'.format(start_date, end_date)
with open(out_filename, 'w') as output_file:
    output_file.write(json.dumps(fetched_data, indent=2))
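
# Example usage (a sketch; the script name matches the usage message above, and the
# dates are arbitrary illustrations):
#   python economictimes_scraper.py 2021-01-01 2021-01-07
# This writes ET_NewsData_2021-01-01_2021-01-07.json, mapping each date string to a
# list of {"metadata", "title", "url"} records for that day's articles.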