Commit acd172c

Merge pull request avinashkranjan#674 from paulamib123/news_scrapper
News scrapper
2 parents 426c726 + f52ab4d commit acd172c

File tree: 2 files changed, +127 −0 lines changed

News_Scrapper/README.md

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
# News Scrapper (India Today)

This script fetches the top 10 headlines from India Today for a category entered by the user and stores them in a CSV file.

## Setup instructions

This script requires the following:

1. python3
2. requests module
3. beautifulsoup4 module

To install the modules, run:

```bash
pip3 install requests
```

```bash
pip3 install beautifulsoup4
```

To run the script, run:

```bash
cd News_Scrapper
```

```bash
python3 scrapper.py
```

## Output

![script.png](https://i.postimg.cc/CKRH9rBD/script.png)

![excel-csv.png](https://i.postimg.cc/ZKmH0L2t/excel-csv.png)

## By [Paulami Bhattacharya](https://github.com/paulamib123)
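
The CSV file written by the script has the columns `Date`, `Link`, and `Headline`. As a minimal sketch of reading it back with Python's standard `csv` module (the sample row below is invented for illustration; real files are named e.g. `topTenworldNews.csv`):

```python
import csv
import io

# Invented sample matching the column layout the script writes:
# Date, Link, Headline.
sample = io.StringIO(
    "Date,Link,Headline\n"
    "2023-06-15,https://www.indiatoday.in/world/story-2023-06-15,Example headline\n"
)

rows = list(csv.reader(sample))
header, data = rows[0], rows[1:]
print(header)     # ['Date', 'Link', 'Headline']
print(len(data))  # 1
```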

News_Scrapper/scrapper.py

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
1+
from bs4 import BeautifulSoup
2+
import requests
3+
import csv
4+
5+
URL = "https://www.indiatoday.in/"
6+
7+
def writeToCSV(topTenNews, category):
8+
with open("topTen" + category + "News.csv", "w") as file:
9+
writer = csv.writer(file)
10+
writer.writerow(["Date", "Link", "Headline"])
11+
for news in topTenNews:
12+
writer.writerow([news[2], "https://www.indiatoday.in/" + news[1], news[0]])
13+
14+
def getTopTenFromDivTag(category):
15+
topTenNews = []
16+
count = 0
17+
category_url = URL + category
18+
19+
page = requests.get(category_url)
20+
soup = BeautifulSoup(page.text, "html.parser")
21+
22+
all_div_tags = soup.find_all(class_="detail")
23+
24+
for div in all_div_tags:
25+
count += 1
26+
if count > 10:
27+
break
28+
headline = div.find("h2").text
29+
link = div.find("a").attrs["href"]
30+
date = div.find("a").attrs["href"][-10:]
31+
topTenNews.append([headline, link, date])
32+
33+
return topTenNews
34+
35+
def getTopTenFromLiTag(category):
36+
topTenNews = []
37+
count = 0
38+
category_url = URL + category
39+
40+
page = requests.get(category_url)
41+
soup = BeautifulSoup(page.text, "html.parser")
42+
43+
ul_tag = soup.find_all(class_="itg-listing")
44+
ul_tag = str(ul_tag)[25:-6]
45+
li_tags =ul_tag.split("</li>")
46+
47+
for li in li_tags:
48+
count += 1
49+
if count > 10:
50+
break
51+
ele = li.split(">")
52+
link = ele[1].split("=")[1][2:-1]
53+
headline = ele[2][:-3]
54+
date = link[-10:]
55+
topTenNews.append([headline, link, date])
56+
57+
return topTenNews
58+
59+
def main():
60+
61+
categories = ["india", "world", "cities", "business", "health", "technology", "sports",
62+
"education", "lifestyle"]
63+
64+
print("Please Choose a Category from the following list")
65+
66+
for index, category in enumerate(categories):
67+
print(str(index + 1) + ". " + category.capitalize())
68+
69+
print("Example: Enter 'world' for top 10 world news")
70+
print()
71+
72+
category = input()
73+
category = category.lower()
74+
75+
if category not in categories:
76+
print("\nPlease choose a valid category!")
77+
exit()
78+
79+
if category in categories[:5]:
80+
topTenNews = getTopTenFromDivTag(category)
81+
else:
82+
topTenNews = getTopTenFromLiTag(category)
83+
84+
writeToCSV(topTenNews, category)
85+
86+
print("Created CSV File Successfully!")
87+
88+
main()
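
The date handling in both scraper functions assumes India Today story URLs end with a date in `YYYY-MM-DD` form, so slicing the last 10 characters of the link yields the publication date. A minimal sketch of that assumption (the URL below is an invented example, not a real story link):

```python
# Invented example URL following the pattern the script assumes:
# the last 10 characters are the publication date (YYYY-MM-DD).
link = "/world/story/example-headline-2023-06-15"
date = link[-10:]
print(date)  # 2023-06-15
```

If the site changes its URL scheme, this slice silently returns the wrong text, so a stricter version could validate the result with a date parse before writing it to the CSV.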
