
Commit a8ea1bb

Merge pull request avinashkranjan#802 from Ayushjain2205/dev.to-scraper

Dev.to scraper

2 parents cb7a633 + e446b57 commit a8ea1bb

3 files changed: +116 -0 lines changed

Dev.to Scraper/README.md

Lines changed: 29 additions & 0 deletions
# Dev.to Scraper

Running this script allows the user to scrape any number of articles from [dev.to](https://dev.to/), in any category of the user's choice.

## Setup instructions

To run this script, you need Python and pip installed on your system. Once both are installed, run the following command from the project folder (directory) to install the requirements:

```
pip install -r requirements.txt
```

As this script uses Selenium, you will also need to download the Chrome webdriver from [this link](https://sites.google.com/a/chromium.org/chromedriver/downloads).
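If you want to confirm the driver is set up correctly before running the scraper, a minimal sketch like the following should open and close a browser window (the path shown is a placeholder for wherever you saved chromedriver; the call mirrors how scraper.py constructs the driver):

```
# sanity check (sketch): confirm Selenium can launch Chrome with your driver
from selenium import webdriver

driver = webdriver.Chrome("/path/to/chromedriver")  # placeholder; use your actual path
driver.get("https://dev.to/")
print(driver.title)  # a non-empty title means the setup works
driver.quit()
```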
After satisfying all the requirements for the project, open a terminal in the project folder and run

```
python scraper.py
```

or

```
python3 scraper.py
```

depending on your Python version. Make sure you run the command from the same virtual environment in which the required modules are installed.
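For example, a typical workflow with Python's built-in venv module would be:

```
python -m venv venv
source venv/bin/activate    # on Windows: venv\Scripts\activate
pip install -r requirements.txt
python scraper.py
```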
## Output

The user is asked to enter a category and the number of articles:

![User is asked for input](https://i.postimg.cc/Qd8YfjXj/dev-scrapper1.png)

The scraped PDF files are saved in the folder from which the script is run:

![Files saved in folder](https://i.postimg.cc/FzXD34W5/dev-scrapper2.png)

## Author

[Ayush Jain](https://github.com/Ayushjain2205)

Dev.to Scraper/requirements.txt

Lines changed: 4 additions & 0 deletions

requests
beautifulsoup4
selenium
fpdf

Dev.to Scraper/scraper.py

Lines changed: 83 additions & 0 deletions
from bs4 import BeautifulSoup
from selenium import webdriver
import time
from fpdf import FPDF

# Get input for category and number of articles
category = input("Enter category: ")
number_articles = int(input("Enter number of articles: "))
driver_path = input("Enter chrome driver path: ")

url = 'https://dev.to/search?q={}'.format(category)

# Initiate the webdriver; the parameter is the path of the webdriver executable.
driver = webdriver.Chrome(driver_path)
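# Note (version caveat, not in the original): passing the driver path
# positionally works on Selenium 3; Selenium 4 removed it in favor of
# webdriver.Chrome(service=Service(driver_path)), with Service imported
# from selenium.webdriver.chrome.service.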
driver.get(url)

# Pause so the search results have time to load before grabbing the page source
time.sleep(5)
html = driver.page_source

# Parse the rendered page source with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
results_div = soup.find('div', {'id': 'substories'})
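# Note (hardening suggestion, not in the original): find() returns None when
# the element is missing, so a guard here would avoid an AttributeError below
# if dev.to changes its markup or the search returns no results.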
articles = results_div.find_all('article')

# Iterate over the search results, scraping each article into its own PDF
count = 0
for article in articles:
    # Relative URL of the post, taken from the story's hidden navigation link
    article_data = article.find('a', class_='crayons-story__hidden-navigation-link')['href']

    post_url = "https://dev.to{}".format(article_data)
    driver.get(post_url)
    time.sleep(5)  # give the article page time to load

    post_html = driver.page_source
    soup = BeautifulSoup(post_html, "html.parser")
    article_div = soup.find('div', {'class': 'article-wrapper'})
    article_content = article_div.find('article', {'id': 'article-show-container'})

    # Title of the post
    header_tag = article_content.find('header', class_='crayons-article__header')
    title_div = header_tag.find('div', class_='crayons-article__header__meta')
    title_content = title_div.find('h1')

    # Author of the post
    author_tag = title_div.find('div', class_='crayons-article__subheader')
    author_name = author_tag.find('a', class_='crayons-link')

    # Body of the post
    article_content_div = article_content.find('div', class_='crayons-article__main')
    article_content_body = article_content_div.find('div', class_='crayons-article__body')
    p_tags = article_content_body.find_all('p')

    # FPDF's built-in fonts only support latin-1, so replace unsupported characters
    title_string = title_content.text.strip().encode('latin-1', 'replace').decode('latin-1')
    author_string = "By - {}".format(author_name.text.strip()).encode('latin-1', 'replace').decode('latin-1')

    # Create the PDF and add a page
    pdf = FPDF()
    pdf.add_page()
    # Set style and size of font
    pdf.set_font("Arial", size=12)

    # Title cell
    pdf.cell(200, 5, txt=title_string, ln=1, align='C')
    # Author cell
    pdf.cell(200, 10, txt=author_string, ln=2, align='C')

    for p_tag in p_tags:
        article_part = p_tag.text.strip().encode('latin-1', 'replace').decode('latin-1')
        # Add this paragraph of the article to the pdf
        pdf.multi_cell(0, 5, txt=article_part, align='L')

    # Save the pdf, keeping only the alphanumeric characters of the title as the filename
    pdf_title = ''.join(e for e in title_string if e.isalnum())
    pdf.output("{}.pdf".format(pdf_title))
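    # Note (caveat, not in the original): two posts whose titles reduce to the
    # same alphanumeric string produce the same filename, so the later PDF
    # silently overwrites the earlier one.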
    count = count + 1
    if count == number_articles:
        break

driver.quit()  # end the browser session and release the webdriver
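One note on the fixed time.sleep(5) calls above: they work, but an explicit wait is usually more reliable. A minimal sketch using Selenium's own WebDriverWait, targeting the same substories container the script already parses, might look like this:

```
# sketch: wait up to 10 seconds for the results container instead of a fixed sleep
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "substories"))
)
html = driver.page_source  # now safe to parse
```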
