Commit 8302b5b

Merge pull request avinashkranjan#1075 from Ayushjain2205/medium-scraper
Medium scraper
2 parents 42a7c46 + 5aee833 commit 8302b5b

File tree

3 files changed: +115 additions, -0 deletions


Medium-Scraper/README.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# Medium Scraper

Running this script allows the user to scrape any number of articles from [medium.com](https://medium.com/), from any category of the user's choice.

## Setup instructions

To run this script, you need Python and pip installed on your system. Once both are installed, run the following command from the project folder (directory) to install the requirements:

```
pip install -r requirements.txt
```

Since this script uses Selenium, you will also need to install the Chrome WebDriver from [this link](https://sites.google.com/a/chromium.org/chromedriver/downloads).

After satisfying all the requirements, open a terminal in the project folder and run

```
python scraper.py
```

or

```
python3 scraper.py
```

depending on your Python version. Make sure you run the command from the same virtual environment in which the required modules are installed.

## Output

The user is asked to enter a category and the number of articles.

![User is asked for input](https://i.postimg.cc/V6ZGDn8V/output1.png)

The scraped PDF files are saved in the folder from which the script is run.

![Files saved in folder](https://i.postimg.cc/J7DVS42k/output2.png)

## Author

[Ayush Jain](https://github.com/Ayushjain2205)
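The README's note about running from a virtual environment can be followed with a minimal setup sketch (assuming a Unix-like shell with `python3` on the PATH; the environment name `venv` is arbitrary and Windows activation differs):

```shell
# Create an isolated environment for the project ("venv" is an arbitrary name)
python3 -m venv venv
# Activate it for the current shell session (on Windows: venv\Scripts\activate)
. venv/bin/activate
# Then, inside the activated environment, install the dependencies:
#   pip install -r requirements.txt
```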

Medium-Scraper/requirements.txt

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
requests
beautifulsoup4
selenium
fpdf

Medium-Scraper/scraper.py

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
from bs4 import BeautifulSoup
from selenium import webdriver
import time
from fpdf import FPDF

# Get input for category and number of articles
category = input("Enter category (e.g. Programming or javascript): ")
number_articles = int(input("Enter number of articles: "))
driver_path = input("Enter chrome driver path: ")

url = 'https://medium.com/topic/{}'.format(category)

# Initiate the webdriver in incognito mode
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")
driver = webdriver.Chrome(driver_path, options=chrome_options)
driver.get(url)

# Wait to ensure that the page has loaded
time.sleep(5)
html = driver.page_source

# Parse the rendered topic page with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
articles = soup.find_all('section')

# Fetch each article from Medium and save it as a PDF
num = number_articles
for article in articles:
    link = article.find('a')
    if link is None or not link.get('href'):
        continue
    article_data = link['href']
    if article_data[0] == '/':
        article_data = 'https://medium.com' + article_data

    post_url = article_data
    driver.get(post_url)
    time.sleep(5)

    post_html = driver.page_source
    soup = BeautifulSoup(post_html, "html.parser")
    a_tags = soup.find_all('a')

    # The author link is the third anchor on the page (fragile if Medium's layout changes)
    author = a_tags[2].text

    title = soup.find('h1').text.strip()
    section = soup.find_all('section')[1]
    p_tags = section.find_all('p')

    # FPDF's core fonts only support latin-1, so replace unsupported characters
    title_string = title.encode('latin-1', 'replace').decode('latin-1')
    author_string = author.encode('latin-1', 'replace').decode('latin-1')

    # Add a page to the pdf
    pdf = FPDF()
    pdf.add_page()
    # Set font style and size for the pdf
    pdf.set_font("Arial", size=12)

    # Title cell
    pdf.cell(200, 5, txt=title_string, ln=1, align='C')
    # Author cell
    pdf.cell(200, 10, txt=author_string, ln=2, align='C')

    for p_tag in p_tags:
        article_part = p_tag.text.strip().encode('latin-1', 'replace').decode('latin-1')
        article_part += '\n'
        # Add this paragraph of the article to the pdf
        pdf.multi_cell(0, 5, txt=article_part, align='L')

    # Save the pdf as <alphanumeric-only title>.pdf
    pdf_title = ''.join(e for e in title if e.isalnum())
    pdf.output("{}.pdf".format(pdf_title))

    num = num - 1
    if num == 0:
        break

driver.quit()  # Quit the webdriver and end the browser session
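The two string transforms the script relies on are worth isolating: FPDF's built-in core fonts only cover latin-1, so characters outside that range are replaced with `?`, and the output filename keeps only alphanumeric characters from the title. A minimal sketch of both (the helper names `latin1_safe` and `pdf_filename` are illustrative, not part of the original script):

```python
def latin1_safe(text):
    # FPDF's core fonts only support latin-1; replace anything else with '?'
    return text.encode('latin-1', 'replace').decode('latin-1')

def pdf_filename(title):
    # Keep only alphanumeric characters so the title is a safe filename
    return ''.join(ch for ch in title if ch.isalnum()) + '.pdf'

print(latin1_safe('Caf\u00e9\u2019s guide'))         # 'é' is in latin-1 and survives; the curly quote becomes '?'
print(pdf_filename('Why Python? A 10-minute read'))  # spaces and punctuation are dropped
```

For full Unicode output, FPDF also supports registering a TrueType font with `pdf.add_font(..., uni=True)`, which avoids the lossy `replace` step entirely.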
