Commit 9ca7bd1

Merge pull request #1096 from vivekthedev/main
Add new Script StackOverflow Scraper
2 parents b1740fd + ffb593f commit 9ca7bd1

5 files changed: +100 -0 lines changed
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# StackOverflow Question Scraper

## Aim

The main aim of the project is to scrape the first page of questions for a tag from StackOverflow (up to 15, per `pagesize=15` in the script's URL) and store them in a serialized format such as a JSON file.
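
As a rough sketch of how that serialized output might be consumed, the snippet below loads a file in the shape `scrape.py` writes (a JSON list of objects with `votes`, `title`, `link`, and `date` keys, all stored as strings) and sorts it by vote count. The `load_questions` helper, the sample records, and the temporary file are assumptions made for illustration; they are not part of the project.

```python
import json
import os
import tempfile


def load_questions(path):
    """Load scraped questions and sort them by vote count, highest first."""
    with open(path, encoding="utf-8") as f:
        questions = json.load(f)

    def vote_count(question):
        # votes are stored as text; treat anything non-numeric as 0
        try:
            return int(question.get("votes", ""))
        except (TypeError, ValueError):
            return 0

    return sorted(questions, key=vote_count, reverse=True)


# Hypothetical placeholder records in the same shape as the script's output.
sample = [
    {"votes": "3", "title": "Example A", "link": "https://stackoverflow.com/questions/0", "date": "placeholder"},
    {"votes": "12", "title": "Example B", "link": "https://stackoverflow.com/questions/0", "date": "placeholder"},
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "questions-example.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    for question in load_questions(path):
        print(question["votes"], question["title"])
```

Sorting on an integer rather than on the raw string matters here because `"12"` would otherwise sort before `"3"`.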

## Purpose

The purpose of the project is to give users a quick way to see the top questions for a given tag.

## Setup instructions

- The script uses BeautifulSoup to scrape content from the website.
- To install the pinned dependency versions, run `pip install -r requirements.txt` in your terminal.
- After installing the dependencies, run `python scrape.py`.
- Enter the tag you want to scrape and the filter number when prompted; the results are written to `questions-<tag>.json`.
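
The last step can be sketched without the interactive prompts. The example below reuses the URL format and the six filter names from `scrape.py`, treats the filter number as 1-based to match the printed menu, and falls back to `Votes` for invalid input. The `build_url` helper, the `URL_TEMPLATE` and `TABS` names, and the percent-encoding of the tag are assumptions for illustration, not the script's own code.

```python
from urllib.parse import quote

# Same URL shape and filter names as scrape.py; the constant names are hypothetical.
URL_TEMPLATE = "https://stackoverflow.com/questions/tagged/{tag}?tab={tab}&pagesize=15"
TABS = ["Newest", "Active", "Bounties", "Unanswered", "Frequent", "Votes"]


def build_url(tag, choice, default="Votes"):
    """Map a 1-based menu choice to a tab name and return the listing URL."""
    try:
        index = int(choice) - 1
    except (TypeError, ValueError):
        index = -1  # non-numeric input: force the fallback below
    tab = TABS[index] if 0 <= index < len(TABS) else default
    # percent-encode the tag so characters such as "#" are not read as a URL fragment
    return URL_TEMPLATE.format(tag=quote(tag.strip(), safe=""), tab=tab)


print(build_url("python", 1))
# prints https://stackoverflow.com/questions/tagged/python?tab=Newest&pagesize=15
```

Keeping the mapping in a small function makes the off-by-one between menu numbers and list indices easy to check in isolation.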


## Output

![](./images/execution.png)
<br/><br/><br/>
![](./images/ouput.png)

## Author

[Vivek Kumar Singh](https://github.com/vivekthedev)
Binary image file, 30.3 KB
Binary image file, 185 KB
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
beautifulsoup4==4.11.1
bs4==0.0.1
certifi==2022.9.24
charset-normalizer==2.1.1
idna==3.4
requests==2.28.1
soupsieve==2.3.2.post1
urllib3==1.26.12
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
from urllib.parse import quote
import json

from bs4 import BeautifulSoup
import requests

fmt = "https://stackoverflow.com/questions/tagged/{tag}?tab={filter}&pagesize=15"
filters = [
    "1. Newest",
    "2. Active",
    "3. Bounties",
    "4. Unanswered",
    "5. Frequent",
    "6. Votes",
]

tag = input("enter any question tag (python, java)\n").strip()
print("\n".join(filters))

# the menu is numbered from 1, so subtract 1 to index the list;
# fall back to "Votes" on non-numeric or out-of-range input
try:
    choice = int(input("enter the filter number (1, 3, 5)\n"))
    if not 1 <= choice <= len(filters):
        raise ValueError("filter number out of range")
    selected = filters[choice - 1].split(" ")[-1]
except ValueError:
    selected = "Votes"

# generate dynamic URL with user preferences (quote() keeps tags such as "c#" intact)
URL = fmt.format(tag=quote(tag, safe=""), filter=selected)

print("generated URL ", URL)
content = requests.get(URL, timeout=30).content

# "html.parser" ships with Python; "lxml" is not listed in requirements.txt
soup = BeautifulSoup(content, "html.parser")


# return only question tags
def is_question(element):
    element_id = element.get("id")
    return isinstance(element_id, str) and element_id.startswith("question-summary-")


questions = soup.find_all(is_question)
question_data = []
if questions:
    # extract question data like votes, title, link and date
    for question in questions:
        question_dict = {}
        question_dict["votes"] = (
            question.find(class_="s-post-summary--stats-item-number").get_text().strip()
        )
        h3 = question.find(class_="s-post-summary--content-title")
        question_dict["title"] = h3.get_text().strip()
        question_dict["link"] = "https://stackoverflow.com" + h3.find("a").get("href")
        question_dict["date"] = (
            question.find(class_="s-user-card--time").span.get_text().strip()
        )
        question_data.append(question_dict)

    with open(f"questions-{tag}.json", "w") as f:
        json.dump(question_data, f)

    print("file exported")

else:
    print(URL)
    print("looks like there are no questions matching your tag ", tag)
