Commit 9ca7bd1

authored

Merge pull request #1096 from vivekthedev/main

Add new Script StackOverflow Scraper

2 parents b1740fd + ffb593f commit 9ca7bd1

File tree

5 files changed, +100 -0 lines changed

WebScrapingScripts/StackOverflow Question Scraper
- README.md
- images
  - execution.png
  - ouput.png
- requirements.txt
- scrape.py

`‎WebScrapingScripts/StackOverflow Question Scraper/README.md`

Lines changed: 27 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,27 @@`
	`1`	`+# StackOverflow Question Scraper`
	`2`	`+`
	`3`	`+## Aim`
	`4`	`+`
	`5`	`+The main aim of the project is to scrape 50 questions from StackOverflow and store them in a serialized format such as a JSON file.`
	`6`	`+`
	`7`	`+## Purpose`
	`8`	`+`
	`9`	`+The purpose of the project is to give users a fast way to see the top questions for a given tag.`
	`10`	`+`
	`11`	`+## Setup instructions`
	`12`	`+`
	`13`	`+- The script uses BeautifulSoup to scrape content from the website.`
	`14`	+- To avoid version mismatches, run `pip install -r requirements.txt` in your terminal.
	`15`	+- After installing the dependencies, run `python scrape.py`.
	`16`	`+- Enter the tag you want to scrape and the filter number, and you are good to go.`
	`17`	`+`
	`18`	`+`
	`19`	`+## Output`
	`20`	`+`
	`21`	`+![](./images/execution.png)`
	`22`	`+<br/><br/><br/>`
	`23`	`+![](./images/ouput.png)`
	`24`	`+`
	`25`	`+## Author`
	`26`	`+`
	`27`	`+[Vivek Kumar Singh](https://github.com/vivekthedev)`
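
The setup steps above ask for a tag and a filter number. A minimal sketch of how that input could be turned into a listing URL follows, assuming the menu numbers are meant to be 1-based as printed. The helper names `choose_filter` and `build_url` are hypothetical and not part of this commit; the filter labels and URL format are copied from `scrape.py`.

```python
from urllib.parse import quote

# Filter labels and URL format copied from scrape.py; helper names are hypothetical.
FILTERS = ["Newest", "Active", "Bounties", "Unanswered", "Frequent", "Votes"]
URL_FORMAT = "https://stackoverflow.com/questions/tagged/{tag}?tab={filter}&pagesize=15"


def choose_filter(raw_choice, default="Votes"):
    """Map a 1-based menu choice to a filter name, falling back to the default."""
    try:
        index = int(raw_choice) - 1
    except ValueError:
        # Non-numeric input: use the default instead of crashing.
        return default
    if 0 <= index < len(FILTERS):
        return FILTERS[index]
    return default


def build_url(tag, raw_choice):
    """Build the listing URL for a tag and a menu choice."""
    # quote() percent-encodes characters such as '#' in a tag like "c#".
    return URL_FORMAT.format(tag=quote(tag, safe=""), filter=choose_filter(raw_choice))


print(build_url("python", "1"))
```

Keeping the lookup in a small function makes the out-of-range and non-numeric cases explicit, so they can be checked without any network access.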

`‎WebScrapingScripts/StackOverflow Question Scraper/images/execution.png`

30.3 KB

`‎WebScrapingScripts/StackOverflow Question Scraper/images/ouput.png`

185 KB

`‎WebScrapingScripts/StackOverflow Question Scraper/requirements.txt`

Lines changed: 8 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,8 @@`
	`1`	`+beautifulsoup4==4.11.1`
	`2`	`+bs4==0.0.1`
	`3`	`+certifi==2022.9.24`
	`4`	`+charset-normalizer==2.1.1`
	`5`	`+idna==3.4`
	`6`	`+requests==2.28.1`
	`7`	`+soupsieve==2.3.2.post1`
	`8`	`+urllib3==1.26.12`

`‎WebScrapingScripts/StackOverflow Question Scraper/scrape.py`

Lines changed: 65 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,65 @@`
	`1`	`+from bs4 import BeautifulSoup`
	`2`	`+import requests`
	`3`	`+import json`
	`4`	`+`
	`5`	`+`
	`6`	`+fmt = "https://stackoverflow.com/questions/tagged/{tag}?tab={filter}&pagesize=15"`
	`7`	`+filters = [`
	`8`	`+    "1. Newest",`
	`9`	`+    "2. Active",`
	`10`	`+    "3. Bounties",`
	`11`	`+    "4. Unanswered",`
	`12`	`+    "5. Frequent",`
	`13`	`+    "6. Votes",`
	`14`	`+]`
	`15`	`+`
	`16`	`+tag = input("enter any question tag (python, java)\n")`
	`17`	`+print("\n".join(filters))`
	`18`	`+filter = int(input("enter the filter number (1, 3, 5)\n"))`
	`19`	`+`
	`20`	`+try:`
	`21`	`+    filter = filters[filter - 1].split(" ")[-1]`
	`22`	`+except IndexError:`
	`23`	`+    filter = "Votes"`
	`24`	`+`
	`25`	`+# generate dynamic URL with user preferences`
	`26`	`+URL = fmt.format(tag=tag, filter=filter)`
	`27`	`+`
	`28`	`+print("generated URL ", URL)`
	`29`	`+content = requests.get(URL).content`
	`30`	`+`
	`31`	`+soup = BeautifulSoup(content, "html.parser")`
	`32`	`+`
	`33`	`+# return True only for tags whose id marks a question summary`
	`34`	`+def is_question(tag):`
	`35`	`+    try:`
	`36`	`+        return tag.get("id").startswith("question-summary-")`
	`37`	`+    except AttributeError:`
	`38`	`+        return False`
	`39`	`+`
	`40`	`+`
	`41`	`+questions = soup.find_all(is_question)`
	`42`	`+question_data = []`
	`43`	`+if questions:`
	`44`	`+    # extract question data like votes, title, link and date`
	`45`	`+    for question in questions:`
	`46`	`+        question_dict = {}`
	`47`	`+        question_dict["votes"] = (`
	`48`	`+            question.find(class_="s-post-summary--stats-item-number").get_text().strip()`
	`49`	`+        )`
	`50`	`+        h3 = question.find(class_="s-post-summary--content-title")`
	`51`	`+        question_dict["title"] = h3.get_text().strip()`
	`52`	`+        question_dict["link"] = "https://stackoverflow.com" + h3.find("a").get("href")`
	`53`	`+        question_dict["date"] = (`
	`54`	`+            question.find(class_="s-user-card--time").span.get_text().strip()`
	`55`	`+        )`
	`56`	`+        question_data.append(question_dict)`
	`57`	`+`
	`58`	`+    with open(f"questions-{tag}.json", "w") as f:`
	`59`	`+        json.dump(question_data, f)`
	`60`	`+`
	`61`	`+    print("file exported")`
	`62`	`+`
	`63`	`+else:`
	`64`	`+    print(URL)`
	`65`	`+    print("looks like there are no questions matching your tag ", tag)`
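
The export step at the end of `scrape.py` writes a list of dictionaries to `questions-<tag>.json`. The sketch below illustrates that step in isolation and reads the file back to check the round trip. The function name `export_questions`, the `directory` parameter, and the sample record are assumptions made for illustration; only the four keys (`votes`, `title`, `link`, `date`) and the file name pattern come from the script.

```python
import json
import tempfile
from pathlib import Path


def export_questions(question_data, tag, directory="."):
    """Write the records to questions-<tag>.json and return the file path."""
    path = Path(directory) / f"questions-{tag}.json"
    # encoding="utf-8" with ensure_ascii=False keeps non-ASCII titles readable.
    # indent=2 is a readability choice, not something the original script does.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(question_data, f, ensure_ascii=False, indent=2)
    return path


# Made-up sample record with the same four keys the script collects.
sample = [
    {
        "votes": "3",
        "title": "Example question title",
        "link": "https://stackoverflow.com/questions/0/example",
        "date": "Jan 1, 2023 at 0:00",
    }
]

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    exported = export_questions(sample, "python", directory=tmp)
    loaded = json.loads(exported.read_text(encoding="utf-8"))
    print(exported.name, len(loaded), loaded[0]["title"])
```

Note that `votes` stays a string here, as in the script, because the value is taken from the page text without conversion.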

0 commit comments
