Commit 8a9853c (2 parents: 2328db2 + 0171cee)

2 files changed (+84, -0 lines)
First file: 27 additions, 0 deletions
# Web Scraping with Beautiful Soup

This script scrapes a CodeChef problem statement webpage using the Beautiful Soup library in Python.
## Description

The Python script uses the `requests` and `BeautifulSoup` libraries to extract information from a CodeChef problem statement webpage. It demonstrates the following actions (a minimal sketch of the core fetch-and-parse pattern follows the list):
- Printing the title of the webpage.
- Finding and printing all links on the page.
- Extracting text from paragraphs.
- Extracting image URLs.
- Counting and categorizing HTML tags.
- Filtering and printing valid links.
- Saving extracted data to a text file.
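As a quick illustration, here is a minimal sketch of the fetch-and-parse pattern the script is built on (same URL and parser as the full script; the `raise_for_status()` call and the title guard are robustness additions, not part of the original):

```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.codechef.com/problems/TWORANGES?tab=statement'
response = requests.get(url)
response.raise_for_status()  # fail fast on HTTP errors

soup = BeautifulSoup(response.content, 'html.parser')
print(soup.title.text if soup.title else '(no title)')
```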
## Prerequisites

Ensure you have the following libraries installed:

- `requests`
- `beautifulsoup4`

You can install both with a single command:

```bash
pip install requests beautifulsoup4
```
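## Usage

Run the script with Python 3 (`scraper.py` is a placeholder here; substitute the script's actual filename):

```bash
python scraper.py
```

The extracted data is printed to the console and also saved to `webpage_data.txt`.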
Second file: 57 additions, 0 deletions

```python
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.codechef.com/problems/TWORANGES?tab=statement'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Print the title of the webpage
print(f"Title: {soup.title.text}\n")

# Find and print all links on the page
print("Links on the page:")
for link in soup.find_all('a'):
    print(link.get('href'))

# Extract text from paragraphs
print("\nText from paragraphs:")
for paragraph in soup.find_all('p'):
    print(paragraph.text)
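
# Aside (not in the original script): paragraph.get_text(strip=True) would trim
# leading/trailing whitespace if the printed text looks ragged.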

# Extract image URLs
print("\nImage URLs:")
for img in soup.find_all('img'):
    img_url = img.get('src')
    if img_url:
        print(img_url)
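
# Aside (not in the original script): src values are often relative;
# urllib.parse.urljoin(url, img_url) would resolve them to absolute URLs.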

# Count and categorize tags
print("\nTag counts:")
tag_counts = {}
for tag in soup.find_all():
    tag_name = tag.name
    if tag_name:
        tag_counts[tag_name] = tag_counts.get(tag_name, 0) + 1

for tag, count in tag_counts.items():
    print(f"{tag}: {count}")

# Filter and print valid links
print("\nValid links:")
for link in soup.find_all('a'):
    href = link.get('href')
    if href and re.match(r'^https?://', href):
        print(href)
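
# Note: re.match already anchors at the start of the string, so the '^' in the
# pattern is redundant (though harmless); relative hrefs are skipped by this filter.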

# Save data to a file (encoding='utf-8' avoids platform-dependent encoding
# errors when the scraped text contains non-ASCII characters)
with open('webpage_data.txt', 'w', encoding='utf-8') as file:
    file.write(f"Title: {soup.title.text}\n\n")
    file.write("Links on the page:\n")
    for link in soup.find_all('a'):
        file.write(f"{link.get('href')}\n")
    file.write("\nText from paragraphs:\n")
    for paragraph in soup.find_all('p'):
        file.write(f"{paragraph.text}\n")

print("\nData saved to 'webpage_data.txt'")
```
