32,821 questions
2
votes
2
answers
100
views
BeautifulSoup - Extracting content blocks after specific subheadings within a larger section, ignoring document introduction
I am scraping the Dead by Daylight Fandom wiki (specifically TOME pages, e.g., https://deadbydaylight.fandom.com/wiki/Tome_1_-_Awakening) to extract memory logs.
The goal is to extract the Memory ...
0
votes
2
answers
219
views
Beautiful Soup, children are clearly inside but can't get it
From the structure below I only want the value of the href attribute, but rec_block returns the h5 element without its children, so basically <h5 class="series">Recommendations</h5>.
<...
-5
votes
1
answer
101
views
What is missing in selenium code to get complete html code to use with beautifulsoup?
I've recently learned how to scrape web pages with BeautifulSoup, and now I'm trying to learn a bit about Selenium because I couldn't get the correct info with BeautifulSoup alone. I think there is JavaScript ...
-1
votes
2
answers
70
views
Getting element using re.compile with bs4?
I am trying to find a span element using Selenium and bs4 with the following code:
import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import ...
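For context, a minimal sketch of how a compiled regular expression can be passed to bs4 when looking for a span. The HTML, class names, and patterns here are hypothetical and are not taken from the question; Selenium is left out so the snippet runs on a static string.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the page source Selenium would return.
html = (
    '<div>'
    '<span class="price-123">Total: 42 USD</span>'
    '<span class="label">Other</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Match on the span's text: bs4 applies the pattern with search(),
# so a partial match anywhere in the string is enough.
by_text = soup.find("span", string=re.compile(r"Total"))

# Match on the class attribute instead of the text.
by_class = soup.find("span", class_=re.compile(r"^price-"))
```

Note that `string=` only matches a tag whose sole child is a single text node; a span containing nested tags would need a different approach, such as filtering on `get_text()`.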
3
votes
1
answer
171
views
How to clean inconsistent address strings in Python?
I'm working on a web scraping project in Python to collect data from a real estate website. I'm running into an issue with the addresses, as they are not always consistent.
I've already handled simple ...
3
votes
1
answer
62
views
Beautiful Soup; splitting a paragraph only by <br> where stripped_strings is not working
I'm rather new to Beautiful Soup and I'm having trouble splitting some HTML correctly by looking only at HTML line breaks and ignoring other elements, such as changes in font color.
The ...
1
vote
1
answer
243
views
Trouble scraping dynamic lottery results table – inconsistent parsing
I’ve been trying to scrape lottery results from a website that shows draws. The data is presented in a results table, but I keep running into strange issues where sometimes the numbers are captured ...
0
votes
2
answers
64
views
Get the attribute data by another attribute beautifulsoup
I want to parse HTML like the example below with Beautiful Soup:
.
.
<meta property="og:image" content="https://test.com/test.jp" />
<meta property="og:description" ...
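A minimal sketch of reading one attribute (content) from a tag selected by another attribute (property). The og:image tag is the one shown in the excerpt; the og:title tag and the helper name meta_content are hypothetical additions for illustration.

```python
from bs4 import BeautifulSoup

html = """
<head>
<meta property="og:image" content="https://test.com/test.jp" />
<meta property="og:title" content="Example title" />  <!-- hypothetical -->
</head>
"""

def meta_content(soup, prop):
    """Return the content of the <meta> tag whose property equals prop, or None."""
    tag = soup.find("meta", attrs={"property": prop})
    return tag.get("content") if tag else None

soup = BeautifulSoup(html, "html.parser")
image_url = meta_content(soup, "og:image")
```

Returning None for a missing tag avoids an AttributeError when a page omits one of the properties.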
-2
votes
1
answer
73
views
How to use Beautiful Soup to find partial links [closed]
I have an eBay page from which I would like to build a list of all the item numbers. I have executed and parsed the HTML content using requests and Beautiful Soup, but I can't figure ...
0
votes
2
answers
111
views
How to use index to find position of JSON record [closed]
Is there a better way than iteration using a for loop to find the index of the record?
My problem is that to use index I seem to need the index of the record I'm seeking.
import json
from bs4 import ...
4
votes
2
answers
287
views
How to reliably download 1969 "Gazzetta Ufficiale" PDFs (Italian Official Gazette) with Python?
I’m trying to programmatically download the full "pubblicazione completa non certificata" PDFs of the Italian Gazzetta Ufficiale – Serie Generale for 1969 (for an academic article). The site has a ...
-1
votes
2
answers
67
views
Puppeteer can't access var doc in javascript
I am trying to scrape a web page using Puppeteer; however, I can't access var doc with Puppeteer, although I can see it in the page source in my web browser.
var rows = [];
var i = 1;
/* while(i <= ...
0
votes
1
answer
187
views
How can I speed up my Selenium scraper using multiprocessing in Python? [closed]
I'm scraping a large list of URLs (1.2 million) using Selenium + BeautifulSoup with Python's multiprocessing.Pool. I want to scale it up to scrape faster, ideally without hitting system resource ...
-3
votes
1
answer
72
views
BeautifulSoup - reading multiple pages breaks after a random number of valid reads
I'm reading some data (book titles, etc.) from a number of pages.
Python 3.10.13
BeautifulSoup 4.12.3
Code:
def scrapSite(URL):
headers = {"User-Agent": "Mozilla/5.0 (Windows NT ...
1
vote
1
answer
68
views
How to use BeautifulSoup find_all() to get a class with multiple classes?
I am not sure if the terminology "class with multiple classes" is correct but that is the best I can describe it.
import requests
from bs4 import BeautifulSoup
url="https://curiosa.io/...
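As a rough illustration of how bs4 treats elements with several classes, here is a sketch using hypothetical markup and class names (card, featured); none of it comes from the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class attribute may hold several class names.
html = """
<div class="card featured">A</div>
<div class="card">B</div>
<div class="featured card">C</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ with a single name matches any element that has that class
# among its classes, so all three divs match.
any_card = soup.find_all("div", class_="card")

# A CSS selector requires both classes, in any order.
both = soup.select("div.card.featured")

# A multi-word class_ string is compared against the exact attribute
# value, so the order of the class names matters.
exact = soup.find_all("div", class_="card featured")
```

The CSS selector form is usually the more robust choice when the order of class names in the markup is not guaranteed.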