Not actually scraping/no response #507

Closed Unanswered
Chris-421 asked this question in Q&A

Hi, I am starting out with this project to scrape some data from the following website: jumbo.com. However, I am not getting the expected response. The code is basically this tutorial, with the only changes being the addition of headless: False and a different link and prompt.

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3.1",
        "temperature": 0,
        "format": "json",
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
    "headless": False,
    "loader_kwargs": {
        "proxy": {
            "server": "broker",
            "criteria": {
                "anonymous": True,
                "secure": True,
                "countryset": {"IT"},
                "timeout": 10.0,
                "max_shape": 3,
            },
        },
    },
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="list me all categories of the products and corresponding links to these categories",
    source="https://www.jumbo.com/producten/",
    config=graph_config,
)
# Run the scraper graph
result = smart_scraper_graph.run()
print("Scraper Result:", result)
graph_exec_info = smart_scraper_graph.get_execution_info()
print(graph_exec_info)

However, this does not generate the expected response. Instead of the expected list of product categories and web links, I get:
--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.jumbo.com/producten/) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Processing chunks: 0%| | 0/1 [00:28<?, ?it/s]
Scraper Result: {'type': 'accordion', 'title': 'Openingstijden', 'content': 'https://www.jumbo.com/winkels'}
exec_info: [{'node_name': 'Fetch', 'total_tokens': 0, 'prompt_tokens': 0, 'completion_tokens': 0, 'successful_requests': 0, 'total_cost_USD': 0.0, 'exec_time': 83.95596218109131}, {'node_name': 'Parse', 'total_tokens': 0, 'prompt_tokens': 0, 'completion_tokens': 0, 'successful_requests': 0, 'total_cost_USD': 0.0, 'exec_time': 0.00398707389831543}, {'node_name': 'GenerateAnswer', 'total_tokens': 0, 'prompt_tokens': 0, 'completion_tokens': 0, 'successful_requests': 0, 'total_cost_USD': 0.0, 'exec_time': 28.795868158340454}, {'node_name': 'TOTAL RESULT', 'total_tokens': 0, 'prompt_tokens': 0, 'completion_tokens': 0, 'successful_requests': 0, 'total_cost_USD': 0.0, 'exec_time': 112.75581741333008}]

I also tested the tutorial itself (i.e. the original prompt and link), which only returns a single video:
Scraper Result: {'type': 'video', 'title': 'Tech Support: Pyrotechnician Answers Fireworks Questions From Twitter', 'description': 'WIRED is where tomorrow is realized. It is the essential source of information and ideas that make sense of a world in constant transformation.', 'url': 'https://www.wired.com/video/watch/tech-support-pyrotechnician-answers-fireworks-questions-from-twitter'}
together with a similar exec_info showing 0 tokens.
What am I doing wrong? Is the code incorrect, or is my LLM setup not working?
PS: from a similar discussion I found out it might be due to blockers, so I tried other sites, including Wikipedia. However, the results still did not match the prompt or the tutorial's. Additionally, these blockers should theoretically be circumvented by the proxy and headless: False, right?
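
For reference, a quick way to check whether the local Ollama model responds at all, independent of ScrapeGraphAI, is to call its HTTP API directly. This is only a minimal sanity-check sketch, assuming Ollama is running on localhost:11434 and llama3.1 has already been pulled:

import requests

# Ask the local Ollama server for a single non-streamed completion.
# Assumption: Ollama listens on localhost:11434 and llama3.1 is pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": "Say hello in one word.", "stream": False},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["response"])

If that prints a reply, the model itself seems fine and the problem is more likely on the fetching/parsing side.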

Replies: 1 comment

Ok, please update to the new version.
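For example, assuming a pip-based install, something like pip install --upgrade scrapegraphai should pull in the latest release.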
