Hi, I am starting out with this project and want to scrape some data from the following website: jumbo.com. However, I am not getting the expected response. The code is essentially the tutorial example; I only added headless: False and changed both the link and the prompt.
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3.1",
        "temperature": 0,
        "format": "json",
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
    "headless": False,
    "loader_kwargs": {
        "proxy": {
            "server": "broker",
            "criteria": {
                "anonymous": True,
                "secure": True,
                "countryset": {"IT"},
                "timeout": 10.0,
                "max_shape": 3,
            },
        },
    },
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="list me all categories of the products and corresponding links to these categories",
    source="https://www.jumbo.com/producten/",
    config=graph_config,
)

# Run the scraper graph
result = smart_scraper_graph.run()
print("Scraper Result:", result)

graph_exec_info = smart_scraper_graph.get_execution_info()
print(graph_exec_info)
However, this does not produce the expected response. Instead of a list of product categories and web links, I get:
--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.jumbo.com/producten/) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Processing chunks: 0%| | 0/1 [00:28<?, ?it/s]
Scraper Result: {'type': 'accordion', 'title': 'Openingstijden', 'content': 'https://www.jumbo.com/winkels'}
exec_info: [{'node_name': 'Fetch', 'total_tokens': 0, 'prompt_tokens': 0, 'completion_tokens': 0, 'successful_requests': 0, 'total_cost_USD': 0.0, 'exec_time': 83.95596218109131}, {'node_name': 'Parse', 'total_tokens': 0, 'prompt_tokens': 0, 'completion_tokens': 0, 'successful_requests': 0, 'total_cost_USD': 0.0, 'exec_time': 0.00398707389831543}, {'node_name': 'GenerateAnswer', 'total_tokens': 0, 'prompt_tokens': 0, 'completion_tokens': 0, 'successful_requests': 0, 'total_cost_USD': 0.0, 'exec_time': 28.795868158340454}, {'node_name': 'TOTAL RESULT', 'total_tokens': 0, 'prompt_tokens': 0, 'completion_tokens': 0, 'successful_requests': 0, 'total_cost_USD': 0.0, 'exec_time': 112.75581741333008}]
I also tested the tutorial itself (i.e. the original prompt and link), which returns only a single video:
Scraper Result: {'type': 'video', 'title': 'Tech Support: Pyrotechnician Answers Fireworks Questions From Twitter', 'description': 'WIRED is where tomorrow is realized. It is the essential source of information and ideas that make sense of a world in constant transformation.', 'url': 'https://www.wired.com/video/watch/tech-support-pyrotechnician-answers-fireworks-questions-from-twitter'}
The exec info is similar, again with 0 tokens.
What am I doing wrong? Is the code incorrect, is my LLM setup not working, or is it something else?
PS: From a similar discussion I found out that this might be due to blockers, so I tried other sites, including Wikipedia. However, the results still did not match the prompt or the tutorial's output. Additionally, these blockers should theoretically be circumvented by the proxy and headless: False, right?
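One way to narrow down the "is my LLM setup not working" part is to check, independently of the scraper, that the local Ollama server at the configured base_url responds and has the models pulled. The following is only a minimal sketch: the helper name list_ollama_models is made up for illustration, and it assumes the server exposes Ollama's /api/tags endpoint for listing local models.

```python
import http.client
import json
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=5.0):
    """Return the model names reported by the Ollama server, or None if it cannot be reached.

    Hypothetical helper for illustration; it assumes the server exposes /api/tags.
    """
    url = base_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a body that is not JSON.
        return None
    if not isinstance(payload, dict):
        return None
    return [model.get("name", "") for model in payload.get("models", [])]
```

If list_ollama_models() returns None, the server is not reachable at that address. If it returns a list that contains neither a llama3.1 nor a nomic-embed-text entry, the models from graph_config have probably not been pulled yet.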
Replies: 1 comment
OK, please update to the new version.
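To see which version is currently installed before updating, something like the sketch below could be used. It relies only on the standard library; the helper name installed_version is hypothetical, and it assumes the package is distributed under the name scrapegraphai.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Assumption: the distribution name is "scrapegraphai".
print("scrapegraphai version:", installed_version("scrapegraphai"))
```

The update itself would typically be done with pip install --upgrade scrapegraphai, after which the check can be run again to confirm the new version.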