ScraperHub/web-scraper-with-gemini-ai

πŸ€– Gemini AI Web Scraper with Python

This repository shows how to build a Gemini-powered web scraper in Python, using an LLM to extract structured data from complex web pages β€” without writing custom parsing logic.

πŸ“– Read the full tutorial β†’ How to Leverage Gemini AI for Web Scraping

✨ What It Does

  • Fetches HTML from any public webpage
  • Converts HTML to Markdown using markdownify
  • Sends it to Gemini AI with a natural language prompt
  • Extracts structured data in JSON format

🧰 Tech Stack

  • google-generativeai – Gemini API for LLM-powered parsing
  • requests – For basic HTTP requests (if not using a proxy)
  • beautifulsoup4 – For basic HTML parsing (optional)
  • markdownify – Converts HTML into cleaner Markdown
  • python-dotenv – For managing API keys and environment variables

πŸ“¦ Installation

  1. Clone this repo:

```shell
git clone https://github.com/yourusername/gemini-ai-web-scraper.git
cd gemini-ai-web-scraper
```

  2. Install dependencies:

```shell
pip install google-generativeai python-dotenv requests beautifulsoup4 markdownify
```

  3. Add your Gemini API key in the script or as an environment variable.
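If you go the environment-variable route, it helps to fail fast when the key is missing. A minimal sketch β€” `GEMINI_API_KEY` is an assumed variable name; a `.env` file loaded with python-dotenv's `load_dotenv()` would populate the same mapping:

```python
import os

def require_api_key(env: dict) -> str:
    """Fetch the Gemini key from an environment mapping; fail fast if missing."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set GEMINI_API_KEY before running the scraper.")
    return key

# Demo with a fake mapping; in the real script pass os.environ.
print(require_api_key({"GEMINI_API_KEY": "demo-key-123"}))  # demo-key-123
```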

πŸš€ Scale Scraping with Crawlbase Smart Proxy

Web scraping with Gemini AI can hit blocks, CAPTCHAs, and anti-bot systems. Crawlbase Smart Proxy solves that.

βœ… Why Use It?

  • Avoid IP blocks with automatic rotation
  • Bypass CAPTCHAs seamlessly
  • Skip proxy management
  • Get clean, parsed HTML for better AI input

πŸ”§ Example Usage

```python
import time

import requests

# Route traffic through Crawlbase Smart Proxy
# (replace _USER_TOKEN_ with your token)
proxy_url = "http://_USER_TOKEN_@smartproxy.crawlbase.com:8012"
proxies = {"http": proxy_url, "https": proxy_url}

url = "https://example.com/protected-page"
time.sleep(2)  # Mimic human behavior
# verify=False skips TLS verification, since the proxy intercepts HTTPS traffic
response = requests.get(url, proxies=proxies, verify=False)
print(response.text)
```

Replace _USER_TOKEN_ with your Crawlbase Smart Proxy token. You can get one by signing up on Crawlbase.
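Even with a proxy, individual requests can still fail transiently, so a small retry wrapper around the fetch is worth having. This is a sketch, not part of the repo; in practice `fetch` would be something like `lambda u: requests.get(u, proxies=proxies, verify=False)`:

```python
import time

def fetch_with_retry(fetch, url, retries=3, backoff=1.0):
    """Call `fetch(url)`, retrying with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts; surface the last error
            time.sleep(backoff * (2 ** attempt))

# Demo with a flaky stand-in for requests.get
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("blocked")
    return f"OK:{url}"

print(fetch_with_retry(flaky, "https://example.com", backoff=0.01))
```

The demo succeeds on the third attempt; tune `retries` and `backoff` to how aggressively the target site blocks.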
