ScraperHub/web-scraper-with-gemini-ai

πŸ€– Gemini AI Web Scraper with Python

This repository shows how to build a Gemini-powered web scraper in Python, using an LLM to extract structured data from complex web pages β€” without writing custom parsing logic.

πŸ“– Read the full tutorial β†’ How to Leverage Gemini AI for Web Scraping

✨ What It Does

  • Fetches HTML from any public webpage
  • Converts HTML to Markdown using markdownify
  • Sends it to Gemini AI with a natural language prompt
  • Extracts structured data in JSON format

🧰 Tech Stack

  • google-generativeai – Gemini API for LLM-powered parsing
  • requests – For basic HTTP requests (if not using a proxy)
  • beautifulsoup4 – For basic HTML parsing (optional)
  • markdownify – Converts HTML into cleaner Markdown
  • python-dotenv – For managing API keys and environment variables

πŸ“¦ Installation

  1. Clone this repo:

```shell
git clone https://github.com/yourusername/gemini-ai-web-scraper.git
cd gemini-ai-web-scraper
```

  2. Install dependencies:

```shell
pip install google-generativeai python-dotenv requests beautifulsoup4 markdownify
```

  3. Add your Gemini API key in the script or as an environment variable.
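If you go the environment-variable route, it helps to fail fast when the key is missing. A minimal sketch β€” `GEMINI_API_KEY` is an assumed variable name; a `.env` file loaded with python-dotenv's `load_dotenv()` would populate the same mapping:

```python
import os

def require_api_key(env: dict) -> str:
    """Fetch the Gemini key from an environment mapping; fail fast if missing."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set GEMINI_API_KEY before running the scraper.")
    return key

# Demo with a fake mapping; in the real script pass os.environ.
print(require_api_key({"GEMINI_API_KEY": "demo-key-123"}))  # demo-key-123
```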

πŸš€ Scale Scraping with Crawlbase Smart Proxy

Web scraping with Gemini AI can hit blocks, CAPTCHAs, and anti-bot systems. Crawlbase Smart Proxy solves that.

βœ… Why Use It?

  • Avoid IP blocks with automatic rotation
  • Bypass CAPTCHAs seamlessly
  • Skip proxy management
  • Get clean, parsed HTML for better AI input

πŸ”§ Example Usage

```python
import time

import requests

# Route traffic through Crawlbase Smart Proxy
# (replace _USER_TOKEN_ with your token)
proxy_url = "http://_USER_TOKEN_@smartproxy.crawlbase.com:8012"
proxies = {"http": proxy_url, "https": proxy_url}

url = "https://example.com/protected-page"
time.sleep(2)  # Mimic human behavior
# verify=False skips TLS verification, since the proxy intercepts HTTPS traffic
response = requests.get(url, proxies=proxies, verify=False)
print(response.text)
```

Replace _USER_TOKEN_ with your Crawlbase Smart Proxy token. You can get one by signing up on Crawlbase.
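Even with a proxy, individual requests can still fail transiently, so a small retry wrapper around the fetch is worth having. This is a sketch, not part of the repo; in practice `fetch` would be something like `lambda u: requests.get(u, proxies=proxies, verify=False)`:

```python
import time

def fetch_with_retry(fetch, url, retries=3, backoff=1.0):
    """Call `fetch(url)`, retrying with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts; surface the last error
            time.sleep(backoff * (2 ** attempt))

# Demo with a flaky stand-in for requests.get
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("blocked")
    return f"OK:{url}"

print(fetch_with_retry(flaky, "https://example.com", backoff=0.01))
```

The demo succeeds on the third attempt; tune `retries` and `backoff` to how aggressively the target site blocks.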
