gazpacho

🥫 The simple, fast, and modern web scraping library


Popularity: 4.1 (Growing)

Activity: 3.2 (Stable)

Stars: 771

Watchers: 16

Forks: 56

Last commit: about 2 years ago

Description

gazpacho is a web scraping library. It replaces requests and BeautifulSoup for most projects. gazpacho is small, simple, fast, and consistent. You should use it!

Programming language: Python

License: MIT License

Tags: HTTP HTML Manipulation Web Crawling Web Scraping Scraping Beautifulsoup Requests

Latest version: v1.1

gazpacho alternatives and similar packages

Based on the "HTML Manipulation" category.

xmltodict (popularity 8.0, activity 8.6, code quality L4)
Python module that makes working with XML feel like you are working with JSON

lxml (popularity 7.0, activity 9.5, code quality L2)
The lxml XML toolkit for Python

xhtml2pdf (popularity 6.8, activity 5.6, code quality L1)
A library for converting HTML into PDFs using ReportLab

bleach (popularity 6.4, activity 6.4, code quality L4)
Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes

pyquery (popularity 6.2, activity 5.7, code quality L3)
A jQuery-like library for Python

html5lib (popularity 5.3, activity 4.1, code quality L2)
Standards-compliant library for parsing and serializing HTML documents and fragments in Python

selectolax (popularity 5.1, activity 9.3)
Python binding to Modest and Lexbor engines. Fast HTML5 parser with CSS selectors for Python.

MarkupSafe (popularity 4.3, activity 7.0, code quality L5)
Safely add untrusted strings to HTML/XML markup.

untangle (popularity 4.0, activity 2.7, code quality L5)
Converts XML to Python objects

xmldataset (popularity 1.8, activity 0.0, code quality L1)
xmldataset: xml parsing made easy 🗃️

cssutils (popularity 1.6, activity not listed)
A CSS library for Python.

BeautifulSoup (no scores listed)
Providing Pythonic idioms for iterating, searching, and modifying HTML or XML.

* Code Quality Rankings and insights are calculated and provided by Lumnify.
They vary from L1 to L5 with "L5" being the highest.


README

About

gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies.

Install

Install with pip at the command line:

pip install -U gazpacho

Quickstart

Give this a try:

from gazpacho import get, Soup

url = 'https://scrape.world/books'
html = get(url)
soup = Soup(html)
books = soup.find('div', {'class': 'book-'}, partial=True)

def parse(book):
    name = book.find('h4').text
    price = float(book.find('p').text[1:].split(' ')[0])
    return name, price

[parse(book) for book in books]
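
The price expression in parse is dense, so here is a step-by-step sketch of what it does to a string. The sample text '$12.50 CAD' is an assumption made for illustration; the text on the real page may be formatted differently.

```python
# Hypothetical price text; the real page content may differ.
text = "$12.50 CAD"

# Drop the first character (here, the currency symbol).
without_symbol = text[1:]

# Keep only what precedes the first space.
first_token = without_symbol.split(" ")[0]

# Convert the remaining digits to a number.
price = float(first_token)

print(price)
```

If the text does not start with a single-character symbol, or the number is not followed by a space, the slicing and splitting would need to be adjusted.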

Tutorial

Import

Import gazpacho following the convention:

from gazpacho import get, Soup

get

Use the get function to download raw HTML:

url = 'https://scrape.world/soup'
html = get(url)
print(html[:50])
# '<!DOCTYPE html>\n<html lang="en">\n <head>\n <met'

Adjust get requests with optional params and headers:

get(
    url='https://httpbin.org/anything',
    params={'foo': 'bar', 'bar': 'baz'},
    headers={'User-Agent': 'gazpacho'}
)
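
To get a feel for what params means for the request, the sketch below builds an equivalent URL with the standard library's urllib.parse.urlencode. It does not call gazpacho, and gazpacho's own encoding may differ in details; build_url is a hypothetical helper written for this illustration.

```python
from urllib.parse import urlencode

def build_url(url, params=None):
    """Hypothetical helper: append params to url as a query string."""
    if not params:
        return url
    return f"{url}?{urlencode(params)}"

full_url = build_url(
    "https://httpbin.org/anything",
    params={"foo": "bar", "bar": "baz"},
)
print(full_url)
# https://httpbin.org/anything?foo=bar&bar=baz
```

Headers travel separately from the URL, so they do not appear in the query string.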

Soup

Use the Soup wrapper on raw html to enable parsing:

soup = Soup(html)

Soup objects can alternatively be initialized with the .get classmethod:

soup = Soup.get(url)

.find

Use the .find method to target and extract HTML tags:

h1 = soup.find('h1')
print(h1)
# <h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>

attrs=

Use the attrs argument to isolate tags that contain specific HTML element attributes:

soup.find('div', attrs={'class': 'section-'})

partial=

Element attributes are partially matched by default. Turn this off by setting partial to False:

soup.find('div', {'class': 'soup'}, partial=False)
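
The difference between the two settings can be pictured as substring matching versus equality. The sketch below is a conceptual illustration only, not gazpacho's implementation, which may treat multi-valued attributes such as class differently; attrs_match and the sample attributes are hypothetical.

```python
def attrs_match(wanted, actual, partial=True):
    """Hypothetical helper: do a tag's `actual` attrs satisfy `wanted`?"""
    for key, value in wanted.items():
        if key not in actual:
            return False
        if partial:
            # Partial: the wanted value only has to appear inside the actual one.
            if value not in actual[key]:
                return False
        elif value != actual[key]:
            # Exact: the whole attribute value must be equal.
            return False
    return True

# Made-up attributes for a single tag.
tag_attrs = {"class": "soup-section"}

print(attrs_match({"class": "soup"}, tag_attrs, partial=True))   # True
print(attrs_match({"class": "soup"}, tag_attrs, partial=False))  # False
```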

mode=

Override the mode argument {'auto', 'first', 'all'} to guarantee return behaviour:

print(soup.find('span', mode='first'))
# <span class="navbar-toggler-icon"></span>
len(soup.find('span', mode='all'))
# 8
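
The sketch below spells out one reading of the three modes: 'all' always returns a list, 'first' returns a single match or None, and 'auto' returns None, a single match, or a list depending on how many tags were found. This is an assumption about the behaviour rather than gazpacho's actual code, so check it against the library docs; triage is a hypothetical helper.

```python
def triage(found, mode="auto"):
    """Hypothetical helper: shape a list of matches according to mode."""
    if mode == "all":
        return found  # always a list, possibly empty
    if mode == "first":
        return found[0] if found else None
    if mode == "auto":
        if not found:
            return None
        # One match comes back bare, several come back as a list.
        return found[0] if len(found) == 1 else found
    raise ValueError(f"unknown mode: {mode!r}")

print(triage(["<span>"], mode="auto"))            # <span>
print(triage(["<span>", "<span>"], mode="auto"))  # ['<span>', '<span>']
print(triage(["<span>"], mode="all"))             # ['<span>']
```

Under this reading, passing 'first' or 'all' avoids having to check whether the result is a single object or a list.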

dir()

Soup objects have html, tag, attrs, and text attributes:

dir(h1)
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']

Use them accordingly:

print(h1.html)
# '<h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>'
print(h1.tag)
# h1
print(h1.attrs)
# {'id': 'firstHeading', 'class': 'firstHeading', 'lang': 'en'}
print(h1.text)
# Soup

Support

If you use gazpacho, consider adding the scraper: gazpacho badge to your project README.md:

[![scraper: gazpacho](https://img.shields.io/badge/scraper-gazpacho-C6422C)](https://github.com/maxhumber/gazpacho)

Contribute

For feature requests or bug reports, please use GitHub Issues.

For PRs, please read the CONTRIBUTING.md document.
