Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.
![architecture](img/architecture.png)
```bash
pip install gain
pip install uvloop  # Linux only
```
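uvloop is an optional speedup. Since gain is written with uvloop, it may wire this up internally; if you want your own asyncio scripts to use it as well, a minimal sketch following uvloop's documented pattern:

```python
import asyncio

import uvloop

# Make asyncio create uvloop-backed event loops instead of the default ones.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
```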
```python
from gain import Css, Item, Parser, Spider
import aiofiles


class Post(Item):
    # CSS selectors for the fields to extract from each matched page.
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append the extracted title to a file without blocking the event loop.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # The first rule follows pagination pages; the second matches article
    # URLs and parses each one into a Post item.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
```
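The `save()` coroutine can do whatever you like with the extracted fields. For instance, a sketch of a variant item (`FullPost` is hypothetical) that persists both declared fields, assuming `self.results` maps each selector name to its extracted text, as the title-only example above suggests:

```python
import aiofiles
from gain import Css, Item


class FullPost(Item):
    # Same selectors as the Post item above.
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Assumption: self.results holds both 'title' and 'content',
        # keyed by the selector names declared on this item.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write('%s\n%s\n\n' % (self.results['title'],
                                          self.results['content']))
```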
Or use XPathParser:
```python
from gain import Css, Item, Spider, XPathParser


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    # The first two rules extract category and pagination links to follow;
    # the third extracts post links and parses each page into a Post item.
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
    ]
    proxy = 'https://localhost:1234'


MySpider.run()
```
You can add a proxy setting to the spider via the `proxy` class attribute, as shown above.
Run `python spider.py`.
Result:
![sample](img/sample.png)
The examples are in the `/example/` directory.