代码拉取完成,页面将自动刷新

扫描微信二维码支付

取消

支付完成

richgiteeai

Watch

不关注关注所有动态仅关注版本发行动态关注但不提醒动态

1 Star 0 Fork 30

橙子/PulsarRPA

forked from platon.AI/Browser4

代码 Issues 0 Pull Requests 0 Wiki 统计流水线

服务

加入 Gitee

与超过 1400万开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)

免费加入

已有帐号? 立即登录

克隆/下载

HTTPS SSH SVN SVN+SSH 下载ZIP

提示

下载代码请复制以下命令到终端执行

为确保你提交的代码身份被 Gitee 正确识别,请执行以下命令完成配置

git config --global user.name userName 
git config --global user.email userEmail

初次使用 SSH 协议进行代码克隆、推送等操作时,需按下述提示完成 SSH 配置

1 生成 RSA 密钥

2 获取 RSA 公钥内容,并配置到 SSH公钥中

在 Gitee 上使用 SVN,请访问使用指南

使用 HTTPS 协议时,命令行会出现如下账号密码验证步骤。基于安全考虑,Gitee 建议配置并使用私人令牌替代登录密码进行克隆、推送等操作

Username for 'https://gitee.com': userName

Password for 'https://userName@gitee.com': # 私人令牌

新建文件新建 Diagram 文件

新建子模块

上传文件

分支 41

标签 73

贡献代码

同步代码

创建 Pull Request

了解更多

对比差异通过 Pull Request 同步

同步更新到分支

通过 Pull Request 同步

将会在向当前分支创建一个 Pull
Request,合入后将完成同步

platon.AI Fix: 1. NavigateEntry's state should be sy... 6a17c4b

1942 次提交

.github/workflows

bin

docs

pulsar-ql

pulsar-spring-support

pulsar-third

pulsar-tools

.gitattributes

.gitignore

CHANGES.md

DEVPLANS.md

LICENSE

README-CN.adoc

README.adoc

RELEASE.md

VERSION

cloc.sh

pom.xml

README

AGPL-3.0

PulsarRPA 简介

Table of Contents

🚄 开始
🥁 简介
🚀 主要特性
♾ 核心概念
🧮 通过可执行 jar 使用 PulsarRPA
🎁 将 PulsarRPA 用作软件库
🌐 将 PulsarRPA 作为 REST 服务运行
- 从源代码构建
- 使用 X-SQL 查询 Web
📖 循序渐进的课程
📊 日志和指标
💻 系统要求
🛸 高级主题
🆚 同其他方案的对比
🤓 技术细节
🐦 联系方式

English | 简体中文 | 中国镜像

🚄 开始

💖 PulsarRPA 是你唯一的需要!💖

PulsarRPA 是大规模采集 Web 数据的终极开源方案,可满足几乎所有规模和性质的网络数据采集需要。

大规模提取 Web 数据非常困难。网站经常变化并且变得越来越复杂,这意味着收集的网络数据通常不准确或不完整,PulsarRPA 开发了一系列尖端技术来解决这些问题。

大多数抓取尝试可以从几乎一行代码开始:

Kotlin:

fun main() = PulsarContexts.createSession().scrapeOutPages(
 "https://www.amazon.com/", "-outLink a[href~=/dp/]", listOf("#title", "#acrCustomerReviewText"))

上面的代码从一组产品页面中抓取由 css 选择器 #title 和 #acrCustomerReviewText 指定的字段。示例代码:kotlin, java.

大多数生产环境数据采集项目可以从以下代码片段开始:

Kotlin:

fun main() {
 val context = PulsarContexts.create()

 val parseHandler = { _: WebPage, document: FeaturedDocument ->
 // use the document
 // ...
 // and then extract further hyperlinks
 context.submitAll(document.selectHyperlinks("a[href~=/dp/]"))
 }
 val urls = LinkExtractors.fromResource("seeds10.txt")
 .map { ParsableHyperlink("$it -refresh", parseHandler) }
 context.submitAll(urls).await()
}

示例代码: kotlin, java.

最复杂的数据采集项目需要使用 RPA:

Kotlin:

val options = session.options(args)
val event = options.event.browseEvent
event.onBrowserLaunched.addLast { page, driver ->
 // warp up the browser to avoid being blocked by the website,
 // or choose the global settings, such as your location.
 warnUpBrowser(page, driver)
}
event.onWillFetch.addLast { page, driver ->
 // have to visit a referrer page before we can visit the desired page
 waitForReferrer(page, driver)
 // websites may prevent us from opening too many pages at a time, so we should open links one by one.
 waitForPreviousPage(page, driver)
}
event.onWillCheckDocumentState.addLast { page, driver ->
 // wait for a special fields to appear on the page
 driver.waitForSelector("body h1[itemprop=name]")
 // close the mask layer, it might be promotions, ads, or something else.
 driver.click(".mask-layer-close-button")
}
// visit the URL and trigger events
session.load(url, options)

示例代码: kotlin.

最复杂的 Web 数据抽取难题需要用 X-SQL 来解决:

您的 Web 数据提取规则非常复杂,例如,每个单独的页面有 100 多个规则
需要维护的数据提取规则很多,比如全球 20 多个亚马逊网站,每个网站 20 多个数据类型

select
 dom_first_text(dom, '#productTitle') as title,
 dom_first_text(dom, '#bylineInfo') as brand,
 dom_first_text(dom, '#price tr td:matches(^Price) ~ td, #corePrice_desktop tr td:matches(^Price) ~ td') as price,
 dom_first_text(dom, '#acrCustomerReviewText') as ratings,
 str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
 from load_and_select('https://www.amazon.com/dp/B09V3KXJPB -i 1s -njr 3', 'body');

示例代码:

X-SQLs to scrape all types of Amazon webpages

🥁 简介

PulsarRPA 是大规模采集 Web 数据的终极开源方案,可满足几乎所有规模和性质的网络数据采集需要。

我们发布了一些最大型电商网站的全站数据采集的完整解决方案,这些解决方案满足最高标准的性能、质量和成本要求,他们将永久免费并开放源代码,譬如:

Exotic Walmart
预览版本。

🚀 主要特性

网络爬虫:各种数据采集模式,包括浏览器渲染、ajax数据采集、普通协议采集等
RPA:机器人流程自动化、模仿人类行为、采集单网页应用程序或执行其他有价值的任务
简洁的 API:一行代码抓取,或者一条 SQL 将整个网站栏目变成表格
X-SQL:扩展 SQL 来管理 Web 数据:网络爬取、数据采集、Web 内容挖掘、Web BI
爬虫隐身:浏览器驱动隐身,IP 轮换,隐私上下文轮换,永远不会被屏蔽
高性能:高度优化,单机并行渲染数百页而不被屏蔽
低成本:每天抓取 100,000 个浏览器渲染的电子商务网页,或 n * 10,000,000 个数据点,仅需要 8 核 CPU/32G 内存
数据质量保证:智能重试、精准调度、Web 数据生命周期管理
大规模采集:完全分布式,专为大规模数据采集而设计
大数据支持:支持各种后端存储:本地文件/MongoDB/HBase/Gora
日志和指标:密切监控并记录每个事件
[预览] 信息提取:自动学习网页数据模式,以显著的精度自动提取网页中的每一个字段

♾ 核心概念

PulsarRPA 的核心概念包括以下内容,了解了这些核心概念,您可以使用 PulsarRPA 解决最高要求的数据采集任务:

网络数据采集(Web Scraping): 使用机器人从网站中提取内容和数据的过程
自动提取(Auto Extract): 自动学习数据模式并从网页中提取每个字段,由尖端的人工智能解决算法驱动
RPA: 机器人流程自动化,这是抓取现代网页的唯一方法
网络即数据库(Network As A Database): 像访问本地数据库一样访问 Web
X-SQL: 直接使用 SQL 查询 Web
Pulsar Session: 提供了一组简单、强大和灵活的 API 来执行 Web 抓取任务
Web Driver: Web 驱动定义了一个简洁的界面来访问网页并与之交互,所有行为都经过优化以尽可能接近真实的人
URL: PulsarRPA 中的 URL 是一个普通的 URL,但是带有描述任务的额外信息。PulsarRPA 中的每个任务都被定义为某种形式的 URL
Hyperlink: PulsarRPA 中的超链接是一个普通的超链接,但是带有描述任务的额外信息
Load Options: 加载选项或加载参数影响 PulsarRPA 如何加载、获取和抓取网页
Event Handlers: 在网页的整个生命周期中捕获和处理事件

点击 PulsarRPA concepts 查看详情。

🧮 通过可执行 jar 使用 PulsarRPA

我们发布了一个基于 PulsarRPA 的独立可执行 jar,它包含:

一组顶尖站点的数据采集示例
基于 自监督机器学习 自动进行信息提取的小程序,AI 算法识别详情页的所有字段,95% 以上字段精确度 99% 以上
基于 自监督机器学习 自动学习并输出所有采集规则的小程序
从命令行直接执行网页数据采集任务,不需要写代码
PulsarRPA 服务器,我们可以向服务器发送 SQL 来采集 Web 数据
一个 Web UI,从中我们可以编写 SQL 并将它们发送到服务器

下载 🎁 将 PulsarRPA 用作软件库

利用 PulsarRPA 强大功能的最简单方法是将其作为库添加到您的项目中。

Maven:

<dependency>
 <groupId>ai.platon.pulsar</groupId>
 <artifactId>pulsar-all</artifactId>
 <version>1.10.22</version>
</dependency>

Gradle:

implementation("ai.platon.pulsar:pulsar-all:1.10.22")

也可以从 github.com 克隆模板项目: java-11, 这个指导来加速构建。

基本用法

Kotlin:

// Create a pulsar session
val session = PulsarContexts.createSession()
// The main url we are playing with
val url = "https://www.amazon.com/dp/B09V3KXJPB"

// Load a page from local storage, or fetch it from the Internet if it does not exist or has expired
val page = session.load(url, "-expires 10s")

// Submit a url to the URL pool, the submitted url will be processed in a crawl loop
session.submit(url, "-expires 10s")

// Parse the page content into a document
val document = session.parse(page)
// do something with the document
// ...

// Load and parse
val document2 = session.loadDocument(url, "-expires 10s")
// do something with the document
// ...

// Load the portal page and then load all links specified by `-outLink`.
// Option `-outLink` specifies the cssSelector to select links in the portal page to load.
// Option `-topLinks` specifies the maximal number of links selected by `-outLink`.
val pages = session.loadOutPages(url, "-expires 10s -itemExpires 10s -outLink a[href~=/dp/] -topLinks 10")

// Load the portal page and submit the out links specified by `-outLink` to the URL pool.
// Option `-outLink` specifies the cssSelector to select links in the portal page to submit.
// Option `-topLinks` specifies the maximal number of links selected by `-outLink`.
session.submitOutPages(url, "-expires 1d -itemExpires 7d -outLink a[href~=/dp/] -topLinks 10")

// Load, parse and scrape fields
val fields = session.scrape(url, "-expires 10s", "#centerCol",
 listOf("#title", "#acrCustomerReviewText"))

// Load, parse and scrape named fields
val fields2 = session.scrape(url, "-i 10s", "#centerCol",
 mapOf("title" to "#title", "reviews" to "#acrCustomerReviewText"))

// Load, parse and scrape named fields
val fields3 = session.scrapeOutPages(url, "-i 10s -ii 10s -outLink a[href~=/dp/] -topLink 10", "#centerCol",
 mapOf("title" to "#title", "reviews" to "#acrCustomerReviewText"))

// Add `-parse` option to activate the parsing subsystem
val page10 = session.load(url, "-parse -expires 10s")

// Kotlin suspend calls
val page11 = runBlocking { session.loadDeferred(url, "-expires 10s") }

// Java-style async calls
session.loadAsync(url, "-expires 10s").thenApply(session::parse).thenAccept(session::export)

示例代码: kotlin, java.

Load options

请注意,我们的大多数抓取方法都接受一个称为加载参数或加载选项的参数,以控制如何加载/获取网页。

-expires // The expiry time of a page
-itemExpires // The expiry time of item pages in batch scraping methods
-outLink // The selector of out links to scrape
-refresh // Force (re)fetch the page, just like hitting the refresh button on a real browser
-parse // Activate parse subsystem
-resource // Fetch the url as a resource without browser rendering

点击 Load Options 查看详情。

提取网页数据

PulsarRPA 使用 selector-syntax 以获取所有受支持的 CSS 选择器。

Kotlin:

val document = session.loadDocument(url, "-expires 1d")
val price = document.selectFirst('.price').text()

连续采集

在 PulsarRPA 中抓取大量 url 集合或运行连续采集非常简单。

Kotlin:

fun main() {
 val context = PulsarContexts.create()

 val parseHandler = { _: WebPage, document: FeaturedDocument ->
 // do something wonderful with the document
 System.out.println(document.getTitle() + "\t|\t" + document.getBaseUri());
 }
 val urls = LinkExtractors.fromResource("seeds.txt")
 .map { ParsableHyperlink("$it -refresh", parseHandler) }
 context.submitAll(urls)
 // feel free to submit millions of urls here
 context.submitAll(urls)
 // ...
 context.await()
}

Java:

public class ContinuousCrawler {

 private static void onParse(WebPage page, FeaturedDocument document) {
 // do something wonderful with the document
 System.out.println(document.getTitle() + "\t|\t" + document.getBaseUri());
 }

 public static void main(String[] args) {
 PulsarContext context = PulsarContexts.create();

 List<Hyperlink> urls = LinkExtractors.fromResource("seeds.txt")
 .stream()
 .map(seed -> new ParsableHyperlink(seed, ContinuousCrawler::onParse))
 .collect(Collectors.toList());
 context.submitAll(urls);
 // feel free to submit millions of urls here
 context.submitAll(urls);
 // ...
 context.await();
 }
}

示例代码: kotlin, java.

👽 RPA (机器人流程自动化)

随着网站变得越来越复杂,RPA 已成为从某些网站收集数据的唯一途径,例如某些使用自定义字体技术的网站。

PulsarRPA 包含一个 RPA 子系统,提供了一种在网页生命周期中模仿真人的便捷方式,使用 Web 驱动程序与网页交互:滚动、打字、屏幕捕获、鼠标拖放、点击等。这和大家所熟知的 selenium,playwright,puppeteer 类似,不同的是,PulsarRPA 的所有行为都针对大规模数据采集进行优化。

以下是一个典型的 RPA 代码片段,它是从顶级电子商务网站收集数据所必需的:

val options = session.options(args)
val event = options.event.browseEvent
event.onBrowserLaunched.addLast { page, driver ->
 // 预热浏览器,以避免被网站阻止,或选择全局设置,例如您的位置
 warnUpBrowser(page, driver)
}
event.onWillFetch.addLast { page, driver ->
 // 必须先访问引荐来源页面,然后才能访问所需页面
 waitForReferrer(page, driver)
 // 网站可能会阻止我们同时打开过多页面,因此我们应该逐一打开链接
 waitForPreviousPage(page, driver)
}
event.onWillCheckDocumentState.addLast { page, driver ->
 // 等待特殊字段出现在页面上
 driver.waitForSelector("body h1[itemprop=name]")
 // 关闭遮罩层,它可能是促销、广告或其他东西
 driver.click(".mask-layer-close-button")
}
// 访问 URL 并触发事件
session.load(url, options)

The example code can be found here: kotlin。

使用 X-SQL 查询 Web

提取单个页面:

select
 dom_first_text(dom, '#productTitle') as title,
 dom_first_text(dom, '#bylineInfo') as brand,
 dom_first_text(dom, '#price tr td:matches(^Price) ~ td, #corePrice_desktop tr td:matches(^Price) ~ td') as price,
 dom_first_text(dom, '#acrCustomerReviewText') as ratings,
 str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
 from load_and_select('https://www.amazon.com/dp/B09V3KXJPB -i 1s -njr 3', 'body');

执行 X-SQL:

val context = SQLContexts.create()
val rs = context.executeQuery(sql)
println(ResultSetFormatter(rs, withHeader = true))

结果如下:

TITLE | BRAND | PRICE | RATINGS | SCORE
HUAWEI P20 Lite (32GB + 4GB RAM) 5.84" FHD+ Display ... | Visit the HUAWEI Store | 6ドル.10 | 1,349 ratings | 4.40

示例代码: kotlin.

点击 X-SQL 查看关于 X-SQL 的详细介绍和函数说明。

🌐 将 PulsarRPA 作为 REST 服务运行

当 PulsarRPA 作为 REST 服务运行时,X-SQL 可用于随时随地抓取网页或直接查询 Web 数据,无需打开 IDE。

从源代码构建

git clone https://github.com/platonai/pulsar.git
cd pulsar && bin/build-run.sh

对于国内开发者,我们强烈建议您按照这个指导来加速构建。

使用 X-SQL 查询 Web

如果未启动,则启动 pulsar 服务器:

bin/pulsar

在另一个终端窗口中抓取网页:

bin/scrape.sh

该 bash 脚本非常简单,只需使用 curl 发送 X-SQL:

curl -X POST --location "http://localhost:8182/api/x/e" -H "Content-Type: text/plain" -d "
 select
 dom_base_uri(dom) as url,
 dom_first_text(dom, '#productTitle') as title,
 str_substring_after(dom_first_href(dom, '#wayfinding-breadcrumbs_container ul li:last-child a'), '&node=') as category,
 dom_first_slim_html(dom, '#bylineInfo') as brand,
 cast(dom_all_slim_htmls(dom, '#imageBlock img') as varchar) as gallery,
 dom_first_slim_html(dom, '#landingImage, #imgTagWrapperId img, #imageBlock img:expr(width > 400)') as img,
 dom_first_text(dom, '#price tr td:contains(List Price) ~ td') as listprice,
 dom_first_text(dom, '#price tr td:matches(^Price) ~ td') as price,
 str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
 from load_and_select('https://www.amazon.com/dp/B09V3KXJPB -i 1d -njr 3', 'body');"

示例代码: bash, batch, java, kotlin, php.

Json 格式的响应如下:

{
"uuid":"cc611841-1f2b-4b6b-bcdd-ce822d97a2ad",
"statusCode":200,
"pageStatusCode":200,
"pageContentBytes":1607636,
"resultSet":[
{
"title":"Tara Toys Ariel Necklace Activity Set - Amazon Exclusive (51394)",
"listprice":"19ドル.99",
"price":"12ドル.99",
"categories":"Toys & Games|Arts & Crafts|Craft Kits|Jewelry",
"baseuri":"https://www.amazon.com/dp/B09V3KXJPB"
}
],
"pageStatus":"OK",
"status":"OK"
}

点击 X-SQL 查看关于 X-SQL 的详细介绍和函数说明。

📖 循序渐进的课程

我们有一个循序渐进的示例课程:

BasicUsage
LoadOptions
URLs
JvmAsync
Flow
ContinuousCrawler
EventHandler
RPA
WebDriver
MassiveCrawler
X-SQL

📊 日志和指标

PulsarRPA 精心设计了日志和指标子系统,以记录系统中发生的每一个事件。

PulsarRPA 在日志中报告每个页面加载任务执行的状态,因此很容易知道系统中发生了什么,判断系统运行是否健康、回答成功获取多少页面、重试多少页面、使用了多少代理 IP。

只需注意几个符号,您就可以深入了解整个系统的状态:💯 💔 🗙 ⚡ 💿 🔃 🤺。

下面是一组典型的页面加载日志,查看日志格式了解如何阅读日志,从而一目了然地了解整个系统的状态。

2022年09月24日 11:46:26.045 INFO [-worker-14] a.p.p.c.c.L.Task - 3313. 💯 ⚡ U for N got 200 580.92 KiB in 1m14.277s, fc:1 | 75/284/96/277/6554 | 106.32.12.75 | 3xBpaR2 | https://www.walmart.com/ip/Restored-iPhone-7-32GB-Black-T-Mobile-Refurbished/329207863 -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022年09月24日 11:46:09.190 INFO [-worker-32] a.p.p.c.c.L.Task - 3738. 💯 💿 U got 200 452.91 KiB in 55.286s, last fetched 9h32m50s ago, fc:1 | 49/171/82/238/6172 | 121.205.220.179 | https://www.walmart.com/ip/Boost-Mobile-Apple-iPhone-SE-2-Cell-Phone-Black-64GB-Prepaid-Smartphone/490934488 -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022年09月24日 11:46:28.567 INFO [-worker-17] a.p.p.c.c.L.Task - 2269. 💯 🔃 U for SC got 200 565.07 KiB <- 543.41 KiB in 1m22.767s, last fetched 16m58s ago, fc:6 | 58/230/98/295/6272 | 27.158.125.76 | 9uwu602 | https://www.walmart.com/ip/Straight-Talk-Apple-iPhone-11-64GB-Purple-Prepaid-Smartphone/356345388?variantFieldId=actual_color -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022年09月24日 11:47:18.390 INFO [r-worker-8] a.p.p.c.c.L.Task - 3732. 💔 ⚡ U for N got 1601 0 <- 0 in 32.201s, fc:1/1 Retry(1601) rsp: CRAWL, rrs: EMPTY_0B | 2zYxg52 | https://www.walmart.com/ip/Apple-iPhone-7-256GB-Jet-Black-AT-T-Locked-Smartphone-Grade-B-Used/182353175?variantFieldId=actual_color -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022年09月24日 11:47:13.860 INFO [-worker-60] a.p.p.c.c.L.Task - 2828. 🗙 🗙 U for SC got 200 0 <- 348.31 KiB <- 684.75 KiB in 0s, last fetched 18m55s ago, fc:2 | 34/130/52/181/5747 | 60.184.124.232 | 11zTa0r2 | https://www.walmart.com/ip/Walmart-Family-Mobile-Apple-iPhone-11-64GB-Black-Prepaid-Smartphone/209201965?athbdg=L1200 -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000