[Open-source self-recommendation] x-crawl: a flexible, multi-purpose Node.js crawler library #33

coder-hxl opened on Mar 20, 2023

x-crawl

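x-crawl is a flexible, multi-purpose crawler library for Node.js. Its flexible usage and broad feature set help you crawl pages, APIs, and files quickly, safely, and reliably.

If you like x-crawl, please consider starring the repository. Thank you for your support!

GitHub: https://github.com/coder-hxl/x-crawl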

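Features

🔥 Async/Sync - Switch between asynchronous and synchronous crawling by changing the mode property.
⚙️ Multiple uses - Crawl pages, APIs, and files, or crawl on a polling schedule, to cover a range of scenarios.
🖋️ Flexible syntax - The same crawl API accepts several configuration styles, each with its own character.
⏱️ Interval crawling - No interval, a fixed interval, or a random interval, to allow or avoid highly concurrent crawling.
🔄 Retry on failure - Avoid failed crawls caused by transient problems, with a configurable number of retries.
➡️ Proxy rotation - Combined with retry on failure, rotates proxies automatically based on a configurable error count and HTTP status codes.
👀 Device fingerprint - Zero configuration or custom configuration, to avoid fingerprinting that identifies and tracks us across different locations.
🚀 Priority queue - A crawl target with a higher priority is crawled ahead of other targets.
☁️ Crawl SPAs - Crawl SPAs (single-page applications) to generate pre-rendered content (that is, "SSR", server-side rendering).
⚒️ Page control - Submit forms, send keyboard input, trigger events, take screenshots of pages, and more.
🧾 Capture records - Records each crawl and reports it in the terminal with colored strings.
🦾 TypeScript - Ships with its own types, with complete typing implemented through generics.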
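
To make the interval and retry features above more concrete, here is a minimal TypeScript sketch of the general technique. It is not x-crawl's implementation: `randomInterval`, `withRetry`, and the injectable `random` and `sleep` parameters are hypothetical names introduced only for illustration.

```typescript
// Hypothetical helpers; a sketch of the technique, not x-crawl's code.

interface IntervalRange {
  min: number // lower bound in milliseconds
  max: number // upper bound in milliseconds
}

// Pick a delay in [min, max). `random` is injectable so the result is testable.
function randomInterval(
  range: IntervalRange,
  random: () => number = Math.random
): number {
  return Math.floor(range.min + random() * (range.max - range.min))
}

// Run `task`; on failure, wait a random interval and try again,
// up to `maxRetry` additional attempts. Rethrows the last error.
async function withRetry<T>(
  task: () => Promise<T>,
  maxRetry: number,
  range: IntervalRange,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt <= maxRetry; attempt++) {
    try {
      return await task()
    } catch (error) {
      lastError = error
      // Only wait if another attempt is still allowed.
      if (attempt < maxRetry) await sleep(randomInterval(range))
    }
  }
  throw lastError
}
```

With `maxRetry` set to 3, a task that keeps failing is attempted four times in total. A random delay between attempts spreads requests out instead of sending them at a fixed rhythm.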

Example

As an example, the following automatically fetches some photos of experiences and homes around the world every day:

// 1. Import the module (ES/CJS)
import xCrawl from 'x-crawl'

// 2. Create a crawler instance
const myXCrawl = xCrawl({ maxRetry: 3, intervalTime: { max: 3000, min: 2000 } })

// 3. Set up the crawl task
// Call the startPolling API to start polling; the callback runs once a day
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call the crawlPage API to crawl the pages
  const res = await myXCrawl.crawlPage({
    targets: [
      'https://www.airbnb.cn/s/experiences',
      'https://www.airbnb.cn/s/plus_homes'
    ],
    viewport: { width: 1920, height: 1080 }
  })

  // Collect the image URLs into targets
  const targets: string[] = []
  const elSelectorMap = ['._fig15y', '._aov0j6']
  for (const item of res) {
    const { id } = item
    const { page } = item.data

    // Wait briefly so the page can finish loading
    await new Promise((r) => setTimeout(r, 300))

    // Get the URLs of the images on the page
    const urls = await page!.$$eval(
      `${elSelectorMap[id - 1]} img`,
      (imgEls) => {
        return imgEls.map((item) => item.src)
      }
    )
    targets.push(...urls)

    // Close the page and wait for it to finish closing
    await page!.close()
  }

  // Call the crawlFile API to download the images
  await myXCrawl.crawlFile({ targets, storeDir: './upload' })
})
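
The `{ d: 1 }` argument in the example above describes a one-day polling interval. As a rough illustration of how such a configuration can drive a polling loop, here is a minimal sketch. It is not x-crawl's implementation: `toMilliseconds` and `startPollingSketch` are hypothetical names, the `h` and `m` fields are assumptions (the example only shows `d`), and running the callback once immediately is also an assumption.

```typescript
// Hypothetical sketch of interval-based polling; not x-crawl's code.

interface PollingInterval {
  d?: number // days (the only field used in the example above)
  h?: number // hours (assumed field)
  m?: number // minutes (assumed field)
}

// Convert the interval description into milliseconds.
function toMilliseconds({ d = 0, h = 0, m = 0 }: PollingInterval): number {
  const MINUTE = 60 * 1000
  const HOUR = 60 * MINUTE
  const DAY = 24 * HOUR
  return d * DAY + h * HOUR + m * MINUTE
}

// Run `callback` once immediately, then once per interval.
// The callback receives a 1-based run count and a function that stops polling.
function startPollingSketch(
  interval: PollingInterval,
  callback: (count: number, stopPolling: () => void) => void
): void {
  let count = 0
  const stop = (): void => clearInterval(timer)
  const run = (): void => callback(++count, stop)
  const timer = setInterval(run, toMilliseconds(interval))
  run()
}
```

Note that `setInterval` delays are limited to about 24.8 days in Node.js, so a sketch like this only suits intervals below that limit.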

Note: Do not crawl indiscriminately; check the site's robots.txt before crawling. This example only demonstrates how to use x-crawl.
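
As a rough illustration of what checking robots.txt can involve, the sketch below tests a path against the rules that apply to all crawlers. It is deliberately simplified and is not part of x-crawl: `isDisallowed` is a hypothetical helper that only honours `User-agent: *` groups and prefix-matching `Disallow` rules, ignoring `Allow` rules, wildcards, and agent-specific groups.

```typescript
// Hypothetical, simplified robots.txt check; not part of x-crawl.
// Only `User-agent: *` groups and prefix `Disallow` rules are considered.
function isDisallowed(robotsTxt: string, path: string): boolean {
  let inWildcardGroup = false
  let prevWasUserAgent = false

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim() // strip comments
    const sep = line.indexOf(':')
    if (sep === -1) continue // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase()
    const value = line.slice(sep + 1).trim()

    if (field === 'user-agent') {
      // Consecutive User-agent lines belong to the same group.
      const isWildcard = value === '*'
      inWildcardGroup = prevWasUserAgent
        ? inWildcardGroup || isWildcard
        : isWildcard
      prevWasUserAgent = true
    } else {
      prevWasUserAgent = false
      // An empty Disallow value means nothing is disallowed.
      if (field === 'disallow' && inWildcardGroup && value !== '') {
        if (path.startsWith(value)) return true
      }
    }
  }
  return false
}
```

For real crawling, a complete parser that follows the Robots Exclusion Protocol is a safer choice than this prefix check.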

For more details, see: https://github.com/coder-hxl/x-crawl
