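diff --git a/README.md b/README.md index e335e12..d2a499c 100644 --- a/README.md +++ b/README.md @@ -39,8 +39,21 @@ | May 8, 2020 | Lecture 10: The Road to Decoding
Summary: strategies for filtering candidate decodings (adding a dictionary, a common-word dictionary, or a dictionary ranked by word frequency); four ways to filter data (loop traversal, list comprehensions, filter() with lambda inline functions, and set operations); speeding up decoding by shrinking the dictionary or limiting recursion depth; character-level vs. word-level decoding; trading space for time; recursive thinking; reasoning by analogy; and more.
Code: [Jupyter Notebook](https://github.com/fly51fly/Practical_Python_Programming/blob/master/code/Python_Class_10.ipynb)
![Lecture 10 mind map](https://github.com/fly51fly/Practical_Python_Programming/blob/master/images/class_10_001.jpg)| [L10.1](https://www.bilibili.com/video/av92186118?p=25) | | - | Lecture 10 Q&A ([question list](https://github.com/fly51fly/Practical_Python_Programming/blob/master/questions/question_010.md))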
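Summary: how to trace the execution of a recursion; using context to filter candidate decodings; how to replace recursion with a while loop; which other Python built-ins accept lambda inline functions; whether grammatical information can be used for filtering; how to use filter(); trading time for space vs. trading space for time; whether a different word-frequency file can be used; using iterators; iter() vs. \_\_iter\_\_(); whether Huffman coding can be combined with the approach; whether a library of common sentences can be used for decoding; what to do when writing code on your own feels hard; ideas for speeding up decoding; using generators; whether converting an iterator to a list loses information; computing set intersections and unions; how to structure a program in layered encapsulation; how to add words missing from the word-frequency file; and more. | [L10.2](https://www.bilibili.com/video/av92186118?p=26) | | May 15, 2020 | Lecture 11: Backward Maximum Matching Decoding and a First Look at Web Crawlers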
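Summary: implementing backward maximum matching decoding recursively, based on the idea of matching longer words first; a first look at the basic concepts and workflow of a crawler; a simple implementation of the three steps of fetch, extract, and store; an introduction to requests, plus a review of writing regular expressions and using the re library.
Code: [Jupyter Notebook](https://github.com/fly51fly/Practical_Python_Programming/blob/master/code/Python_Class_11.ipynb)
![Lecture 11 mind map](https://github.com/fly51fly/Practical_Python_Programming/blob/master/images/class_11_001.jpg)| [L11.1](https://www.bilibili.com/video/av92186118?p=27) | +| - | Web Crawler Lesson 0: What the Browser Does Behind the Scenes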
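Summary: the HTTP/HTML background needed to understand and build crawlers: what the browser does behind the scenes between entering a URL and seeing the page. | [L11.2](https://www.bilibili.com/video/av92186118?p=28) | +| - | Lecture 11 Q&A ([question list](https://github.com/fly51fly/Practical_Python_Programming/blob/master/questions/question_011.md))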
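Summary: "advanced" applications of crawlers; which sites are easy targets; whether forward maximum matching works; pyquery and jQuery; whether crawling is hard; how to crawl images and videos; whether you have to memorize regex syntax to use it well; what anti-crawling measures sites use; rules to follow when crawling; how the minimum is defined during recursion; whether every site can be crawled; what the header part of a requests call is for; crawling strategies for Baidu share links and their passwords; how to match "any symbol" with a regular expression; the difference between GET and POST; crawlers and data visualization; handling pagination links; good courses for learning crawling; what to do when the lecture material feels overwhelming; whether extraction is possible without analyzing the page source; common anti-crawling techniques; how to deal with CAPTCHAs; and more. | [L11.3](https://www.bilibili.com/video/av92186118?p=29) | +| May 22, 2020 | Lecture 12: Initial Design of a Crawler Framework and First Attempts on Bilibili and Douban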
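Summary: building a base crawler class (framework); the differences between a framework, a scaffold, and a library; a review of class design and construction; crawling the Bilibili and Douban ranking pages; writing complete regular expressions for information extraction; getting familiar with regex debugging tools such as RegexBuilder; "cloning" a browser request with the Copy as cURL -> requests trick.
Code: [Jupyter Notebook](https://github.com/fly51fly/Practical_Python_Programming/blob/master/code/Python_Class_12.ipynb)
![Lecture 12 mind map](https://github.com/fly51fly/Practical_Python_Programming/blob/master/images/class_12_001.jpg)| [L12.1](https://www.bilibili.com/video/av92186118?p=30) | +| - | Lecture 12 Q&A ([question list](https://github.com/fly51fly/Practical_Python_Programming/blob/master/questions/question_012.md))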
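Summary: whether crawlers can collect videos; whether headers are required; whether regular expressions can match Chinese; crawlers that apply different patterns in different situations; the difference between UTF-8 and GBK; why the Douban URL used in class gets lengthened by escaping; where to find common Windows command-line commands; whether a crawler can scrape any site; how to crawl URLs protected by passwords or similar measures; whether every backslash in a regex must be escaped; how to crawl dynamic ranking pages; what Python frameworks can do besides building websites; how to get data from sites whose anti-crawling measures cannot be bypassed; how to write patterns when page content lacks regularity; what to do when the Terminal page has no refresh option in the right-click menu; whether Python is the best choice for crawlers; how to scrape pages that load more content on scroll or on clicking "more"; how to find the target URL on pages like Baidu Translate; whether crawling carries legal risk; what kinds of business are suited to building a framework; and more. | [L12.2](https://www.bilibili.com/video/av92186118?p=31) | +| May 29, 2020 | Lecture 13: Evolving the Douban Crawler and Refining the Crawler Base Class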
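Summary: building a complete crawler for a single Douban Books page, focusing on writing the extraction regular expressions and handling missing fields; refining the headers setup in the crawler base class; building a Douban Books page crawler class through inheritance.
Code: [Jupyter Notebook](https://github.com/fly51fly/Practical_Python_Programming/blob/master/code/Python_Class_13.ipynb) | [L13.1](https://www.bilibili.com/video/av92186118?p=32) | +| - | Lecture 13 Q&A ([question list](https://github.com/fly51fly/Practical_Python_Programming/blob/master/questions/question_013.md))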
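Summary: what to do when you are not in the mood to code; what to do when debugging on your own after class feels hard; how to collect stock data and analyze it with charts; crawling data and visualizing it; why modifying a sequence while iterating over it is discouraged; real-world examples of crawlers at work; customizing crawlers and overriding the crawler base class; ways to parse web pages other than regular expressions; how to handle pages that load multiple doc files; the architecture of a batch crawler; how far the crawler from class is from a commercial one; efficient ways to crawl text split across parts; how to crawl content on a specific topic; why garbled characters appear. | [L13.2](https://www.bilibili.com/video/av92186118?p=33) | +| June 5, 2020 | Lecture 14: Advanced Crawling with the DOM Tree and XPath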
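Summary: basic concepts of the DOM and the DOM tree; the concept and basic syntax of XPath; using the XPath Helper extension in Chrome; obtaining and simplifying the XPath for a specific target; parsing page source and locating targets by XPath with the lxml library; layered extraction of page information based on XPath.
Code: [Jupyter Notebook](https://github.com/fly51fly/Practical_Python_Programming/blob/master/code/Python_Class_14.ipynb) | [L14.1](https://www.bilibili.com/video/BV1b7411N7P2?p=34) | +| - | Lecture 14 Q&A ([question list](https://github.com/fly51fly/Practical_Python_Programming/blob/master/questions/question_014.md))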
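Summary: how to learn XPath in depth; whether XPath can fully replace regular expressions; whether large crawlers also need an extraction expression configured for every page; how to handle access-denied errors when fetching images; what @href means in XPath; the difference between absolute and relative XPath paths; whether a DOM tree is the same as a tree in C++; how XPath and regular expressions compare in use cases and efficiency, and how to choose between them; how to find solutions to specific programming problems; what to do when XPath Helper will not install; why decoding is needed before parsing with lxml; how to collect tabular data; the relationship between the DOM and XPath; how to install lxml in PyCharm; everyday applications of crawlers; problems collecting book subtitles on Douban; how to get all the text of a node that has child nodes; whether XPath can be applied to plain strings; how BeautifulSoup and lxml differ in parsing; whether Python can serve web pages; what a model actually is; how to get an XPath quickly without a plugin; and more. | [L14.2](https://www.bilibili.com/video/BV1b7411N7P2?p=35) | +| June 12, 2020 | Lecture 15: Crawling Paginated Results and Analyzing Crawl Targets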
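Summary: approaches to obtaining pagination links; detecting the last-page link; improving a program iteratively; thinking about programming from a problem-solving perspective; developing a sense of "aesthetics" for code; reasoning through special cases; reuse and readability; using URL encode/quote; assessing a site's data availability and ways to extend coverage.
Code: [Jupyter Notebook](https://github.com/fly51fly/Practical_Python_Programming/blob/master/code/Python_Class_15.ipynb) | [L15.1](https://www.bilibili.com/video/BV1b7411N7P2?p=36) | +| - | Lecture 15 Q&A ([question list](https://github.com/fly51fly/Practical_Python_Programming/blob/master/questions/question_015.md))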
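Summary: how to crawl a music site and play the music; whether the Douban crawler could use a class for books; whether pagination can run from the last page backward; why urllib does not provide a urldecode function; which format is best for storing collected data; why pagination often uses start=0/20/40 rather than page=1; whether search engines are crawlers too; whether pagination information can be collected with regular expressions; what to do when collected information is incomplete; how to crawl through page redirects; how to import data into Excel conveniently; crawlers vs. mirroring; whether crawling is illegal; crawling dynamic or partially refreshed pages; how to handle pagination with no last page; whether this is the last lecture... | [L15.2](https://www.bilibili.com/video/BV1b7411N7P2?p=37) | +| June 19, 2020 | Lecture 16: Multi-Level Crawling and Multithreading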
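Summary: nested two-level crawling that combines tag collection with book-list page collection; multithreading concepts: processes, threads, synchronous/asynchronous, blocking/non-blocking, thread pools, and so on; parallel crawling with multiple threads using the concurrent.futures standard library.
Code: [Jupyter Notebook](https://github.com/fly51fly/Practical_Python_Programming/blob/master/code/Python_Class_16.ipynb) | [L16.1](https://www.bilibili.com/video/BV1b7411N7P2?p=38) | +| | Lecture 16 Q&A ([question list](https://github.com/fly51fly/Practical_Python_Programming/blob/master/questions/question_016.md))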
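Summary: what the GIL is; why thread pools suit I/O-bound workloads while process pools suit CPU-bound workloads; practical uses of multiprocessing; whether suddenly getting no data mid-crawl means the crawler was detected; why multithreading speeds things up; how to set crawl rules when a second-level list spans many pages; whether a thread pool still occupies memory after it finishes and empties; whether Douban limits how much a crawler can collect; what determines the upper limit of multithreaded crawling; how to determine the optimal number of threads for a crawler; how to preserve each item's original page order in multithreaded crawling; and more. | [L16.2](https://www.bilibili.com/video/BV1b7411N7P2?p=39) | +| | Lecture 17: A Deeper Look at Multithreading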
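Summary: the scheduling order of threads; the concept of atomic operations; simplified string formatting with f"{}" strings; the "secret" of print's default parameters; the concept of semaphores; using a semaphore to guarantee "atomic operations"; the concept of non-targeted crawlers; the basic principles of search engines.
Code: [Jupyter Notebook](https://github.com/fly51fly/Practical_Python_Programming/blob/master/code/Python_Class_17.ipynb) | [L17.1](https://www.bilibili.com/video/BV1b7411N7P2?p=40) | +| July 10, 2020 | Lecture 18: Task Queues and Multithreading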
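Summary: the concept and purpose of queues and task queues; controlling a task queue from multiple threads; using "static" threads in a thread pool; fine-grained control of multithreading.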
Code: [Jupyter Notebook](https://github.com/fly51fly/Practical_Python_Programming/blob/master/code/Python_Class_18.ipynb) | [L18.1](https://www.bilibili.com/video/BV1b7411N7P2?p=41) | -Recommended learning resources: 1. [Python Notes in Chinese](https://github.com/lijin-THU/notes-python) 2. [Getting Started with Python in a Thousand Lines of Code](https://github.com/xianhu/LearnPython) 3. [Python Code Execution Visualizer](http://www.pythontutor.com/index.html) diff --git a/code/Python_Class_12.ipynb b/code/Python_Class_12.ipynb new file mode 100644 index 0000000..be57ac4 --- /dev/null +++ b/code/Python_Class_12.ipynb @@ -0,0 +1,389 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 1.采——网页的采集" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "import requests" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "req = requests.get('https://wap.zol.com.cn/top/cell_phone/hot.html')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 2.抽——信息的抽取" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "import re" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "result = re.findall(\n", + " '

(.*?)<\\/p>[\\S\\s]*?(.*?)<\\/span>',\n", + " req.text\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 3.存——保存采集结果" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "with open('mobile.txt', 'w') as f:\n", + " for item in result:\n", + " f.write(item[0] + ' ' + item[1] + '\\n')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cat mobile.txt" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 基础爬虫类(框架)" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import re\n", + "\n", + "class MyCrawler:\n", + " def __init__(self, filename):\n", + " self.filename = filename\n", + " \n", + " def download(self, url):\n", + " r = requests.get(url)\n", + " return r.text\n", + " \n", + " def extract(self, content, pattern):\n", + " result = re.findall(pattern, content)\n", + " return result\n", + " \n", + " def save(self, info):\n", + " with open(self.filename, 'a', encoding='utf-8') as f:\n", + " for item in info:\n", + " f.write('|||'.join(item) + '\\n')\n", + " \n", + " def crawl(self, url, pattern):\n", + " content = self.download(url)\n", + " info = self.extract(content, pattern)\n", + " self.save(info)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 对zol.com.cn进行测试" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "crawler = MyCrawler('mobile.txt')" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "CONTENT = crawler.download('https://wap.zol.com.cn/top/cell_phone/hot.html')" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "crawler.crawl(\n", + " 
'https://wap.zol.com.cn/top/cell_phone/hot.html', \n", + " '

(.*?)<\\/p>[\\S\\s]*?(.*?)<\\/span>'\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cat mobile.txt" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 对bilibili进行测试" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": {}, + "outputs": [], + "source": [ + "b_crawler = MyCrawler('bilibili.txt')" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [], + "source": [ + "c = b_crawler.download('https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3')" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": {}, + "outputs": [], + "source": [ + "info = b_crawler.extract(\n", + " c, \n", + " '(.*?)<\\/a>.*?<\\/i>(.*?)<\\/span>.*?<\\/i>(.*?).*?<\\/i>(.*?).*?

(\\d+)<\\/div>'\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": {}, + "outputs": [], + "source": [ + "b_crawler.save(info)" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": {}, + "outputs": [], + "source": [ + "b_crawler.crawl(\n", + " 'https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3',\n", + " '(.*?)<\\/a>.*?<\\/i>(.*?)<\\/span>.*?<\\/i>(.*?).*?<\\/i>(.*?).*?
(\\d+)<\\/div>',\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": {}, + "outputs": [], + "source": [ + "!rm bilibili.txt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "with open('bilibili.txt','r',encoding='utf-8') as f:\n", + " lines = f.read()\n", + " for line in lines.split('\\n'):\n", + " print(line)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cat bilibili.txt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "b_crawler.crawl(\n", + " 'https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3',\n", + " '(.*?)<\\/a>.*?<\\/i>(.*?)<\\/span>.*?<\\/i>(.*?).*?<\\/i>(.*?).*?
(\\d+)<\\/div>'\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 对豆瓣进行测试" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": {}, + "outputs": [], + "source": [ + "b_crawler = MyCrawler('douban_book.txt')" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "''" + ] + }, + "execution_count": 70, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "b_crawler.download('https://book.douban.com/tag/%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 用curl.trillworks.com实现Chrome网络请求的"克隆"" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "\n", + "headers = {\n", + " 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',\n", + "}\n", + "\n", + "response = requests.get('https://book.douban.com/tag/%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C', headers=headers)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "47944" + ] + }, + "execution_count": 86, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(response.text)" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 78, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'Neural Networks and Deep Learning' in response.text" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": 
"text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/code/Python_Class_13.ipynb b/code/Python_Class_13.ipynb new file mode 100644 index 0000000..76b73c2 --- /dev/null +++ b/code/Python_Class_13.ipynb @@ -0,0 +1,561 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 基础爬虫类(框架)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import re\n", + "\n", + "class MyCrawler:\n", + " def __init__(self, filename):\n", + " self.filename = filename\n", + " self.headers = {\n", + " 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',\n", + " }\n", + " \n", + " def download(self, url):\n", + " r = requests.get(url, headers=self.headers)\n", + " return r.text\n", + " \n", + " def extract(self, content, pattern):\n", + " result = re.findall(pattern, content)\n", + " return result\n", + " \n", + " def save(self, info):\n", + " with open(self.filename, 'a', encoding='utf-8') as f:\n", + " for item in info:\n", + " f.write('|||'.join(item) + '\\n')\n", + " \n", + " def crawl(self, url, pattern, headers=None):\n", + " if headers:\n", + " self.headers.update(headers)\n", + " content = self.download(url)\n", + " info = self.extract(content, pattern)\n", + " self.save(info)" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\u001b[1;31mDocstring:\u001b[0m\n", + "D.update([E, ]**F) -> None. 
Update D from dict/iterable E and F.\n", + "If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]\n", + "If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v\n", + "In either case, this is followed by: for k in F: D[k] = F[k]\n", + "\u001b[1;31mType:\u001b[0m method_descriptor\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "dict.update?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 对豆瓣进行测试" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "b_crawler = MyCrawler('douban_book.txt')" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "b_crawler.crawl(\n", + " 'https://book.douban.com/tag/%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C',\n", + " 'src=\"(.*?\\d+.jpg)\"[\\S\\s]*?\\s*(.*?)\\s*<\\/div>[\\S\\s]*?
\\s*([\\S\\s]*?)\\s*<\\/div>',\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "metadata": {}, + "outputs": [], + "source": [ + "class MyDoubanCrawler(MyCrawler):\n", + " def extract(self, content, pattern_main, pattern_star):\n", + " result = re.findall(pattern_main, content)\n", + " for index in range(len(result)):\n", + "# for book_info in result:\n", + " if 'allstar' in result[index][4]:\n", + " items = re.findall(pattern_star, result[index][4])\n", + " else:\n", + " items = [['0', '0', '0']]\n", + " result[index] = list(result[index])\n", + " del result[index][4]\n", + " result[index].extend(items[0])\n", + "# print(result[index])\n", + " return result\n", + " \n", + " def crawl(self, url, pattern_main, pattern_star, headers=None):\n", + " if headers:\n", + " self.headers.update(headers)\n", + " content = self.download(url)\n", + " info = self.extract(content, pattern_main, pattern_star)\n", + " self.save(info)" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": {}, + "outputs": [], + "source": [ + "b_douban_crawler = MyDoubanCrawler('douban_book_new.txt')" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": {}, + "outputs": [], + "source": [ + "b_douban_crawler.crawl(\n", + " 'https://book.douban.com/tag/%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C',\n", + " 'src=\"(.*?\\d+.jpg)\"[\\S\\s]*?\\s*(.*?)\\s*<\\/div>[\\S\\s]*?
\\s*([\\S\\s]*?)\\s*<\\/div>',\n", + " 'allstar(\\d+)\"[\\S\\s]*?rating_nums\">([^<]*?)<\\/span>[\\S\\s]*?\\((\\d+)'\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "https://img1.doubanio.com/view/subject/s/public/s33631858.jpg|||https://book.douban.com/subject/35044046/|||神经网络与深度学习|||邱锡鹏 / 机械工业出版社 / 2020年4月10日 / 149.00元|||45|||9.3|||149\n", + "https://img9.doubanio.com/view/subject/s/public/s29738046.jpg|||https://book.douban.com/subject/30192800/|||Python神经网络编程|||[英]塔里克·拉希德(Tariq Rashid) / 林赐 / 人民邮电出版社 / 2018-4 / 69.00元|||45|||9.2|||453\n", + "https://img1.doubanio.com/view/subject/s/public/s29839337.jpg|||https://book.douban.com/subject/30293801/|||Python深度学习|||[美] 弗朗索瓦•肖莱 / 张亮 / 人民邮电出版社 / 2018-8 / 119.00元|||50|||9.5|||570\n", + "https://img9.doubanio.com/view/subject/s/public/s29815955.jpg|||https://book.douban.com/subject/30270959/|||深度学习入门|||[ 日] 斋藤康毅 / 陆宇杰 / 人民邮电出版社 / 2018-7 / 59.00元|||45|||9.4|||506\n", + "https://img1.doubanio.com/view/subject/s/public/s32295077.jpg|||https://book.douban.com/subject/33414479/|||深度学习的数学|||[日]涌井良幸、[日]涌井贞美 / 杨瑞龙 / 人民邮电出版社 / 2019-4 / 69.00元|||45|||9.0|||111\n", + "https://img1.doubanio.com/view/subject/s/public/s33557648.jpg|||https://book.douban.com/subject/34941715/|||数字思维|||[葡] 阿林多•奥利维拉 / 胡小锐 / 中信出版社 / 2020年1月1日 / 69.00|||45|||8.6|||12\n", + "https://img9.doubanio.com/view/subject/s/public/s33545334.jpg|||https://book.douban.com/subject/34927262/|||深入浅出图神经网络:GNN原理解析|||刘忠雨 李彦霖 周洋 著 / 机械工业出版社 / 2019年12月25日 / 89元|||25|||5.2|||38\n", + "https://img9.doubanio.com/view/subject/s/public/s28855545.jpg|||https://book.douban.com/subject/26727997/|||Neural Networks and Deep Learning|||Michael Nielsen / 2016-1|||45|||9.4|||202\n", +
"https://img1.doubanio.com/view/subject/s/public/s29936638.jpg|||https://book.douban.com/subject/30236893/|||神经网络设计(原书第2版)|||Martin T. Hagan、Howard B. Demuth、Mark H. Beale / 章毅 / 机械工业出版社 / 2017-11 / 99.00元|||45|||8.8|||15\n", + "https://img3.doubanio.com/view/subject/s/public/s4410591.jpg|||https://book.douban.com/subject/4146246/|||神经网络在应用科学和工程中的应用|||萨马拉辛荷 / 2010-1 / 88.00元|||0|||0|||0\n", + "https://img3.doubanio.com/view/subject/s/public/s29249951.jpg|||https://book.douban.com/subject/26945232/|||Make Your Own Neural Network|||Tariq Rashid / CreateSpace Independent Publishing Platform / 2016年3月31日 / USD 45.00|||50|||9.6|||54\n", + "https://img9.doubanio.com/view/subject/s/public/s29877486.jpg|||https://book.douban.com/subject/30333961/|||图解深度学习与神经网络:从张量到TensorFlow实现|||张平 / 电子工业出版社 / 2018-10 / 79.00元|||0|||0|||0\n", + "https://img3.doubanio.com/view/subject/s/public/s28070570.jpg|||https://book.douban.com/subject/26388161/|||MATLAB神经网络43个案例分析|||王小川、史峰、郁磊、李洋 / 北京航空航天大学出版社 / 2013年8月1日 / CNY 48.00|||45|||8.5|||18\n", + "https://img3.doubanio.com/view/subject/s/public/s3898822.jpg|||https://book.douban.com/subject/2584657/|||Neural Networks and Learning Machines|||Simon O. 
Haykin / Pearson / 2008年11月28日 / USD 252.40|||45|||8.7|||53\n", + "https://img9.doubanio.com/view/subject/s/public/s1695376.jpg|||https://book.douban.com/subject/1138922/|||神经网络原理(原书第2版)|||Simon Haykin / 叶世伟、史忠植 / 机械工业出版社 / 2004-1 / 69.00元|||35|||7.3|||54\n", + "https://img9.doubanio.com/view/subject/s/public/s28342396.jpg|||https://book.douban.com/subject/26666358/|||连接组:造就独一无二的你|||[美] 承现峻 / 孙天齐 / 清华大学出版社 / 2016-1 / 45|||45|||8.5|||285\n", + "https://img1.doubanio.com/view/subject/s/public/s6458908.jpg|||https://book.douban.com/subject/3890040/|||神经网络控制|||徐丽娜 / 2009-7 / 28.00元|||0|||0|||0\n", + "https://img1.doubanio.com/view/subject/s/public/s28107307.jpg|||https://book.douban.com/subject/26422529/|||Neural Networks and Statistical Learning|||Ke-Lin Du、M. N. S. Swamy / Springer / 2013年12月7日 / USD 129.00|||0|||0|||0\n", + "https://img9.doubanio.com/view/subject/s/public/s1663944.jpg|||https://book.douban.com/subject/1159821/|||意识的宇宙|||[美] 杰拉尔德·埃德尔曼、[美] 朱利欧·托诺尼 / 顾凡及 / 上海科学技术出版社 / 2004-1 / 27.00元|||40|||8.3|||192\n", + "https://img1.doubanio.com/view/subject/s/public/s6517517.jpg|||https://book.douban.com/subject/6529821/|||Unsupervised Learning|||A Bradford Book / 1999年6月11日 / USD 40.00|||0|||0|||0\n" + ] + } + ], + "source": [ + "!cat douban_book_new.txt" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [], + "source": [ + "a = [1,2,3]" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [], + "source": [ + "del a[-1]" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[1, 2]" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "[1].extend" + ] + }, + { + "cell_type": "code", +
"execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "https://img1.doubanio.com/view/subject/s/public/s33631858.jpg|||https://book.douban.com/subject/35044046/|||神经网络与深度学习|||邱锡鹏 / 机械工业出版社 / 2020年4月10日 / 149.00元|||\n", + " 9.3\n", + "\n", + " \n", + " (149人评价)\n", + " \n", + "https://img9.doubanio.com/view/subject/s/public/s29738046.jpg|||https://book.douban.com/subject/30192800/|||Python神经网络编程|||[英]塔里克·拉希德(Tariq Rashid) / 林赐 / 人民邮电出版社 / 2018-4 / 69.00元|||\n", + " 9.2\n", + "\n", + " \n", + " (452人评价)\n", + " \n", + "https://img1.doubanio.com/view/subject/s/public/s29839337.jpg|||https://book.douban.com/subject/30293801/|||Python深度学习|||[美] 弗朗索瓦•肖莱 / 张亮 / 人民邮电出版社 / 2018-8 / 119.00元|||\n", + " 9.5\n", + "\n", + " \n", + " (570人评价)\n", + " \n", + "https://img9.doubanio.com/view/subject/s/public/s29815955.jpg|||https://book.douban.com/subject/30270959/|||深度学习入门|||[ 日] 斋藤康毅 / 陆宇杰 / 人民邮电出版社 / 2018-7 / 59.00元|||\n", + " 9.4\n", + "\n", + " \n", + " (506人评价)\n", + " \n", + "https://img1.doubanio.com/view/subject/s/public/s32295077.jpg|||https://book.douban.com/subject/33414479/|||深度学习的数学|||[日]涌井良幸、[日]涌井贞美 / 杨瑞龙 / 人民邮电出版社 / 2019-4 / 69.00元|||\n", + " 9.0\n", + "\n", + " \n", + " (111人评价)\n", + " \n", + "https://img1.doubanio.com/view/subject/s/public/s33557648.jpg|||https://book.douban.com/subject/34941715/|||数字思维|||[葡] 阿林多•奥利维拉 / 胡小锐 / 中信出版社 / 2020年1月1日 / 69.00|||\n", + " 8.6\n", + "\n", + " \n", + " (12人评价)\n", + " \n", + "https://img9.doubanio.com/view/subject/s/public/s33545334.jpg|||https://book.douban.com/subject/34927262/|||深入浅出图神经网络:GNN原理解析|||刘忠雨 李彦霖 周洋 著 / 机械工业出版社 / 2019年12月25日 / 89元|||\n", + " 5.2\n", + "\n", + " \n", + " (38人评价)\n", + " \n", + "https://img9.doubanio.com/view/subject/s/public/s28855545.jpg|||https://book.douban.com/subject/26727997/|||Neural Networks and Deep Learning|||Michael Nielsen / 2016-1|||\n", + " 9.4\n", + "\n", + " \n", + " (202人评价)\n", + " \n", + 
"https://img1.doubanio.com/view/subject/s/public/s29936638.jpg|||https://book.douban.com/subject/30236893/|||神经网络设计(原书第2版)|||Martin T. Hagan、Howard B. Demuth、Mark H. Beale / 章毅 / 机械工业出版社 / 2017-11 / 99.00元|||\n", + " 8.8\n", + "\n", + " \n", + " (15人评价)\n", + " \n", + "https://img3.doubanio.com/view/subject/s/public/s4410591.jpg|||https://book.douban.com/subject/4146246/|||神经网络在应用科学和工程中的应用|||萨马拉辛荷 / 2010-1 / 88.00元|||\n", + " (少于10人评价)\n", + " \n", + "https://img3.doubanio.com/view/subject/s/public/s29249951.jpg|||https://book.douban.com/subject/26945232/|||Make Your Own Neural Network|||Tariq Rashid / CreateSpace Independent Publishing Platform / 2016年3月31日 / USD 45.00|||\n", + " 9.6\n", + "\n", + " \n", + " (54人评价)\n", + " \n", + "https://img9.doubanio.com/view/subject/s/public/s29877486.jpg|||https://book.douban.com/subject/30333961/|||图解深度学习与神经网络:从张量到TensorFlow实现|||张平 / 电子工业出版社 / 2018-10 / 79.00元|||\n", + " (少于10人评价)\n", + " \n", + "https://img3.doubanio.com/view/subject/s/public/s28070570.jpg|||https://book.douban.com/subject/26388161/|||MATLAB神经网络43个案例分析|||王小川、史峰、郁磊、李洋 / 北京航空航天大学出版社 / 2013年8月1日 / CNY 48.00|||\n", + " 8.5\n", + "\n", + " \n", + " (18人评价)\n", + " \n", + "https://img3.doubanio.com/view/subject/s/public/s3898822.jpg|||https://book.douban.com/subject/2584657/|||Neural Networks and Learning Machines|||Simon O. 
Haykin / Pearson / 2008年11月28日 / USD 252.40|||\n", + " 8.7\n", + "\n", + " \n", + " (53人评价)\n", + " \n", + "https://img9.doubanio.com/view/subject/s/public/s1695376.jpg|||https://book.douban.com/subject/1138922/|||神经网络原理(原书第2版)|||Simon Haykin / 叶世伟、史忠植 / 机械工业出版社 / 2004-1 / 69.00元|||\n", + " 7.3\n", + "\n", + " \n", + " (54人评价)\n", + " \n", + "https://img9.doubanio.com/view/subject/s/public/s28342396.jpg|||https://book.douban.com/subject/26666358/|||连接组:造就独一无二的你|||[美] 承现峻 / 孙天齐 / 清华大学出版社 / 2016-1 / 45|||\n", + " 8.5\n", + "\n", + " \n", + " (285人评价)\n", + " \n", + "https://img1.doubanio.com/view/subject/s/public/s6458908.jpg|||https://book.douban.com/subject/3890040/|||神经网络控制|||徐丽娜 / 2009-7 / 28.00元|||\n", + " (少于10人评价)\n", + " \n", + "https://img1.doubanio.com/view/subject/s/public/s28107307.jpg|||https://book.douban.com/subject/26422529/|||Neural Networks and Statistical Learning|||Ke-Lin Du、M. N. S. Swamy / Springer / 2013年12月7日 / USD 129.00|||\n", + " (少于10人评价)\n", + " \n", + "https://img9.doubanio.com/view/subject/s/public/s1663944.jpg|||https://book.douban.com/subject/1159821/|||意识的宇宙|||[美] 杰拉尔德·埃德尔曼、[美] 朱利欧·托诺尼 / 顾凡及 / 上海科学技术出版社 / 2004-1 / 27.00元|||\n", + " 8.3\n", + "\n", + " \n", + " (192人评价)\n", + " \n", + "https://img1.doubanio.com/view/subject/s/public/s6517517.jpg|||https://book.douban.com/subject/6529821/|||Unsupervised Learning|||A Bradford Book / 1999年6月11日 / USD 40.00|||\n", + " (少于10人评价)\n", + " \n", + "\n" + ] + } + ], + "source": [ + "with open('douban_book.txt', encoding='utf-8') as f:\n", + " print(f.read())" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "\n", + "headers = {\n", + " 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',\n", + "}\n", + "\n", + "response = requests.get('https://book.douban.com/tag/%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C', headers=headers)\n" + ] 
+ }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "48038" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(response.text)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "assert('Neural Networks and Deep Learning' in response.text)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "import re" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('https://img1.doubanio.com/view/subject/s/public/s33631858.jpg',\n", + " 'https://book.douban.com/subject/35044046/',\n", + " '神经网络与深度学习',\n", + " '邱锡鹏 / 机械工业出版社 / 2020年4月10日 / 149.00元',\n", + " '\\n 9.3\\n\\n \\n (149人评价)\\n '),\n", + " ('https://img9.doubanio.com/view/subject/s/public/s29738046.jpg',\n", + " 'https://book.douban.com/subject/30192800/',\n", + " 'Python神经网络编程',\n", + " '[英]塔里克·拉希德(Tariq Rashid) / 林赐 / 人民邮电出版社 / 2018-4 / 69.00元',\n", + " '\\n 9.2\\n\\n \\n (450人评价)\\n '),\n", + " ('https://img1.doubanio.com/view/subject/s/public/s29839337.jpg',\n", + " 'https://book.douban.com/subject/30293801/',\n", + " 'Python深度学习',\n", + " '[美] 弗朗索瓦•肖莱 / 张亮 / 人民邮电出版社 / 2018-8 / 119.00元',\n", + " '\\n 9.5\\n\\n \\n (569人评价)\\n '),\n", + " ('https://img9.doubanio.com/view/subject/s/public/s29815955.jpg',\n", + " 'https://book.douban.com/subject/30270959/',\n", + " '深度学习入门',\n", + " '[ 日] 斋藤康毅 / 陆宇杰 / 人民邮电出版社 / 2018-7 / 59.00元',\n", + " '\\n 9.4\\n\\n \\n (505人评价)\\n '),\n", + " ('https://img1.doubanio.com/view/subject/s/public/s32295077.jpg',\n", + " 'https://book.douban.com/subject/33414479/',\n", + " '深度学习的数学',\n", + " '[日]涌井良幸、[日]涌井贞美 / 杨瑞龙 / 人民邮电出版社 / 2019-4 / 69.00元',\n", + " '\\n 9.0\\n\\n \\n (110人评价)\\n '),\n", + " 
('https://img1.doubanio.com/view/subject/s/public/s33557648.jpg',\n", + " 'https://book.douban.com/subject/34941715/',\n", + " '数字思维',\n", + " '[葡] 阿林多•奥利维拉 / 胡小锐 / 中信出版社 / 2020年1月1日 / 69.00',\n", + " '\\n 8.6\\n\\n \\n (12人评价)\\n '),\n", + " ('https://img9.doubanio.com/view/subject/s/public/s33545334.jpg',\n", + " 'https://book.douban.com/subject/34927262/',\n", + " '深入浅出图神经网络:GNN原理解析',\n", + " '刘忠雨\\u3000李彦霖\\u3000周洋\\u3000著 / 机械工业出版社 / 2019年12月25日 / 89元',\n", + " '\\n 5.2\\n\\n \\n (38人评价)\\n '),\n", + " ('https://img9.doubanio.com/view/subject/s/public/s28855545.jpg',\n", + " 'https://book.douban.com/subject/26727997/',\n", + " 'Neural Networks and Deep Learning',\n", + " 'Michael Nielsen / 2016-1',\n", + " '\\n 9.4\\n\\n \\n (202人评价)\\n '),\n", + " ('https://img1.doubanio.com/view/subject/s/public/s29936638.jpg',\n", + " 'https://book.douban.com/subject/30236893/',\n", + " '神经网络设计(原书第2版)',\n", + " 'Martin T. Hagan、Howard B. Demuth、Mark H. Beale / 章毅 / 机械工业出版社 / 2017-11 / 99.00元',\n", + " '\\n 8.8\\n\\n \\n (15人评价)\\n '),\n", + " ('https://img3.doubanio.com/view/subject/s/public/s4410591.jpg',\n", + " 'https://book.douban.com/subject/4146246/',\n", + " '神经网络在应用科学和工程中的应用',\n", + " '萨马拉辛荷 / 2010-1 / 88.00元',\n", + " '\\n (少于10人评价)\\n '),\n", + " ('https://img3.doubanio.com/view/subject/s/public/s29249951.jpg',\n", + " 'https://book.douban.com/subject/26945232/',\n", + " 'Make Your Own Neural Network',\n", + " 'Tariq Rashid / CreateSpace Independent Publishing Platform / 2016年3月31日 / USD 45.00',\n", + " '\\n 9.6\\n\\n \\n (54人评价)\\n '),\n", + " ('https://img9.doubanio.com/view/subject/s/public/s29877486.jpg',\n", + " 'https://book.douban.com/subject/30333961/',\n", + " '图解深度学习与神经网络:从张量到TensorFlow实现',\n", + " '张平 / 电子工业出版社 / 2018-10 / 79.00元',\n", + " '\\n (少于10人评价)\\n '),\n", + " ('https://img3.doubanio.com/view/subject/s/public/s28070570.jpg',\n", + " 'https://book.douban.com/subject/26388161/',\n", + " 'MATLAB神经网络43个案例分析',\n", + " '王小川、史峰、郁磊、李洋 / 北京航空航天大学出版社 / 
2013年8月1日 / CNY 48.00',\n", + " '\\n 8.5\\n\\n \\n (18人评价)\\n '),\n", + " ('https://img3.doubanio.com/view/subject/s/public/s3898822.jpg',\n", + " 'https://book.douban.com/subject/2584657/',\n", + " 'Neural Networks and Learning Machines',\n", + " 'Simon O. Haykin / Pearson / 2008年11月28日 / USD 252.40',\n", + " '\\n 8.7\\n\\n \\n (53人评价)\\n '),\n", + " ('https://img9.doubanio.com/view/subject/s/public/s1695376.jpg',\n", + " 'https://book.douban.com/subject/1138922/',\n", + " '神经网络原理(原书第2版)',\n", + " 'Simon Haykin / 叶世伟、史忠植 / 机械工业出版社 / 2004-1 / 69.00元',\n", + " '\\n 7.3\\n\\n \\n (54人评价)\\n '),\n", + " ('https://img9.doubanio.com/view/subject/s/public/s28342396.jpg',\n", + " 'https://book.douban.com/subject/26666358/',\n", + " '连接组:造就独一无二的你',\n", + " '[美] 承现峻 / 孙天齐 / 清华大学出版社 / 2016-1 / 45',\n", + " '\\n 8.5\\n\\n \\n (285人评价)\\n '),\n", + " ('https://img1.doubanio.com/view/subject/s/public/s6458908.jpg',\n", + " 'https://book.douban.com/subject/3890040/',\n", + " '神经网络控制',\n", + " '徐丽娜 / 2009-7 / 28.00元',\n", + " '\\n (少于10人评价)\\n '),\n", + " ('https://img1.doubanio.com/view/subject/s/public/s28107307.jpg',\n", + " 'https://book.douban.com/subject/26422529/',\n", + " 'Neural Networks and Statistical Learning',\n", + " 'Ke-Lin Du、M. N. S. 
Swamy / Springer / 2013年12月7日 / USD 129.00',\n", + " '\\n (少于10人评价)\\n '),\n", + " ('https://img9.doubanio.com/view/subject/s/public/s1663944.jpg',\n", + " 'https://book.douban.com/subject/1159821/',\n", + " '意识的宇宙',\n", + " '[美] 杰拉尔德·埃德尔曼、[美] 朱利欧·托诺尼 / 顾凡及 / 上海科学技术出版社 / 2004-1 / 27.00元',\n", + " '\\n 8.3\\n\\n \\n (192人评价)\\n '),\n", + " ('https://img1.doubanio.com/view/subject/s/public/s6517517.jpg',\n", + " 'https://book.douban.com/subject/6529821/',\n", + " 'Unsupervised Learning',\n", + " 'A Bradford Book / 1999年6月11日 / USD 40.00',\n", + " '\\n (少于10人评价)\\n ')]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "re.findall(\n", + " 'src=\"(.*?\\d+.jpg)\"[\\S\\s]*?\\s*(.*?)\\s*<\\/div>[\\S\\s]*?
\\s*([\\S\\s]*?)\\s*<\\/div>',\n", + " response.text\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/code/Python_Class_14.ipynb b/code/Python_Class_14.ipynb new file mode 100644 index 0000000..17ac2e8 --- /dev/null +++ b/code/Python_Class_14.ipynb @@ -0,0 +1,415 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Crawler base class" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import re\n", + "\n", + "class MyCrawler:\n", + " def __init__(self, filename):\n", + " self.filename = filename\n", + " self.headers = {\n", + " 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',\n", + " }\n", + " \n", + " def download(self, url):\n", + " r = requests.get(url, headers=self.headers)\n", + " return r.text\n", + " \n", + " def extract(self, content, pattern):\n", + " result = re.findall(pattern, content)\n", + " return result\n", + " \n", + " def save(self, info):\n", + " with open(self.filename, 'a', encoding='utf-8') as f:\n", + " for item in info:\n", + " f.write('|||'.join(item) + '\\n')\n", + " \n", + " def crawl(self, url, pattern, headers=None):\n", + " if headers:\n", + " self.headers.update(headers)\n", + " content = self.download(url)\n", + " info = self.extract(content, pattern)\n", + " self.save(info)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + 
"metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "from lxml import html" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting package metadata (current_repodata.json): ...working... done\n", + "Solving environment: ...working... done\n", + "\n", + "# All requested packages already installed.\n", + "\n" + ] + } + ], + "source": [ + "!conda install requests" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "douban_crawler = MyCrawler('douban.txt')" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "content = douban_crawler.download('https://book.douban.com/tag/%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C')" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "tree = html.fromstring(content)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "book_names = tree.xpath('//h2/a')" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'神经网络与深度学习'" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "book_names[0].text.strip()" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "xpath_str = " + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['神经网络与深度学习',\n", + " 'Python神经网络编程',\n", + " 'Python深度学习',\n", + " '深度学习入门',\n", + " '深度学习的数学',\n", + " '数字思维',\n", + " '深入浅出图神经网络:GNN原理解析',\n", + " 'Neural Networks and Deep Learning',\n", + " '神经网络设计(原书第2版)',\n", + " '神经网络在应用科学和工程中的应用',\n", + " 'Make Your Own Neural Network',\n", + " 
'图解深度学习与神经网络:从张量到TensorFlow实现',\n", + " 'MATLAB神经网络43个案例分析',\n", + " 'Neural Networks and Learning Machines',\n", + " '神经网络原理(原书第2版)',\n", + " '连接组:造就独一无二的你',\n", + " '神经网络控制',\n", + " 'Neural Networks and Statistical Learning',\n", + " '意识的宇宙',\n", + " 'Unsupervised Learning']" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list(map(lambda x: x.text.strip(), tree.xpath(\"//h2/a\")))" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['邱锡鹏 / 机械工业出版社 / 2020年4月10日 / 149.00元',\n", + " '[英]塔里克·拉希德(Tariq Rashid) / 林赐 / 人民邮电出版社 / 2018-4 / 69.00元',\n", + " '[美] 弗朗索瓦•肖莱 / 张亮 / 人民邮电出版社 / 2018-8 / 119.00元',\n", + " '[ 日] 斋藤康毅 / 陆宇杰 / 人民邮电出版社 / 2018-7 / 59.00元',\n", + " '[日]涌井良幸、[日]涌井贞美 / 杨瑞龙 / 人民邮电出版社 / 2019-4 / 69.00元',\n", + " '[葡] 阿林多•奥利维拉 / 胡小锐 / 中信出版社 / 2020年1月1日 / 69.00',\n", + " '刘忠雨\\u3000李彦霖\\u3000周洋\\u3000著 / 机械工业出版社 / 2019年12月25日 / 89元',\n", + " 'Michael Nielsen / 2016-1',\n", + " 'Martin T. Hagan、Howard B. Demuth、Mark H. Beale / 章毅 / 机械工业出版社 / 2017-11 / 99.00元',\n", + " '萨马拉辛荷 / 2010-1 / 88.00元',\n", + " 'Tariq Rashid / CreateSpace Independent Publishing Platform / 2016年3月31日 / USD 45.00',\n", + " '张平 / 电子工业出版社 / 2018-10 / 79.00元',\n", + " '王小川、史峰、郁磊、李洋 / 北京航空航天大学出版社 / 2013年8月1日 / CNY 48.00',\n", + " 'Simon O. Haykin / Pearson / 2008年11月28日 / USD 252.40',\n", + " 'Simon Haykin / 叶世伟、史忠植 / 机械工业出版社 / 2004-1 / 69.00元',\n", + " '[美] 承现峻 / 孙天齐 / 清华大学出版社 / 2016-1 / 45',\n", + " '徐丽娜 / 2009-7 / 28.00元',\n", + " 'Ke-Lin Du、M. N. S. 
Swamy / Springer / 2013年12月7日 / USD 129.00',\n", + " '[美] 杰拉尔德·埃德尔曼、[美] 朱利欧·托诺尼 / 顾凡及 / 上海科学技术出版社 / 2004-1 / 27.00元',\n", + " 'A Bradford Book / 1999年6月11日 / USD 40.00']" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list(map(lambda x: x.text.strip(), tree.xpath(\"//div[@class='pub']\")))" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['本书主要介绍神经网络与深度学习中的基础知识、主要模型(卷积神经网络、递归神经网络等)以及在计算机视觉、自然语言处理等领域的应用。',\n", + " '神经网络是一种模拟人脑的神经网络,以期能够实现类人工智能的机器学习\\n技术。\\n本书揭示神经网络背后的概念,并介绍如何通过Python实现神经网络。全书\\n分为3...',\n", + " '本书由Keras之父、现任Google人工智能研究员的弗朗索瓦•肖莱(François Chollet)执笔,详尽介绍了用Python和Keras进行深度学...',\n", + " '本书是深度学习真正意义上的入门书,深入浅出地剖析了深度学习的原理和相关技术。书中使用Python3,尽量不依赖外部库或工具,从基本的数学知识出发,带领读者从...',\n", + " '《深度学习的数学》基于丰富的图示和具体示例,通俗易懂地介绍了深度学习相关的数学知识。第1章介绍神经网络的概况;第2章介绍理解神经网络所需的数学基础知识;第3...',\n", + " '计算机、细胞和大脑有什么共同之处?计算机是人类设计的电子设备,细胞是经自然进化和选择产生的生物实体,大脑是人类思维的创造者和"容器"。但在某种程度上,它们都...',\n", + " '这是一本从原理、算法、实现、应用4个维度详细讲解图神经网络的著作,在图神经网络领域具有重大的意义。\\n本书作者是图神经网络领域的资深技术专家,作者所在的公司极...',\n", + " 'http://neuralnetworksanddeeplearning.com/',\n", + " '本书是一本易学易懂的神经网络教材,主要讨论网络结构、学习规则、训练技巧和工程应用,紧紧围绕"设计"这一视角组织材料和展开讲解,强调基本原理和训练方法,概念清...',\n", + " '《神经网络在应用科学与工程中的应用:从基本原理到复杂的模式识别》为读者提供了神经网络方面简单但却系统的介绍。\\n《神经网络在应用科学和工程中的应用从基本原理到...',\n", + " '《图解深度学习与神经网络:从张量到TensorFlow实现》是以TensorFlow 为工具介绍神经网络和深度学习的入门书,内容循序渐进,以简单示例和图例的...',\n", + " 'For graduate-level neural network courses offered in the departments of Comput...',\n", + " '★《华尔街日报》2012年度十佳非虚构图书\\n★亚马逊网站2012年编辑选择之百佳图书\\n★《出版人周刊》2012年春季十佳科学类图书\\n【内容简介】\\n基因组让你...',\n", + " '神经网络控制已发展成为"智能控制"的一个新的分支,属先进控制技术,为解决复杂的非线性、不确定、不确知系统的控制问题,开辟了一条新的途径。《神经网络控制(第3...',\n", + " '本书对意识理论进行全面研究,建立在近代神经科学基础上、致力于对意识的产生、及人们对意识的认识如何帮助其"把严格的科学描述与人类知识和经验的宽广领域联系起来"...',\n", + " 'Since its founding in 1989 by Terrence Sejnowski, Neural Computation has becom...']" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": 
"execute_result" + } + ], + "source": [ + "list(map(lambda x: x.text.strip(), tree.xpath(\"//div[@class='info']/p\")))" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\u001b[1;31mInit signature:\u001b[0m \u001b[0mmap\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m/\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;31mDocstring:\u001b[0m \n", + "map(func, *iterables) --> map object\n", + "\n", + "Make an iterator that computes the function using arguments from\n", + "each of the iterables. Stops when the shortest iterable is exhausted.\n", + "\u001b[1;31mType:\u001b[0m type\n", + "\u001b[1;31mSubclasses:\u001b[0m \n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "map" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "book_infos = tree.xpath(\"//li[@class='subject-item']\")" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "神经网络与深度学习 https://book.douban.com/subject/35044046/ 邱锡鹏 / 机械工业出版社 / 2020年4月10日 / 149.00元 \n", + " 本书主要介绍神经网络与深度学习中的基础知识、主要模型(卷积神经网络、递归神经网络等)以及在计算机视觉、自然语言处理等领域的应用。\n", + "Python神经网络编程 https://book.douban.com/subject/30192800/ [英]塔里克·拉希德(Tariq Rashid) / 林赐 / 人民邮电出版社 / 2018-4 / 69.00元 \n", + " 神经网络是一种模拟人脑的神经网络,以期能够实现类人工智能的机器学习\n", + "技术。\n", + "本书揭示神经网络背后的概念,并介绍如何通过Python实现神经网络。全书\n", + "分为3...\n", + "Python深度学习 https://book.douban.com/subject/30293801/ [美] 弗朗索瓦•肖莱 / 张亮 / 人民邮电出版社 / 2018-8 / 119.00元 \n", + " 本书由Keras之父、现任Google人工智能研究员的弗朗索瓦•肖莱(François Chollet)执笔,详尽介绍了用Python和Keras进行深度学...\n", + "深度学习入门 
https://book.douban.com/subject/30270959/ [ 日] 斋藤康毅 / 陆宇杰 / 人民邮电出版社 / 2018-7 / 59.00元 \n", + " 本书是深度学习真正意义上的入门书,深入浅出地剖析了深度学习的原理和相关技术。书中使用Python3,尽量不依赖外部库或工具,从基本的数学知识出发,带领读者从...\n", + "深度学习的数学 https://book.douban.com/subject/33414479/ [日]涌井良幸、[日]涌井贞美 / 杨瑞龙 / 人民邮电出版社 / 2019-4 / 69.00元 \n", + " 《深度学习的数学》基于丰富的图示和具体示例,通俗易懂地介绍了深度学习相关的数学知识。第1章介绍神经网络的概况;第2章介绍理解神经网络所需的数学基础知识;第3...\n", + "数字思维 https://book.douban.com/subject/34941715/ [葡] 阿林多•奥利维拉 / 胡小锐 / 中信出版社 / 2020年1月1日 / 69.00 \n", + " 计算机、细胞和大脑有什么共同之处?计算机是人类设计的电子设备,细胞是经自然进化和选择产生的生物实体,大脑是人类思维的创造者和"容器"。但在某种程度上,它们都...\n", + "深入浅出图神经网络:GNN原理解析 https://book.douban.com/subject/34927262/ 刘忠雨 李彦霖 周洋 著 / 机械工业出版社 / 2019年12月25日 / 89元 \n", + " 这是一本从原理、算法、实现、应用4个维度详细讲解图神经网络的著作,在图神经网络领域具有重大的意义。\n", + "本书作者是图神经网络领域的资深技术专家,作者所在的公司极...\n", + "Neural Networks and Deep Learning https://book.douban.com/subject/26727997/ Michael Nielsen / 2016-1 \n", + " http://neuralnetworksanddeeplearning.com/\n", + "神经网络设计(原书第2版) https://book.douban.com/subject/30236893/ Martin T. Hagan、Howard B. Demuth、Mark H. Beale / 章毅 / 机械工业出版社 / 2017-11 / 99.00元 \n", + " 本书是一本易学易懂的神经网络教材,主要讨论网络结构、学习规则、训练技巧和工程应用,紧紧围绕"设计"这一视角组织材料和展开讲解,强调基本原理和训练方法,概念清...\n", + "神经网络在应用科学和工程中的应用 https://book.douban.com/subject/4146246/ 萨马拉辛荷 / 2010-1 / 88.00元 \n", + " 《神经网络在应用科学与工程中的应用:从基本原理到复杂的模式识别》为读者提供了神经网络方面简单但却系统的介绍。\n", + "《神经网络在应用科学和工程中的应用从基本原理到...\n", + "Make Your Own Neural Network https://book.douban.com/subject/26945232/ Tariq Rashid / CreateSpace Independent Publishing Platform / 2016年3月31日 / USD 45.00 \n", + " N/A\n", + "图解深度学习与神经网络:从张量到TensorFlow实现 https://book.douban.com/subject/30333961/ 张平 / 电子工业出版社 / 2018-10 / 79.00元 \n", + " 《图解深度学习与神经网络:从张量到TensorFlow实现》是以TensorFlow 为工具介绍神经网络和深度学习的入门书,内容循序渐进,以简单示例和图例的...\n", + "MATLAB神经网络43个案例分析 https://book.douban.com/subject/26388161/ 王小川、史峰、郁磊、李洋 / 北京航空航天大学出版社 / 2013年8月1日 / CNY 48.00 \n", + " N/A\n", + "Neural Networks and Learning Machines https://book.douban.com/subject/2584657/ Simon O. 
Haykin / Pearson / 2008年11月28日 / USD 252.40 \n", + " For graduate-level neural network courses offered in the departments of Comput...\n", + "神经网络原理(原书第2版) https://book.douban.com/subject/1138922/ Simon Haykin / 叶世伟、史忠植 / 机械工业出版社 / 2004-1 / 69.00元 \n", + " N/A\n", + "连接组:造就独一无二的你 https://book.douban.com/subject/26666358/ [美] 承现峻 / 孙天齐 / 清华大学出版社 / 2016-1 / 45 \n", + " ★《华尔街日报》2012年度十佳非虚构图书\n", + "★亚马逊网站2012年编辑选择之百佳图书\n", + "★《出版人周刊》2012年春季十佳科学类图书\n", + "【内容简介】\n", + "基因组让你...\n", + "神经网络控制 https://book.douban.com/subject/3890040/ 徐丽娜 / 2009-7 / 28.00元 \n", + " 神经网络控制已发展成为"智能控制"的一个新的分支,属先进控制技术,为解决复杂的非线性、不确定、不确知系统的控制问题,开辟了一条新的途径。《神经网络控制(第3...\n", + "Neural Networks and Statistical Learning https://book.douban.com/subject/26422529/ Ke-Lin Du、M. N. S. Swamy / Springer / 2013年12月7日 / USD 129.00 \n", + " N/A\n", + "意识的宇宙 https://book.douban.com/subject/1159821/ [美] 杰拉尔德·埃德尔曼、[美] 朱利欧·托诺尼 / 顾凡及 / 上海科学技术出版社 / 2004-1 / 27.00元 \n", + " 本书对意识理论进行全面研究,建立在近代神经科学基础上、致力于对意识的产生、及人们对意识的认识如何帮助其"把严格的科学描述与人类知识和经验的宽广领域联系起来"...\n", + "Unsupervised Learning https://book.douban.com/subject/6529821/ A Bradford Book / 1999年6月11日 / USD 40.00 \n", + " Since its founding in 1989 by Terrence Sejnowski, Neural Computation has becom...\n" + ] + } + ], + "source": [ + "for book_info in book_infos:\n", + " book_name_elem = book_info.xpath('.//h2/a')[0]\n", + " book_name = book_name_elem.text.strip()\n", + " book_url = book_name_elem.attrib['href']\n", + " book_pub_info = book_info.xpath(\".//div[@class='pub']\")[0].text.strip()\n", + " book_intro = 'N/A'\n", + " book_intro_elem = book_info.xpath(\".//div[@class='info']/p\")\n", + " if book_intro_elem:\n", + " book_intro = book_intro_elem[0].text.strip()\n", + " print(book_name, book_url, book_pub_info, '\\n', book_intro)\n", + "# break" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'https://book.douban.com/subject/35044046/'" + ] + }, + "execution_count": 48, + 
"metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "book_name_elem.attrib['href']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/code/Python_Class_15.ipynb b/code/Python_Class_15.ipynb new file mode 100644 index 0000000..90bbd40 --- /dev/null +++ b/code/Python_Class_15.ipynb @@ -0,0 +1,842 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import re\n", + "\n", + "class MyCrawler:\n", + " def __init__(self, filename):\n", + " self.filename = filename\n", + " self.headers = {\n", + " 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',\n", + " }\n", + " \n", + " def download(self, url):\n", + " r = requests.get(url, headers=self.headers)\n", + " return r.text\n", + " \n", + " def extract(self, content, pattern):\n", + " result = re.findall(pattern, content)\n", + " return result\n", + " \n", + " def save(self, info):\n", + " with open(self.filename, 'a', encoding='utf-8') as f:\n", + " for item in info:\n", + " f.write('|||'.join(item) + '\\n')\n", + " \n", + " def crawl(self, url, pattern, headers=None):\n", + " if headers:\n", + " self.headers.update(headers)\n", + " content = self.download(url)\n", + " info = self.extract(content, pattern)\n", + " self.save(info)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + 
"import requests\n", + "from lxml import html" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "douban_crawler = MyCrawler('douban.txt')" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [], + "source": [ + "content = douban_crawler.download('https://book.douban.com/tag/?view=type')\n", + "tree = html.fromstring(content)\n", + "tag_url_matches = tree.xpath('//td/a/@href')" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [], + "source": [ + "tag_list = [tag_url[5:] for tag_url in tag_url_matches]" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'小说'" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tag_list[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD\n" + ] + } + ], + "source": [ + "import urllib.parse\n", + "print(urllib.parse.quote('人工智能'))" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=0&type=T\n", + "Last Start ID: 500\n", + "为什么: 关于因果关系的新科学\n", + "智能时代: 大数据与智能革命重新定义未来\n", + "生命3.0: 人工智能时代,人类的进化与重生\n", + "哥德尔、艾舍尔、巴赫: 集异璧之大成\n", + "Python深度学习\n", + "仿生人会梦见电子羊吗?\n", + "奇点临近: 当计算机智能超越人类\n", + "智能商业\n", + "复杂\n", + "深度学习\n", + "动手学深度学习\n", + "机器学习\n", + "深度学习推荐系统\n", + "自然语言处理入门\n", + "AI·未来\n", + "人工智能的未来\n", + "认知: 人行为背后的思维与智能\n", + "深度学习入门: 基于Python的理论与实现\n", + "本源\n", + "天才与算法: 人脑与AI的数学思维\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=20&type=T\n", + "人工智能: 一种现代的方法(第3版)(影印版)\n", + "深度学习: 智能时代的核心驱动力量\n", + 
"人工智能之不能\n", + "统计学习方法\n", + "人工智能\n", + "机器人叛乱: 在达尔文时代找到意义\n", + "Python神经网络编程\n", + "终极算法: 机器学习和人工智能如何重塑世界\n", + "复杂\n", + "创造性思维: 人工智能之父马文·明斯基论教育\n", + "统计学习方法(第2版)\n", + "Pattern Recognition and Machine Learning\n", + "人工智能基础(高中版): 高中版\n", + "智能计算系统\n", + "人工科学: 复杂性面面观\n", + "暗知识:机器认知如何颠覆商业和社会: 机器认知如何颠覆商业和社会\n", + "智慧的疆界: 从图灵机到人工智能\n", + "神经网络与深度学习\n", + "心智社会: 从细胞到人工智能,人类思维的优雅解读\n", + "人工智能时代: 人机共生下财富、工作与思维的大未来\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=40&type=T\n", + "百面机器学习: 算法工程师带你去面试\n", + "人工智能哲学\n", + "集体智慧编程\n", + "人工智能简史\n", + "第二次机器革命: 数字化技术将如何改变我们的经济与社会\n", + "第二次机器革命: 数字化技术将如何改变我们的经济与社会\n", + "人工智能: 一种现代方法(第2版)(中文版)\n", + "人工智能全球格局: 未来趋势与中国位势\n", + "皇帝新脑: 有关电脑、人脑及物理定律\n", + "人生算法\n", + "人类的终极命运: 从旧石器时代到人工智能的未来\n", + "数字思维\n", + "无人驾驶: 人工智能将从颠覆驾驶开始,全面重构人类生活\n", + "人类的认知: 思维的信息加工理论\n", + "GEB——一条永恒的金带\n", + "人工智能: 一种现代的方法\n", + "深度学习:基于案例理解深度神经网络\n", + "算法霸权: 数学杀伤性武器的威胁\n", + "智能的本质 人工智能与机器人领域的64个大问题: 人工智能与机器人领域的64个大问题\n", + "人工智能 (第2版)\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=60&type=T\n", + "深入浅出图神经网络:GNN原理解析\n", + "Bayesian Reasoning and Machine Learning\n", + "The Book of Why: The New Science of Cause and Effect\n", + "必然\n", + "直觉泵和其他思考工具\n", + "软件体的生命周期: 特德·姜科幻小说集\n", + "智能时代: 当所有的机器都能学习思考,我们的生活会如何改变\n", + "推荐系统实践\n", + "AI极简经济学\n", + "Learning From Data: A Short Course\n", + "情感机器: 人类思维与人工智能的未来\n", + "云球(第一部)\n", + "科学的极致:漫谈人工智能\n", + "量子计算机简史\n", + "语音与语言处理: :自然语言处理、计算语言学和语音识别导论\n", + "艾伦·图灵传: 如谜的解谜者\n", + "最有人性的"人": 人工智能带给我们的启示\n", + "被看见的力量: 快手是什么\n", + "心智、语言和机器: 维特根斯坦哲学和人工智能科学的对话\n", + "控制论: 或关于在动物和机器中控制和通信的科学\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=80&type=T\n", + "Python深度学习:基于PyTorch\n", + "如何创造思维: 人类思想所揭示出的奥秘\n", + "错觉: AI如何通过数据挖掘误导我们\n", + "人工智能产品经理——AI时代PM修炼手册\n", + "机器崛起: 遗失的控制论历史\n", + "Chatbot从0到1: 对话式交互设计实践指南\n", + "未来地图: 技术、商业和我们的选择\n", + "凸优化\n", + "超级智能: 路线图、危险性与应对策略\n", + "不会被机器替代的人: 智能时代的生存策略\n", + "机器学习实战\n", + 
"人生新算法: 用人工智能解读时间、幸运与财富\n", + "图灵的秘密: 他的生平、思想及论文解读\n", + "Reinforcement Learning: An Introduction (second edition)\n", + "Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques for Building Intelligent Systems\n", + "脑机穿越: 脑机接口改变人类未来\n", + "深入理解AutoML和AutoDL:构建自动化机器学习与深度学习平台\n", + "智能战略\n", + "计算机与人脑\n", + "心灵的未来\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=100&type=T\n", + "Artificial Intelligence: A Modern Approach\n", + "知识图谱:概念与技术\n", + "深度思考: 人工智能的终点与人类创造力的起点\n", + "灵魂机器的时代: 当计算机超过人类智能时/开放人文\n", + "人工智能的未来\n", + "Tensorflow:实战Google深度学习框架\n", + "智能浪潮: 增强时代来临\n", + "KK三部曲: 失控+科技想要什么+必然\n", + "智能革命: 迎接人工智能时代的社会、经济与文化变革\n", + "机器学习\n", + "Superintelligence: Paths, Dangers, Strategies\n", + "人机平台:商业未来行动路线图\n", + "人人都该懂的人工智能\n", + "The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition\n", + "Learning OpenCV 3: Computer Vision in C++ with the OpenCV Library\n", + "携程人工智能实践\n", + "产品经理进阶:100个案例搞懂人工智能\n", + "神经网络在应用科学和工程中的应用: 从基本原理到复杂的模式识别\n", + "Deep Learning: Adaptive Computation and Machine Learning series\n", + "黑镜: 科幻与悬疑的绝佳组合之书\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=120&type=T\n", + "与机器人共舞\n", + "Python机器学习基础教程\n", + "Probabilistic Graphical Models: Principles and Techniques\n", + "机器学习实战:基于Scikit-Learn和TensorFlow\n", + "第四次革命\n", + "计算机视觉: 一种现代方法 第二版\n", + "神经网络设计\n", + "Foundations of Machine Learning\n", + "Information Theory, Inference and Learning Algorithms\n", + "图解机器学习\n", + "被人工智能操控的金融业: 人工知能が金融を支配する日\n", + "强化学习(第2版)\n", + "Godel, Escher, Bach: An Eternal Golden Braid\n", + "统计自然语言处理基础\n", + "基于深度学习的自然语言处理\n", + "深度学习导论\n", + "机器情人: 当情感被算法操控\n", + "神经网络设计(原书第2版)\n", + "数据挖掘导论: Introduction to Data Mining\n", + "智能机器如何思考: 深度神经网络的秘密\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=140&type=T\n", + "机器视觉\n", + "统计学习理论\n", + "人工智能导论: Introduction to Artificial Intelligence\n", + 
"Deep Learning with Python\n", + "Artificial Intelligence for Games, Second Edition: Intelligence for Games\n", + "南京大学人工智能本科专业教育培养体系: 培养体系\n", + "信息论、推理与学习算法\n", + "TensorFlow:实战Google深度学习框架(第2版)\n", + "计算机不能做什么: 人工智能的极限\n", + "你一定爱读的人工智能简史\n", + "人工智能产品经理:人机对话系统设计逻辑探究\n", + "推荐系统\n", + "硬战:人工智能时代的爆款产品\n", + "人工智能哲学\n", + "大脑的未来: 神经科学的愿景与隐忧\n", + "我们最后的发明: 人工智能与人类时代的终结\n", + "情感分析:挖掘观点、情感和情绪: 挖掘观点、情感和情绪\n", + "实用多元统计分析\n", + "The Singularity Is Near: When Humans Transcend Biology\n", + "决战大数据(升级版): 大数据的关键思考\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=160&type=T\n", + "分布式机器学习:算法、理论与实践\n", + "给孩子的人工智能图解: 明天开始就想用上的68个关键词\n", + "知识图谱:方法、实践与应用\n", + ""AI失业"时代生存指南: 未来5年在职场会发生什么\n", + "AI赋能:AI重新定义产品经理\n", + "艾比斯之梦\n", + "合作的复杂性: 基于参与者的竞争与合作模型\n", + "微粒社会\n", + "人工智能: 复杂问题求解的结构和策略\n", + "深度学习与图像识别:原理与实践: 学习图像识别,这本书轻松带你从0到100!阿里巴巴达摩院算法专家领衔\n", + "对冲之王(经典版): 华尔街量化投资传奇\n", + "Data-Driven Science and Engineering: Machine Learning, Dynam\n", + "Introduction to Information Retrieval\n", + "机器翻译\n", + "机器人法\n", + "算法交易员:会赚钱的人工智能\n", + "金羊毛: 世界科幻大师丛书\n", + "万物都相爱\n", + "神经网络与机器学习(原书第3版)\n", + "游戏人工智能\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=180&type=T\n", + "剑桥五重奏——机器能思考吗?: 机器能思考吗?\n", + "皇帝新脑\n", + "Introduction to Linear Algebra: Fifth Edition\n", + "语言与心智\n", + "强化学习:原理与Python实现\n", + "机·智: 从数字化车间走向智能制造\n", + "What Computers Still Can't Do: A Critique of Artificial Reason\n", + "2小时读懂物联网\n", + "人工智能: 人工智能·智能系统指南(原书第2版)\n", + "Machine Learning: A Probabilistic Perspective\n", + "The Sciences of the Artificial\n", + "人工智能十万个为什么:热AI\n", + "人工智能与法律的对话\n", + "如何创造可信的AI\n", + "人工智能: 一种现代的方法(第2版)(影印版)\n", + "智能语音时代:商业竞争、技术创新与虚拟永生: 麻省理工科技评论2019全球十大突破性技术,解密苹果、谷歌、Facebook、微\n", + "虚拟人\n", + "机器学习:算法背后的理论与优化(中外学者论AI)\n", + "计算机与人脑\n", + "意识的解释\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=200&type=T\n", + "文本数据管理与分析:信息检索与文本挖掘的实用导论\n", + "无心的机器\n", + "Programming Game AI by Example\n", 
+ "Human Compatible: Artificial Intelligence and the Problem of Control\n", + "数字中国: 区块链、智能革命与国家治理的未来\n", + "人工智能及其应用: 第4版\n", + "和机器人一起进化: Generation Robot\n", + "人机共生:谁是不会被机器替代的人(托马斯·达文波特智能商业五部曲)\n", + "统计自然语言处理(第2版)\n", + "Foundations of Statistical Natural Language Processing\n", + "机器学习与优化\n", + "剑桥五重奏: 机器能思考吗\n", + "爱犯错的智能体\n", + "人工智能时代的教育革命\n", + "Causality: Models, Reasoning and Inference\n", + "无所遁形\n", + "数据科学家访谈录: 25位著名数据科学家的真知灼见\n", + "What to Think About Machines That Think: Today's Leading Thinkers on the Age of Machine Intelligence\n", + "无人军队: 自主武器与未来战争\n", + "算法的陷阱: 超级平台、算法垄断与场景欺骗\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=220&type=T\n", + "Neural Networks and Deep Learning\n", + "人工智能会抢哪些工作\n", + "深度学习的数学\n", + "进击的科技: 从爱因斯坦到人工智能\n", + "人工智能简史\n", + "技术奇点\n", + "有限理性适应性工具箱: 适应性工具箱\n", + "图灵的大教堂: 数字宇宙开启智能时代\n", + "Advances in Financial Machine Learning\n", + "心智: 认知科学导论\n", + "Gödel, Escher, Bach: An Eternal Golden Braid\n", + "Neural Networks and Deep Learning\n", + "Army of None: Autonomous Weapons and the Future of War\n", + "工具,还是武器?: 直面人类科技最紧迫的争议性问题\n", + "模式识别\n", + "游戏人工智能编程案例精粹\n", + "Make Your Own Neural Network\n", + "Artificial Intelligence for Everyone\n", + "用户体验设计指南:从方法论到产品设计实践\n", + "白话大数据与机器学习\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=240&type=T\n", + "Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, 2nd Edition\n", + "模式识别: 第四版\n", + "风向: 如何应对互联网变革下的知识焦虑、不确定与个人成长\n", + "人类活动中的理性\n", + "创世纪\n", + "The Creativity Code\n", + "TensorFlow机器学习项目实战\n", + "自动机器学习入门与实践: 使用Python\n", + "情感与学习技术的新视角(21世纪人类学习的革命)\n", + "计算机视觉: 模型、学习和推理\n", + "机器生命的秘密\n", + "人工智能\n", + "TensorFlow实战\n", + "Vision: A Computational Investigation into the Human Representation and Processing of Visual Information\n", + "Python编程第4版\n", + "未来简史\n", + "文本数据挖掘\n", + "人工智能: 复杂问题求解的结构和策略(原书第6版)\n", + "Prediction Machines: The Simple Economics of 
Artificial Intelligence\n", + "人工智能超越人类:技术奇点的冲击\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=260&type=T\n", + "认知科学哲学问题研究\n", + "智能问答与深度学习\n", + "Learning Deep Architectures for AI\n", + "人工智能: 开启颠覆性智能时代\n", + "产品改变世界: Siri如何成功创造千亿市场\n", + "Python Machine Learning Cookbook\n", + "数字创世纪: 人工生命的新科学\n", + "区块链与人工智能:数字经济新时代: 畅销书《区块链与新经济:数字货币2.0时代》全新修订升级版。《互联网\n", + "面向机器智能的TensorFlow实践\n", + "数学之美\n", + "A New Kind of Science\n", + "计算机视觉: 算法与应用\n", + "智能问答\n", + "集体智慧编程\n", + "Neural Network Methods in Natural Language Processing\n", + "神经网络原理(原书第2版)\n", + "人工科学\n", + "Handbook of Collective Intelligence\n", + "Artificial Intelligence: A Modern Approach , 4th Edition\n", + "隐藏的行为: 塑造未来的7种无形力量\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=280&type=T\n", + "预见人力资源——新时代HR的进化方法论\n", + "自然语言处理综论(第二版)\n", + "The Modularity of Mind: An Essay on Faculty Psychology\n", + "教育的未来:人工智能时代的教育变革\n", + "区块链+人工智能 下一个改变世界的经济新模式: 下一个改变世界的经济新模式\n", + "白话深度学习与TensorFlow\n", + "图像局部不变性特征与描述\n", + "机器之心\n", + "未来医疗: 智能时代的个体医疗革命\n", + "人工智能导论\n", + ""深蓝"揭秘: 追寻人工智能圣杯之旅\n", + "新机器的灵魂\n", + "知识图谱\n", + "逻辑人生: 哥德尔传\n", + "You Look Like a Thing and I Love You: How Artificial Intelligence Works and Why It's Making the World a Weirder Place\n", + "机器学习: 贝叶斯和优化方法\n", + "解密搜索引擎技术实战\n", + "我是阿爾法: 論法和人工智能\n", + "Artificial Intelligence for Games (The Morgan Kaufmann Series in Interactive 3D Technology)\n", + "Introduction to Automata Theory,Languages, and Computation\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=300&type=T\n", + "I Am a Strange Loop\n", + "仿生人会梦见电子羊吗?\n", + "Automation and Utopia: Human Flourishing in a World without Work\n", + "无人机网络与通信\n", + "贝叶斯网引论\n", + "贤二机器僧漫游人工智能\n", + "Python数据分析与挖掘实战\n", + "特征提取与图像处理\n", + "狡猾的情感: 为何愤怒、嫉妒、偏见让我们的决策更理性\n", + "Artificial Intelligence: A Very Short Introducion\n", + "Artificial Intelligence: Structures and Strategies for Complex Problem Solving (6th 
Edition)\n", + "人工智能狂潮: 机器人会超越人类吗?\n", + "Computability and Logic\n", + "人工智能导论: 人工智能导论\n", + "刷脸背后: 人脸检测 人脸识别 人脸检索\n", + "可穿戴创意设计:技术与时尚的融合\n", + "Artificial Intelligence in the Age of Neural Networks and Brain Computing\n", + "今日简史: 人类命运大议题\n", + "AI改变设计——人工智能时代的设计师生存手册\n", + "贪婪的大脑: 为何人类会无止境地寻求意义\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=320&type=T\n", + "智能客服机器人\n", + "Understanding Machine Learning: From Theory to Algorithms\n", + "人工智能: 国家人工智能战略行动抓手\n", + "AI思维: 从数据中创造价值的炼金术\n", + "Google如何统治世界:人工智能会是人类的敌人吗?\n", + "情感解剖图鉴\n", + "科学+预见人工智能\n", + "机器学习系统设计\n", + "控制论: 或关于在动物和机器中控制和通信的科学\n", + "新机器的灵魂\n", + "经济奇点: 人工智能时代,我们将如何谋生?\n", + "我眼中的Master\n", + "让生活更美好: 无线电科普丛书\n", + "MXNet深度学习实战\n", + "罐装神仙-壹\n", + "数理情感学: 人类情感的数学逻辑\n", + "Architects of Intelligence: The truth about AI from the people building it\n", + "人工智能原理与方法\n", + "第一本无人驾驶技术书\n", + "白话机器学习算法\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=340&type=T\n", + "感情研究指南: 情感史的框架\n", + "机器文明数学本质\n", + "Machine Learning: An Algorithmic Perspective\n", + "数据挖掘中的新方法:支持向量机: 支持向量机\n", + "Mind as Machine: A History of Cognitive Science\n", + "孤高求败: 阿尔法GO60局精彩绝招详解\n", + "人机共生: 当爱情、战争和生活都自动化了,人类该如何自处\n", + "走近2050:注意力、互联网与人工智能\n", + "Theory of Self-Reproducing Automata\n", + "从无限运算力到无限想象力:设计人工智能概览\n", + "深度学习核心技术与实践\n", + "机器世界\n", + "人工智能关我什么事: 全面了解人工智能如何改变日常生活\n", + "喝掉这"罐"书\n", + "游戏编程中的人工智能技术\n", + "Surfaces and Essences: Analogy as the Fuel and Fire of Thinking\n", + "Artificial Intelligence: Foundations of Computational Agents\n", + "超人诞生: 人类增强的新技术\n", + "没有思想的世界: 科技巨头对独立思考的威胁\n", + "ROS机器人程序设计: (原书第二版)\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=360&type=T\n", + "人工智能新时代:全球人工智能应用真实落地50例\n", + "起点人\n", + "大腦解密手冊: 誰在做決策、現實是什麼、為何沒有人是孤島、科技將如何改變大腦的未\n", + "游戏开发中的人工智能\n", + "机器学习导论(原书第3版)\n", + "概率图模型:原理与技术\n", + "Python自然语言处理实战: 核心技术与算法\n", + "推荐系统开发实战\n", + "机器危机\n", + "Python自然语言处理\n", + "神经网络控制\n", + "玩家\n", + 
"漫画机器学习入门\n", + "科技之巅: 《麻省理工科技评论》50大全球突破性技术深度剖析\n", + "超级技术: 改变未来社会和商业的技术趋势\n", + "Computer Vision: Models, Learning, and Inference\n", + "内向者沟通圣经\n", + "机器学习: 实用案例解析\n", + "情感计算\n", + "白话强化学习与PyTorch\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=380&type=T\n", + "营销三大算法: 引领营销进入算法时代\n", + "《裂变:秒懂人工智能的基础课》\n", + "AI+医疗健康: 智能化医疗健康的应用与未来\n", + "Python程序员面试笔试宝典\n", + "统计学关我什么事: 生活中的极简统计学\n", + "灵魂机器的时代:当计算机超过人类智能时\n", + "计算机视觉: 一种现代方法\n", + "机器与人:埃森哲论新人工智能: 埃森哲论新人工智能\n", + "游戏人工智能编程案例精粹\n", + "解析几何 (第三版)\n", + "精通数据科学:从线性回归到深度学习\n", + "模式分类: 原书第2版\n", + "人脸识别原理及算法: 动态人脸识别系统研究\n", + "The Algebraic Mind: Integrating Connectionism and Cognitive Science (Learning, Development, and Conceptual Change)\n", + "统计机器学习导论\n", + "赛先生的梦魇: 新技术革命二十讲\n", + "TensorFlow技术解析与实战\n", + "企业人工智能战略\n", + "逻辑的引擎\n", + "The Age of Spiritual Machines: When Computers Exceed Human Intelligence\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=400&type=T\n", + "Python机器学习经典实例\n", + "Python数据科学与机器学习\n", + "精通Visual C++指纹模式识别系统算法及实现\n", + "迷人的技术\n", + "Race Against the Machine: How the Digital Revolution is Accelerating Innovation, Driving Productivity, and Irreversibly Tr\n", + "神经网络与深度学习\n", + "On Intelligence: How a New Understanding of the Brain will Lead to the Creation of Truly Intelligent Machines\n", + "身体的智能: 智能科学新视角\n", + "大数据智能: 互联网时代的机器学习和自然语言处理技术\n", + "The New Division of Labor: How Computers Are Creating the Next Job Market\n", + "解析深度学习:语音识别实践\n", + "一本书读懂人工智能\n", + "如何求解问题: 现代启发式方法\n", + "第四次教育革命: 人工智能如何改变教育\n", + "The AI Delusion\n", + "The Mind's I: Fantasies And Reflections On Self & Soul\n", + "智能摄影测量学导论\n", + "让法律人读懂人工智能\n", + "Neural Networks and Statistical Learning\n", + "Neural Networks and Learning Machines: Third Edition\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=420&type=T\n", + "内容算法: 把内容变成价值的效率系统\n", + "复杂的引擎\n", + "人工智能导论\n", + "超级思维: 人类和计算机一起思考的惊人力量\n", + "Talking Nets\n", + 
"计算机程序的构造和解释: 原书第2版\n", + "Memory and the Computational Brain: Why Cognitive Science will Transform Neuroscience\n", + "Matrix Computations\n", + "The Philosophy of Artificial Intelligence\n", + "人工智能会取代人类吗?: 智能时代的人类未来\n", + "机器学习在线:解析阿里云机器学习平台\n", + "人有人的用处: 控制论与社会\n", + "必然\n", + "通信与移动系统\n", + "新版机器人技术手册\n", + "推荐系统: 技术、评估及高效算法\n", + "人工智能的冲击: 失去工作,还是不用工作?\n", + "计算机和人脑\n", + "香农传: 从0到1开创信息时代\n", + "大数据架构商业之路: 从业务需求到技术方案\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=440&type=T\n", + "深入理解XGBoost:高效机器学习算法与进阶\n", + "Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again\n", + "深度强化学习: 原理与实践\n", + "克隆版大脑\n", + "聊天机器人:对话式体验产品设计\n", + "Neural-Symbolic Cognitive Reasoning\n", + "统计之美: 人工智能时代的科学思维\n", + "超人类革命: 生物科技将如何改变我们的未来?\n", + "机器翻译简明教程: 翻译专业本科生系列教材\n", + "Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World\n", + "从掷骰子到阿尔法狗:趣谈概率\n", + "智能驾驶技术:路径规划与导航控制\n", + "计算智能导论\n", + "Python机器学习(原书第2版)\n", + "三体智能革命\n", + "计算机科学中的数学: 信息与智能时代的必修课\n", + "中国城市大洗牌\n", + "人工智能革命: 历史、当下与未来\n", + "AI的25种可能\n", + "玩具\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=460&type=T\n", + "数据挖掘导论\n", + "深度学习与计算机视觉: 算法原理、框架应用与代码实现\n", + "Reinforcement Learning: An Introduction\n", + "Life 3.0: Being Human in the Age of Artificial Intelligence\n", + "Machine Learning in Action\n", + "I Am a Strange Loop\n", + "反常识\n", + "The Future of the Mind: The Scientific Quest to Understand, Enhance, and Empower the Mind\n", + "2030年の世界地図帳: あたらしい経済とSDGs、未来への展望\n", + "2030·终点镇\n", + "心我论: 对自我和灵魂的奇思冥想\n", + "5G+AI智能商业:商业变革和产业机遇\n", + "人工智能学院本硕博培养体系\n", + "Artificial Intelligence (3rd Edition)\n", + "AI世代生存哲學大思考: 人人都必須了解的「新AI學」\n", + "The Zero Marginal Cost Society: The Internet of Things, the Collaborative Commons, and the Eclipse of Capitalism\n", + "Hello World: How to be Human in the Age of the Machine\n", + "Learning with Kernels: Support Vector Machines, Regularization, 
Optimization, and Beyond (Adaptive Computation and Mach\n", + "The Creativity Code: Art and Innovation in the Age of AI\n", + "Abstraction in Artificial Intelligence and Complex Systems\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=480&type=T\n", + "Python深度学习: 用Python快速学习深度神经网络\n", + "人类帝国的覆灭: 一个机器人的回忆录\n", + "Statistical Rethinking: A Bayesian Course with Examples in R and Stan\n", + "用户的本质: 数字化时代的精准运营法则\n", + "人工智能导论\n", + "未来地图: 创造人工智能万亿级产业的商业模式和路径\n", + "Keras快速上手:基于Python的深度学习实战\n", + "Feature Selection for High-Dimensional Data (Artificial Intelligence: Foundations, Theory, and Algorithms)\n", + "万物重构:智能社会来临前夜的思索\n", + "未来之路: 科技、商业和人类的选择\n", + "大脑、机器和数学\n", + "人工智能的进化: 计算机思维离人类心智还有多远\n", + "Programming Collective Intelligence: Building Smart Web 2.0 Applications\n", + "Handbook of Research on Synthesizing Human Emotion in Intelligent Systems and Robotics(智能系统与机器人技术的合成人类情感研究手册(丛书))\n", + "微表情心理学: 读心识人准到骨子\n", + "海伯利安\n", + "Introduction to Bayesian Scientific Computing: Ten Lectures on Subjective Computing (Surveys and Tutorials in the Applied Mathematical Sciences\n", + "机器人战争: 21世纪机器人技术革命与反思\n", + "认知神经科学: 关于心智的生物学\n", + "算法小时代: 从数学到生活的历变\n", + "https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start=500&type=T\n", + "Computer Vision: Algorithms and Applications\n", + "Darwin among the Machines: The Evolution of Global Intelligence\n", + "The Most Human Human: What Talking with Computers Teaches Us About What It Means to Be Alive\n", + "失控: 全人类的最终命运和结局\n", + "人类简史: 从动物到上帝\n" + ] + } + ], + "source": [ + "import re\n", + "import time\n", + "\n", + "page_id = 1\n", + "last_start = 0\n", + "while 1:\n", + " start_id = 20 * (page_id - 1)\n", + " url = 'https://book.douban.com/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD?start={}&type=T'.format(start_id)\n", + " print(url)\n", + " content = douban_crawler.download(url)\n", + " tree = html.fromstring(content)\n", + " if page_id == 1:\n", + " page_links = 
tree.xpath(\"//div[@class='paginator']/a[last()]/@href\")\n", + " if page_links:\n", + " last_start = int(re.findall(r'start=(\\d+)', page_links[0])[0])\n", + " print('Last Start ID: ', last_start)\n", + " book_infos = tree.xpath(\"//li[@class='subject-item']\")\n", + " for book_info in book_infos:\n", + " book_name_elem = book_info.xpath('.//h2/a')[0]\n", + " book_name = re.sub(r'\\s{2,}', '', book_name_elem.text_content().replace('\\n', ''))\n", + " book_url = book_name_elem.attrib['href']\n", + " book_pub_info = book_info.xpath(\".//div[@class='pub']\")[0].text.strip()\n", + " book_intro = 'N/A'\n", + " book_intro_elem = book_info.xpath(\".//div[@class='info']/p\")\n", + " if book_intro_elem:\n", + " book_intro = book_intro_elem[0].text.strip()\n", + " print(book_name)\n", + " page_id += 1\n", + " if start_id == last_start:\n", + " break\n", + " time.sleep(1)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Python深度学习: 用Python快速学习深度神经网络 '" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "re.sub(r'\\s{2,}', '', 'Python深度学习 : 用Python快速学习深度神经网络 ')" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [], + "source": [ + "s = '/tag/神经网络888?start=20&type=T'" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "import re" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'20'" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "re.findall(r'start=(\\d+)', s)[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'20'" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + 
"source": [ + "s[s.index('start=')+6:-7]" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s.index('start=')" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "18" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s.index('&type')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/code/Python_Class_16.ipynb b/code/Python_Class_16.ipynb new file mode 100644 index 0000000..5eaf3d2 --- /dev/null +++ b/code/Python_Class_16.ipynb @@ -0,0 +1,414 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import re\n", + "\n", + "class MyCrawler:\n", + " def __init__(self, filename):\n", + " self.filename = filename\n", + " self.headers = {\n", + " 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',\n", + " }\n", + " \n", + " def download(self, url):\n", + " r = requests.get(url, headers=self.headers)\n", + " return r.text\n", + " \n", + " def extract(self, content, pattern):\n", + " result = re.findall(pattern, content)\n", + " return result\n", + " \n", + " def save(self, info):\n", 
+ " with open(self.filename, 'a', encoding='utf-8') as f:\n", + " for item in info:\n", + " f.write('|||'.join(item) + '\\n')\n", + " \n", + " def crawl(self, url, pattern, headers=None):\n", + " if headers:\n", + " self.headers.update(headers)\n", + " content = self.download(url)\n", + " info = self.extract(content, pattern)\n", + " self.save(info)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "url = 'https://book.douban.com/tag/?view=type'\n", + "content = douban_crawler.download(url)\n", + "tree = html.fromstring(content)\n", + "tags = tree.xpath(\"//td/a/text()\")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'%E5%B0%8F%E8%AF%B4'" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urllib.parse.quote(tags[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Current tag: 小说\n", + "https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T\n", + "Last Start ID: 7600\n", + "活着\n", + "房思琪的初恋乐园\n", + "白夜行\n", + "解忧杂货店\n", + "红楼梦\n", + "追风筝的人\n", + "百年孤独\n", + "小王子\n", + "围城\n", + "平凡的世界(全三部)\n", + "嫌疑人X的献身\n", + "霍乱时期的爱情\n", + "1984\n", + "飘\n", + "月亮与六便士\n", + "三体: "地球往事"三部曲之一\n", + "三体全集: 地球往事三部曲\n", + "局外人\n", + "杀死一只知更鸟\n", + "骆驼祥子\n", + "------------------------------------\n", + "Current tag: 外国文学\n", + "https://book.douban.com/tag/%E5%A4%96%E5%9B%BD%E6%96%87%E5%AD%A6?start=0&type=T\n", + "Last Start ID: 7640\n", + "小王子\n", + "追风筝的人\n", + "百年孤独\n", + "飘\n", + "1984\n", + "霍乱时期的爱情\n", + "月亮与六便士\n", + "月亮和六便士\n", + "杀死一只知更鸟\n", + "傲慢与偏见\n", + "局外人\n", + "动物农场\n", + "安徒生童话故事集\n", + "简爱(英文全本)\n", + "老人与海\n", + "基督山伯爵\n", + "哈利•波特\n", + "一个陌生女人的来信\n", + "牧羊少年奇幻之旅\n", + "肖申克的救赎\n", + "------------------------------------\n", + "Current tag: 
文学\n", + "https://book.douban.com/tag/%E6%96%87%E5%AD%A6?start=0&type=T\n", + "Last Start ID: 7640\n", + "你当像鸟飞往你的山\n", + "房思琪的初恋乐园\n", + "小王子\n", + "红楼梦\n", + "百年孤独\n", + "追风筝的人\n", + "围城\n", + "活着\n", + "平凡的世界(全三部)\n", + "解忧杂货店\n", + "撒哈拉的故事\n", + "霍乱时期的爱情\n", + "月亮和六便士\n", + "1984\n", + "边城\n", + "局外人\n", + "许三观卖血记\n", + "白鹿原: 20周年精装典藏版\n", + "沉默的大多数: 王小波杂文随笔全编\n", + "云边有个小卖部\n", + "------------------------------------\n", + "Current tag: 经典\n", + "https://book.douban.com/tag/%E7%BB%8F%E5%85%B8?start=0&type=T\n", + "Last Start ID: 7820\n", + "活着\n", + "小王子\n", + "红楼梦\n", + "百年孤独\n", + "围城\n", + "飘\n", + "平凡的世界(全三部)\n", + "三体全集: 地球往事三部曲\n", + "骆驼祥子\n", + "月亮与六便士\n", + "哈利•波特\n", + "杀死一只知更鸟\n", + "霍乱时期的爱情\n", + "傲慢与偏见\n", + "1984\n", + "追风筝的人\n", + "边城\n", + "安徒生童话故事集\n", + "围城\n", + "白鹿原: 20周年精装典藏版\n", + "------------------------------------\n", + "Current tag: 中国文学\n", + "https://book.douban.com/tag/%E4%B8%AD%E5%9B%BD%E6%96%87%E5%AD%A6?start=0&type=T\n", + "Last Start ID: 7720\n", + "活着\n", + "围城\n", + "平凡的世界(全三部)\n", + "骆驼祥子\n", + "边城\n", + "城南旧事: 纪念普及版\n", + "明朝那些事儿(1-9): 限量版\n", + "撒哈拉的故事\n", + "红楼梦\n", + "白鹿原: 20周年精装典藏版\n", + "许三观卖血记\n", + "三体全集: 地球往事三部曲\n", + "呐喊\n", + "房思琪的初恋乐园\n", + "平凡的世界\n", + "围城\n", + "沉默的大多数: 王小波杂文随笔全编\n", + "许三观卖血记\n", + "朝花夕拾\n", + "人生海海\n", + "------------------------------------\n" + ] + } + ], + "source": [ + "import re\n", + "import time\n", + "import requests\n", + "from lxml import html\n", + "import urllib.parse\n", + "\n", + "douban_crawler = MyCrawler('douban.txt')\n", + "\n", + "tag_list_url = 'https://book.douban.com/tag/?view=type'\n", + "tag_content = douban_crawler.download(tag_list_url)\n", + "tag_tree = html.fromstring(tag_content)\n", + "tags = tag_tree.xpath(\"//td/a/text()\")\n", + "for tag in tags[:5]:\n", + " print('Current tag:', tag)\n", + " tag = urllib.parse.quote(tag)\n", + " page_id = 1\n", + " last_start = 0\n", + " while 1:\n", + " start_id = 20 * (page_id - 1)\n", + " url = 
'https://book.douban.com/tag/{}?start={}&type=T'.format(tag, start_id)\n", + " print(url)\n", + " content = douban_crawler.download(url)\n", + " tree = html.fromstring(content)\n", + " if page_id == 1:\n", + " page_links = tree.xpath(\"//div[@class='paginator']/a[last()]/@href\")\n", + " if page_links:\n", + " last_start = int(re.findall(r'start=(\\d+)', page_links[0])[0])\n", + " print('Last Start ID: ', last_start)\n", + " book_infos = tree.xpath(\"//li[@class='subject-item']\")\n", + " for book_info in book_infos:\n", + " book_name_elem = book_info.xpath('.//h2/a')[0]\n", + " book_name = re.sub(r'\\s{2,}', '', book_name_elem.text_content().replace('\\n', ''))\n", + " book_url = book_name_elem.attrib['href']\n", + " book_pub_info = book_info.xpath(\".//div[@class='pub']\")[0].text.strip()\n", + " book_intro = 'N/A'\n", + " book_intro_elem = book_info.xpath(\".//div[@class='info']/p\")\n", + " if book_intro_elem:\n", + " book_intro = book_intro_elem[0].text.strip()\n", + " print(book_name)\n", + " page_id += 1\n", + " if start_id == last_start:\n", + " break\n", + " print('------------------------------------')\n", + " break\n", + " time.sleep(1)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "urls = [f'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start={start_id}&type=T' for start_id in range(0, 200, 20)]" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=40&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=60&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=80&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=100&type=T',\n", + " 
'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=120&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=140&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=160&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=180&type=T']" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urls" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=40&type=T' page is 54058 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=80&type=T' page is 52984 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20&type=T' page is 52973 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T' page is 52753 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=60&type=T' page is 52622 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=120&type=T' page is 53638 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=100&type=T' page is 52683 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=140&type=T' page is 54098 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=180&type=T' page is 53970 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=160&type=T' page is 53460 bytes\n", + "Wall time: 1.11 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import concurrent.futures\n", + "import requests\n", + "\n", + "# URLS = ['http://www.163.com/',\n", + "# 'http://www.sina.com.cn/',\n", + "# 'http://baidu.com/',\n", + "# 'http://youdao.com/',\n", + "# 'http://bing.com/']\n", + "\n", + "douban_crawler = MyCrawler('douban.txt')\n", + "\n", + "# Retrieve a single page and report the URL and contents\n", + "def load_url(url):\n", + " global douban_crawler\n", + " return 
douban_crawler.download(url)\n", + "\n", + "# We can use a with statement to ensure threads are cleaned up promptly\n", + "with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", + " # Start the load operations and mark each future with its URL\n", + " future_to_url = {executor.submit(load_url, url): url for url in urls}\n", + " for future in concurrent.futures.as_completed(future_to_url):\n", + " url = future_to_url[future]\n", + " try:\n", + " data = future.result()\n", + " except Exception as exc:\n", + " print('%r generated an exception: %s' % (url, exc))\n", + " else:\n", + " print('%r page is %d bytes' % (url, len(data)))" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T' page is 52753 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20&type=T' page is 52973 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=40&type=T' page is 54058 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=60&type=T' page is 52622 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=80&type=T' page is 52984 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=100&type=T' page is 52683 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=120&type=T' page is 53638 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=140&type=T' page is 54098 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=160&type=T' page is 53460 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=180&type=T' page is 53970 bytes\n", + "Wall time: 2.69 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import concurrent.futures\n", + "\n", + "# URLS = ['http://www.163.com/',\n", + "# 'http://www.sina.com.cn/',\n", + "# 'http://baidu.com/',\n", + "# 'http://youdao.com/',\n", + "# 
'http://bing.com/']\n", + "\n", + "for url in urls:\n", + " data = douban_crawler.download(url)\n", + " print('%r page is %d bytes' % (url, len(data)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/code/Python_Class_17.ipynb b/code/Python_Class_17.ipynb new file mode 100644 index 0000000..4d05e71 --- /dev/null +++ b/code/Python_Class_17.ipynb @@ -0,0 +1,562 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import re\n", + "\n", + "class MyCrawler:\n", + " def __init__(self, filename):\n", + " self.filename = filename\n", + " self.headers = {\n", + " 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',\n", + " }\n", + " \n", + " def download(self, url):\n", + " r = requests.get(url, headers=self.headers)\n", + " return r.text\n", + " \n", + " def extract(self, content, pattern):\n", + " result = re.findall(pattern, content)\n", + " return result\n", + " \n", + " def save(self, info):\n", + " with open(self.filename, 'a', encoding='utf-8') as f:\n", + " for item in info:\n", + " f.write('|||'.join(item) + '\\n')\n", + " \n", + " def crawl(self, url, pattern, headers=None):\n", + " if headers:\n", + " self.headers.update(headers)\n", + " content = self.download(url)\n", + " info = self.extract(content, pattern)\n", + " self.save(info)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + 
"metadata": {}, + "outputs": [], + "source": [ + "url = 'https://book.douban.com/tag/?view=type'\n", + "content = douban_crawler.download(url)\n", + "tree = html.fromstring(content)\n", + "tags = tree.xpath(\"//td/a/text()\")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'%E5%B0%8F%E8%AF%B4'" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urllib.parse.quote(tags[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Current tag: 小说\n", + "https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T\n", + "Last Start ID: 7600\n", + "活着\n", + "房思琪的初恋乐园\n", + "白夜行\n", + "解忧杂货店\n", + "红楼梦\n", + "追风筝的人\n", + "百年孤独\n", + "小王子\n", + "围城\n", + "平凡的世界(全三部)\n", + "嫌疑人X的献身\n", + "霍乱时期的爱情\n", + "1984\n", + "飘\n", + "月亮与六便士\n", + "三体: "地球往事"三部曲之一\n", + "三体全集: 地球往事三部曲\n", + "局外人\n", + "杀死一只知更鸟\n", + "骆驼祥子\n", + "------------------------------------\n", + "Current tag: 外国文学\n", + "https://book.douban.com/tag/%E5%A4%96%E5%9B%BD%E6%96%87%E5%AD%A6?start=0&type=T\n", + "Last Start ID: 7640\n", + "小王子\n", + "追风筝的人\n", + "百年孤独\n", + "飘\n", + "1984\n", + "霍乱时期的爱情\n", + "月亮与六便士\n", + "月亮和六便士\n", + "杀死一只知更鸟\n", + "傲慢与偏见\n", + "局外人\n", + "动物农场\n", + "安徒生童话故事集\n", + "简爱(英文全本)\n", + "老人与海\n", + "基督山伯爵\n", + "哈利•波特\n", + "一个陌生女人的来信\n", + "牧羊少年奇幻之旅\n", + "肖申克的救赎\n", + "------------------------------------\n", + "Current tag: 文学\n", + "https://book.douban.com/tag/%E6%96%87%E5%AD%A6?start=0&type=T\n", + "Last Start ID: 7640\n", + "你当像鸟飞往你的山\n", + "房思琪的初恋乐园\n", + "小王子\n", + "红楼梦\n", + "百年孤独\n", + "追风筝的人\n", + "围城\n", + "活着\n", + "平凡的世界(全三部)\n", + "解忧杂货店\n", + "撒哈拉的故事\n", + "霍乱时期的爱情\n", + "月亮和六便士\n", + "1984\n", + "边城\n", + "局外人\n", + "许三观卖血记\n", + "白鹿原: 20周年精装典藏版\n", + "沉默的大多数: 王小波杂文随笔全编\n", + "云边有个小卖部\n", + 
"------------------------------------\n", + "Current tag: 经典\n", + "https://book.douban.com/tag/%E7%BB%8F%E5%85%B8?start=0&type=T\n", + "Last Start ID: 7820\n", + "活着\n", + "小王子\n", + "红楼梦\n", + "百年孤独\n", + "围城\n", + "飘\n", + "平凡的世界(全三部)\n", + "三体全集: 地球往事三部曲\n", + "骆驼祥子\n", + "月亮与六便士\n", + "哈利•波特\n", + "杀死一只知更鸟\n", + "霍乱时期的爱情\n", + "傲慢与偏见\n", + "1984\n", + "追风筝的人\n", + "边城\n", + "安徒生童话故事集\n", + "围城\n", + "白鹿原: 20周年精装典藏版\n", + "------------------------------------\n", + "Current tag: 中国文学\n", + "https://book.douban.com/tag/%E4%B8%AD%E5%9B%BD%E6%96%87%E5%AD%A6?start=0&type=T\n", + "Last Start ID: 7720\n", + "活着\n", + "围城\n", + "平凡的世界(全三部)\n", + "骆驼祥子\n", + "边城\n", + "城南旧事: 纪念普及版\n", + "明朝那些事儿(1-9): 限量版\n", + "撒哈拉的故事\n", + "红楼梦\n", + "白鹿原: 20周年精装典藏版\n", + "许三观卖血记\n", + "三体全集: 地球往事三部曲\n", + "呐喊\n", + "房思琪的初恋乐园\n", + "平凡的世界\n", + "围城\n", + "沉默的大多数: 王小波杂文随笔全编\n", + "许三观卖血记\n", + "朝花夕拾\n", + "人生海海\n", + "------------------------------------\n" + ] + } + ], + "source": [ + "import re\n", + "import time\n", + "import requests\n", + "from lxml import html\n", + "import urllib.parse\n", + "\n", + "douban_crawler = MyCrawler('douban.txt')\n", + "\n", + "tag_list_url = 'https://book.douban.com/tag/?view=type'\n", + "tag_content = douban_crawler.download(tag_list_url)\n", + "tag_tree = html.fromstring(tag_content)\n", + "tags = tag_tree.xpath(\"//td/a/text()\")\n", + "for tag in tags[:5]:\n", + " print('Current tag:', tag)\n", + " tag = urllib.parse.quote(tag)\n", + " page_id = 1\n", + " last_start = 0\n", + " while 1:\n", + " start_id = 20 * (page_id - 1)\n", + " url = 'https://book.douban.com/tag/{}?start={}&type=T'.format(tag, start_id)\n", + " print(url)\n", + " content = douban_crawler.download(url)\n", + " tree = html.fromstring(content)\n", + " if page_id == 1:\n", + " page_links = tree.xpath(\"//div[@class='paginator']/a[last()]/@href\")\n", + " if page_links:\n", + " last_start = int(re.findall('start=(\\d+)', page_links[0])[0])\n", + " print('Last Start ID: ', 
last_start)\n", + " book_infos = tree.xpath(\"//li[@class='subject-item']\")\n", + " for book_info in book_infos:\n", + " book_name_elem = book_info.xpath('.//h2/a')[0]\n", + " book_name = re.sub('\\s{2,}', '', book_name_elem.text_content().replace('\\n', ''))\n", + " book_url = book_name_elem.attrib['href']\n", + " book_pub_info = book_info.xpath(\".//div[@class='pub']\")[0].text.strip()\n", + " book_intro = 'N/A'\n", + " book_intro_elem = book_info.xpath(\".//div[@class='info']/p\")\n", + " if book_intro_elem:\n", + " book_intro = book_intro_elem[0].text.strip()\n", + " print(book_name)\n", + " page_id += 1\n", + " if start_id == last_start:\n", + " break\n", + " print('------------------------------------')\n", + " break\n", + " time.sleep(1)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "urls = [f'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start={start_id}&type=T' for start_id in range(0, 200, 20)]" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=40&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=60&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=80&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=100&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=120&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=140&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=160&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=180&type=T']" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urls" + ] + }, + { + "cell_type": 
"code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=40&type=T' page is 54058 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=80&type=T' page is 52984 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20&type=T' page is 52973 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T' page is 52753 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=60&type=T' page is 52622 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=120&type=T' page is 53638 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=100&type=T' page is 52683 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=140&type=T' page is 54098 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=180&type=T' page is 53970 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=160&type=T' page is 53460 bytes\n", + "Wall time: 1.11 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import concurrent.futures\n", + "import requests\n", + "\n", + "# URLS = ['http://www.163.com/',\n", + "# 'http://www.sina.com.cn/',\n", + "# 'http://baidu.com/',\n", + "# 'http://youdao.com/',\n", + "# 'http://bing.com/']\n", + "\n", + "douban_crawler = MyCrawler('douban.txt')\n", + "\n", + "# Retrieve a single page and report the URL and contents\n", + "def load_url(url):\n", + " global douban_crawler\n", + " return douban_crawler.download(url)\n", + "\n", + "# We can use a with statement to ensure threads are cleaned up promptly\n", + "with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", + " # Start the load operations and mark each future with its URL\n", + " future_to_url = {executor.submit(load_url, url): url for url in urls}\n", + " for future in concurrent.futures.as_completed(future_to_url):\n", + " url = 
future_to_url[future]\n", + " try:\n", + " data = future.result()\n", + " except Exception as exc:\n", + " print('%r generated an exception: %s' % (url, exc))\n", + " else:\n", + " print('%r page is %d bytes' % (url, len(data)))" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T' page is 52753 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20&type=T' page is 52973 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=40&type=T' page is 54058 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=60&type=T' page is 52622 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=80&type=T' page is 52984 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=100&type=T' page is 52683 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=120&type=T' page is 53638 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=140&type=T' page is 54098 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=160&type=T' page is 53460 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=180&type=T' page is 53970 bytes\n", + "Wall time: 2.69 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import concurrent.futures\n", + "\n", + "# URLS = ['http://www.163.com/',\n", + "# 'http://www.sina.com.cn/',\n", + "# 'http://baidu.com/',\n", + "# 'http://youdao.com/',\n", + "# 'http://bing.com/']\n", + "\n", + "for url in urls:\n", + " data = douban_crawler.download(url)\n", + " print('%r page is %d bytes' % (url, len(data)))" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "task 1 step 1\n", + "task 1 step 2\n", + "task 1 step 3\n", + "task 1 completed.\n", + "task 1 return 0.\n", + "task 0 
step 1\n", + "task 0 step 2\n", + "task 0 step 3\n", + "task 0 completed.\n", + "task 0 return 0.\n", + "task 5 step 1\n", + "task 5 step 2\n", + "task 5 step 3\n", + "task 5 completed.\n", + "task 5 return 0.\n", + "task 6 step 1\n", + "task 6 step 2\n", + "task 6 step 3\n", + "task 6 completed.\n", + "task 6 return 0.\n", + "task 7 step 1\n", + "task 7 step 2\n", + "task 7 step 3\n", + "task 7 completed.\n", + "task 7 return 0.\n", + "task 8 step 1\n", + "task 8 step 2\n", + "task 8 step 3\n", + "task 8 completed.\n", + "task 8 return 0.\n", + "task 9 step 1\n", + "task 9 step 2\n", + "task 9 step 3\n", + "task 9 completed.\n", + "task 9 return 0.\n", + "task 4 step 1\n", + "task 4 step 2\n", + "task 4 step 3\n", + "task 4 completed.\n", + "task 4 return 0.\n", + "task 2 step 1\n", + "task 2 step 2\n", + "task 2 step 3\n", + "task 2 completed.\n", + "task 2 return 0.\n", + "task 3 step 1\n", + "task 3 step 2\n", + "task 3 step 3\n", + "task 3 completed.\n", + "task 3 return 0.\n", + "Wall time: 20 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import concurrent.futures\n", + "import time\n", + "\n", + "from threading import Semaphore\n", + "\n", + "my_semaphore = Semaphore()\n", + "\n", + "def do_it(tid):\n", + " result = []\n", + " time.sleep(1)\n", + " result.append(f'task {tid} step 1\\n')\n", + " time.sleep(1)\n", + " result.append(f'task {tid} step 2\\n')\n", + " time.sleep(1)\n", + " result.append(f'task {tid} step 3\\n')\n", + " time.sleep(1)\n", + " result.append(f'task {tid} completed.\\n')\n", + " my_semaphore.acquire()\n", + " print(''.join(result))\n", + " my_semaphore.release()\n", + " return 0\n", + "\n", + "with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", + " # Submit the tasks and map each future to its task ID\n", + " future_to_tid = {executor.submit(do_it, tid): tid for tid in range(10)}\n", + " for future in concurrent.futures.as_completed(future_to_tid):\n", + " tid = future_to_tid[future]\n", + " 
try:\n", + " data = future.result()\n", + " except Exception as exc:\n", + " print('%r generated an exception: %s.\\n' % (tid, exc), end='')\n", + " else:\n", + " print('task %d return %d.\\n' % (tid, data), end='')" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\u001b[1;31mDocstring:\u001b[0m\n", + "print(value, ..., sep=' ', end='\\n', file=sys.stdout, flush=False)\n", + "\n", + "Prints the values to a stream, or to sys.stdout by default.\n", + "Optional keyword arguments:\n", + "file: a file-like object (stream); defaults to the current sys.stdout.\n", + "sep: string inserted between values, default a space.\n", + "end: string appended after the last value, default a newline.\n", + "flush: whether to forcibly flush the stream.\n", + "\u001b[1;31mType:\u001b[0m builtin_function_or_method\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "print?" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1\t2\t31\t2\t31\t2\t3" + ] + } + ], + "source": [ + "print(1,2,3,sep='\\t',end='')\n", + "print(1,2,3,sep='\\t',end='')\n", + "print(1,2,3,sep='\\t',end='')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/code/Python_Class_18.ipynb b/code/Python_Class_18.ipynb new file mode 100644 index 0000000..d546fa3 --- /dev/null +++ b/code/Python_Class_18.ipynb @@ 
-0,0 +1,656 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import re\n", + "\n", + "class MyCrawler:\n", + " def __init__(self, filename):\n", + " self.filename = filename\n", + " self.headers = {\n", + " 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',\n", + " }\n", + " \n", + " def download(self, url):\n", + " r = requests.get(url, headers=self.headers)\n", + " return r.text\n", + " \n", + " def extract(self, content, pattern):\n", + " result = re.findall(pattern, content)\n", + " return result\n", + " \n", + " def save(self, info):\n", + " with open(self.filename, 'a', encoding='utf-8') as f:\n", + " for item in info:\n", + " f.write('|||'.join(item) + '\\n')\n", + " \n", + " def crawl(self, url, pattern, headers=None):\n", + " if headers:\n", + " self.headers.update(headers)\n", + " content = self.download(url)\n", + " info = self.extract(content, pattern)\n", + " self.save(info)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "url = 'https://book.douban.com/tag/?view=type'\n", + "content = douban_crawler.download(url)\n", + "tree = html.fromstring(content)\n", + "tags = tree.xpath(\"//td/a/text()\")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'%E5%B0%8F%E8%AF%B4'" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urllib.parse.quote(tags[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Current tag: 小说\n", + "https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T\n", + "Last Start ID: 7600\n", + "活着\n", + "房思琪的初恋乐园\n", + "白夜行\n", + "解忧杂货店\n", + "红楼梦\n", 
+ "追风筝的人\n", + "百年孤独\n", + "小王子\n", + "围城\n", + "平凡的世界(全三部)\n", + "嫌疑人X的献身\n", + "霍乱时期的爱情\n", + "1984\n", + "飘\n", + "月亮与六便士\n", + "三体: "地球往事"三部曲之一\n", + "三体全集: 地球往事三部曲\n", + "局外人\n", + "杀死一只知更鸟\n", + "骆驼祥子\n", + "------------------------------------\n", + "Current tag: 外国文学\n", + "https://book.douban.com/tag/%E5%A4%96%E5%9B%BD%E6%96%87%E5%AD%A6?start=0&type=T\n", + "Last Start ID: 7640\n", + "小王子\n", + "追风筝的人\n", + "百年孤独\n", + "飘\n", + "1984\n", + "霍乱时期的爱情\n", + "月亮与六便士\n", + "月亮和六便士\n", + "杀死一只知更鸟\n", + "傲慢与偏见\n", + "局外人\n", + "动物农场\n", + "安徒生童话故事集\n", + "简爱(英文全本)\n", + "老人与海\n", + "基督山伯爵\n", + "哈利•波特\n", + "一个陌生女人的来信\n", + "牧羊少年奇幻之旅\n", + "肖申克的救赎\n", + "------------------------------------\n", + "Current tag: 文学\n", + "https://book.douban.com/tag/%E6%96%87%E5%AD%A6?start=0&type=T\n", + "Last Start ID: 7640\n", + "你当像鸟飞往你的山\n", + "房思琪的初恋乐园\n", + "小王子\n", + "红楼梦\n", + "百年孤独\n", + "追风筝的人\n", + "围城\n", + "活着\n", + "平凡的世界(全三部)\n", + "解忧杂货店\n", + "撒哈拉的故事\n", + "霍乱时期的爱情\n", + "月亮和六便士\n", + "1984\n", + "边城\n", + "局外人\n", + "许三观卖血记\n", + "白鹿原: 20周年精装典藏版\n", + "沉默的大多数: 王小波杂文随笔全编\n", + "云边有个小卖部\n", + "------------------------------------\n", + "Current tag: 经典\n", + "https://book.douban.com/tag/%E7%BB%8F%E5%85%B8?start=0&type=T\n", + "Last Start ID: 7820\n", + "活着\n", + "小王子\n", + "红楼梦\n", + "百年孤独\n", + "围城\n", + "飘\n", + "平凡的世界(全三部)\n", + "三体全集: 地球往事三部曲\n", + "骆驼祥子\n", + "月亮与六便士\n", + "哈利•波特\n", + "杀死一只知更鸟\n", + "霍乱时期的爱情\n", + "傲慢与偏见\n", + "1984\n", + "追风筝的人\n", + "边城\n", + "安徒生童话故事集\n", + "围城\n", + "白鹿原: 20周年精装典藏版\n", + "------------------------------------\n", + "Current tag: 中国文学\n", + "https://book.douban.com/tag/%E4%B8%AD%E5%9B%BD%E6%96%87%E5%AD%A6?start=0&type=T\n", + "Last Start ID: 7720\n", + "活着\n", + "围城\n", + "平凡的世界(全三部)\n", + "骆驼祥子\n", + "边城\n", + "城南旧事: 纪念普及版\n", + "明朝那些事儿(1-9): 限量版\n", + "撒哈拉的故事\n", + "红楼梦\n", + "白鹿原: 20周年精装典藏版\n", + "许三观卖血记\n", + "三体全集: 地球往事三部曲\n", + "呐喊\n", + "房思琪的初恋乐园\n", + "平凡的世界\n", + "围城\n", + "沉默的大多数: 王小波杂文随笔全编\n", + "许三观卖血记\n", + 
"朝花夕拾\n", + "人生海海\n", + "------------------------------------\n" + ] + } + ], + "source": [ + "import re\n", + "import time\n", + "import requests\n", + "from lxml import html\n", + "import urllib.parse\n", + "\n", + "douban_crawler = MyCrawler('douban.txt')\n", + "\n", + "tag_list_url = 'https://book.douban.com/tag/?view=type'\n", + "tag_content = douban_crawler.download(tag_list_url)\n", + "tag_tree = html.fromstring(tag_content)\n", + "tags = tag_tree.xpath(\"//td/a/text()\")\n", + "for tag in tags[:5]:\n", + " print('Current tag:', tag)\n", + " tag = urllib.parse.quote(tag)\n", + " page_id = 1\n", + " last_start = 0\n", + " while 1:\n", + " start_id = 20 * (page_id - 1)\n", + " url = 'https://book.douban.com/tag/{}?start={}&type=T'.format(tag, start_id)\n", + " print(url)\n", + " content = douban_crawler.download(url)\n", + " tree = html.fromstring(content)\n", + " if page_id == 1:\n", + " page_links = tree.xpath(\"//div[@class='paginator']/a[last()]/@href\")\n", + " if page_links:\n", + " last_start = int(re.findall('start=(\\d+)', page_links[0])[0])\n", + " print('Last Start ID: ', last_start)\n", + " book_infos = tree.xpath(\"//li[@class='subject-item']\")\n", + " for book_info in book_infos:\n", + " book_name_elem = book_info.xpath('.//h2/a')[0]\n", + " book_name = re.sub('\\s{2,}', '', book_name_elem.text_content().replace('\\n', ''))\n", + " book_url = book_name_elem.attrib['href']\n", + " book_pub_info = book_info.xpath(\".//div[@class='pub']\")[0].text.strip()\n", + " book_intro = 'N/A'\n", + " book_intro_elem = book_info.xpath(\".//div[@class='info']/p\")\n", + " if book_intro_elem:\n", + " book_intro = book_intro_elem[0].text.strip()\n", + " print(book_name)\n", + " page_id += 1\n", + " if start_id == last_start:\n", + " break\n", + " print('------------------------------------')\n", + " break\n", + " time.sleep(1)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "urls = 
[f'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start={start_id}&type=T' for start_id in range(0, 200, 20)]" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=40&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=60&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=80&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=100&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=120&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=140&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=160&type=T',\n", + " 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=180&type=T']" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "urls" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=40&type=T' page is 54058 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=80&type=T' page is 52984 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20&type=T' page is 52973 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T' page is 52753 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=60&type=T' page is 52622 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=120&type=T' page is 53638 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=100&type=T' page is 52683 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=140&type=T' page is 54098 bytes\n", + 
"'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=180&type=T' page is 53970 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=160&type=T' page is 53460 bytes\n", + "Wall time: 1.11 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import concurrent.futures\n", + "import requests\n", + "\n", + "# URLS = ['http://www.163.com/',\n", + "# 'http://www.sina.com.cn/',\n", + "# 'http://baidu.com/',\n", + "# 'http://youdao.com/',\n", + "# 'http://bing.com/']\n", + "\n", + "douban_crawler = MyCrawler('douban.txt')\n", + "\n", + "# Retrieve a single page and report the URL and contents\n", + "def load_url(url):\n", + " global douban_crawler\n", + " return douban_crawler.download(url)\n", + "\n", + "# We can use a with statement to ensure threads are cleaned up promptly\n", + "with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", + " # Start the load operations and mark each future with its URL\n", + " future_to_url = {executor.submit(load_url, url): url for url in urls}\n", + " for future in concurrent.futures.as_completed(future_to_url):\n", + " url = future_to_url[future]\n", + " try:\n", + " data = future.result()\n", + " except Exception as exc:\n", + " print('%r generated an exception: %s' % (url, exc))\n", + " else:\n", + " print('%r page is %d bytes' % (url, len(data)))" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T' page is 52753 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20&type=T' page is 52973 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=40&type=T' page is 54058 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=60&type=T' page is 52622 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=80&type=T' page is 52984 bytes\n", + 
"'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=100&type=T' page is 52683 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=120&type=T' page is 53638 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=140&type=T' page is 54098 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=160&type=T' page is 53460 bytes\n", + "'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=180&type=T' page is 53970 bytes\n", + "Wall time: 2.69 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import concurrent.futures\n", + "\n", + "# URLS = ['http://www.163.com/',\n", + "# 'http://www.sina.com.cn/',\n", + "# 'http://baidu.com/',\n", + "# 'http://youdao.com/',\n", + "# 'http://bing.com/']\n", + "\n", + "for url in urls:\n", + " data = douban_crawler.download(url)\n", + " print('%r page is %d bytes' % (url, len(data)))" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "task 1 step 1\n", + "task 1 step 2\n", + "task 1 step 3\n", + "task 1 completed.\n", + "task 1 return 0.\n", + "task 0 step 1\n", + "task 0 step 2\n", + "task 0 step 3\n", + "task 0 completed.\n", + "task 0 return 0.\n", + "task 5 step 1\n", + "task 5 step 2\n", + "task 5 step 3\n", + "task 5 completed.\n", + "task 5 return 0.\n", + "task 6 step 1\n", + "task 6 step 2\n", + "task 6 step 3\n", + "task 6 completed.\n", + "task 6 return 0.\n", + "task 7 step 1\n", + "task 7 step 2\n", + "task 7 step 3\n", + "task 7 completed.\n", + "task 7 return 0.\n", + "task 8 step 1\n", + "task 8 step 2\n", + "task 8 step 3\n", + "task 8 completed.\n", + "task 8 return 0.\n", + "task 9 step 1\n", + "task 9 step 2\n", + "task 9 step 3\n", + "task 9 completed.\n", + "task 9 return 0.\n", + "task 4 step 1\n", + "task 4 step 2\n", + "task 4 step 3\n", + "task 4 completed.\n", + "task 4 return 0.\n", + "task 2 step 1\n", + "task 2 step 2\n", + "task 2 step 3\n", + "task 2 
completed.\n", + "task 2 return 0.\n", + "task 3 step 1\n", + "task 3 step 2\n", + "task 3 step 3\n", + "task 3 completed.\n", + "task 3 return 0.\n", + "Wall time: 20 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import concurrent.futures\n", + "import time\n", + "\n", + "from threading import Semaphore\n", + "\n", + "my_semaphore = Semaphore()\n", + "\n", + "def do_it(tid):\n", + " result = []\n", + " time.sleep(1)\n", + " result.append(f'task {tid} step 1\\n')\n", + " time.sleep(1)\n", + " result.append(f'task {tid} step 2\\n')\n", + " time.sleep(1)\n", + " result.append(f'task {tid} step 3\\n')\n", + " time.sleep(1)\n", + " result.append(f'task {tid} completed.\\n')\n", + " my_semaphore.acquire()\n", + " print(''.join(result))\n", + " my_semaphore.release()\n", + " return 0\n", + "\n", + "with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", + " # Start the load operations and mark each future with its URL\n", + " future_to_tid = {executor.submit(do_it, tid): tid for tid in range(10)}\n", + " for future in concurrent.futures.as_completed(future_to_tid):\n", + " tid = future_to_tid[future]\n", + " try:\n", + " data = future.result()\n", + " except Exception as exc:\n", + " print('%r generated an exception: %s.\\n' % (tid, exc), end='')\n", + " else:\n", + " print('task %d return %d.\\n' % (tid, data), end='')" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\u001b[1;31mDocstring:\u001b[0m\n", + "print(value, ..., sep=' ', end='\\n', file=sys.stdout, flush=False)\n", + "\n", + "Prints the values to a stream, or to sys.stdout by default.\n", + "Optional keyword arguments:\n", + "file: a file-like object (stream); defaults to the current sys.stdout.\n", + "sep: string inserted between values, default a space.\n", + "end: string appended after the last value, default a newline.\n", + "flush: whether to forcibly flush the stream.\n", + 
"\u001b[1;31mType:\u001b[0m builtin_function_or_method\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "print?" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1\t2\t31\t2\t31\t2\t3" + ] + } + ], + "source": [ + "print(1,2,3,sep='\\t',end='')\n", + "print(1,2,3,sep='\\t',end='')\n", + "print(1,2,3,sep='\\t',end='')" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "import concurrent.futures\n", + "import time\n", + "from threading import Semaphore" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T processed.\n", + "https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T processed.\n", + "task 1 return 1.\n", + "task 4 return 1.\n", + "task 3 return 1.\n", + "task 2 return 1.\n", + "task 0 return 1.\n" + ] + } + ], + "source": [ + "url_queue = ['https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T'] * 2\n", + "result_book_info = []\n", + "working_parser_num = 0\n", + "my_semaphore = Semaphore()\n", + "\n", + "def worker(num):\n", + " round_num = 0\n", + " global url_queue, working_parser_num, my_semaphore\n", + " while True:\n", + " url = None\n", + " my_semaphore.acquire()\n", + " if url_queue:\n", + " url = url_queue.pop()\n", + " my_semaphore.release()\n", + " if url:\n", + " working_parser_num += 1\n", + " parser(url)\n", + " working_parser_num -= 1\n", + " print(f\"{url} processed.\\n\", end='')\n", + " elif working_parser_num == 0 and round_num> 0:\n", + " break\n", + " else:\n", + " time.sleep(1)\n", + " round_num += 1\n", + " return 1\n", + "\n", + "def parser(url):\n", + " time.sleep(2)\n", + "\n", + "THREAD_NUM = 5\n", + "with 
concurrent.futures.ThreadPoolExecutor(max_workers=THREAD_NUM) as executor:\n", + " # Start the load operations and mark each future with its URL\n", + " future_to_tid = {executor.submit(worker, tid): tid for tid in range(THREAD_NUM)}\n", + " for future in concurrent.futures.as_completed(future_to_tid):\n", + " tid = future_to_tid[future]\n", + " try:\n", + " data = future.result()\n", + " except Exception as exc:\n", + " print('%r generated an exception: %s.\\n' % (tid, exc), end='')\n", + " else:\n", + " print('task %d return %d.\\n' % (tid, data), end='')" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "worker(1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/images/class_12_001.jpg b/images/class_12_001.jpg new file mode 100644 index 0000000..d78b170 Binary files /dev/null and b/images/class_12_001.jpg differ diff --git a/questions/question_011.md b/questions/question_011.md new file mode 100644 index 0000000..a429cce --- /dev/null +++ b/questions/question_011.md @@ -0,0 +1,39 @@ +| Lecture 11 Q&A: Question List | +|--------------| +|What advanced applications do crawlers have?| +|How can I tell whether a website is an "easy target" for a crawler?| +|Is there a forward maximum matching method, and why don't we use it?| +|What is the difference between pyquery and jQuery?| +|Is writing a crawler hard?| +|How should images and videos be crawled?| +|Do I need to memorize regex usage, or is it enough to look it up when needed and then apply it correctly?| +|What measures do websites typically take to block crawlers from collecting data?| +|I still don't quite understand the process after finding a website...| +|Besides a high-frequency word list, what other external data tables could be applied to Morse code decoding?| +|Are there rules that must be followed when crawling web pages?|
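+|In the first recursion, why is the minimum value of start morse_seg-max_morse_len?| +|Can every website be crawled for data? What are the headers in requests used for?| +|Can the get function fetch only part of the content instead of all of it?| +|How can one crawl someone else's password-protected Baidu Netdisk share link without knowing the password, and then obtain the password by enumeration?| +|Besides [\S\s], what other common expressions can match any character in a regex?| +|What range of data can a crawler collect?| +|Why are some websites relatively "soft", and how can they be identified?| +|I don't quite understand the difference between get and post.| +|Could you cover visualization when talking about Python crawlers? The charts online all look great, but I can't produce them.| +|If the target site has multiple pages, can the crawler move between pages automatically? And for some pages the next-page link cannot be found in the html; how should that be handled?| +|I'd like to learn crawling systematically; are there any courses you would recommend?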
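Also, do the methods from this lecture still work on websites with strong anti-crawler defenses?| +|What do max(len(morse_seg) and max(moorse_len) each mean, and why is the latter subtracted from the former?| +|I'd like to learn how to crawl images.| +|The lectures are dense with practical content, but there is a lot of it; what should I do if I find it hard to actually master?| +|It seems we can only crawl content from a single page, a bit like inspecting elements or viewing the source; is it possible to crawl more while also reducing the time taken?| +|Backward maximum matching may fail because of symbols or certain short words; could forward matching and other methods be combined with it to reach the correct result?| +|Could replacement efficiency be improved by working on the dictionary file?| +|How do you write anti-crawler programs?| +|When a regex is used in python, are character groups separated out automatically?| +|What conditions must a website meet for a crawler to be able to get information from it?| +|What happens if a backslash is added before the slash in span? I replayed it many times and couldn't catch it... Should windows users change cat to more, or is it a different word that sounds the same?| +|How can a crawler deal with CAPTCHAs?| +|Finding the needed information on a page is still tedious; to find a price, do I have to read through the source myself? Is there an easier way?| +|How do I tell which kinds of websites are "easy targets"?| +|What exactly are crawlers used for?| +|How do I change where newly created files are saved?| \ No newline at end of file diff --git a/questions/question_012.md b/questions/question_012.md new file mode 100644 index 0000000..4390e1e --- /dev/null +++ b/questions/question_012.md @@ -0,0 +1,36 @@ +| Lecture 12 Q&A: Question List | +|--------------| +|Can a crawler collect and store videos?| +|Are headers mandatory?| +|Why do most regular expressions only match English, and what is the most standard and complete way to match Chinese?| +|Is it possible to build a crawler that uses different matching rules for different situations?| +|What is the difference between the utf-8 and gbk encodings?| +|Why is the Douban URL escaped twice? Does escaping it twice point to a different page?| +|What is the Windows equivalent of !rm, and where can I find other such commands?| +|In that regex, the name can be shown on its own and so can the price, but when the two are combined with [/S/s]*? the resulting list doesn't show both; how can this be fixed?| +|Can a crawler fetch any website? Do some sites have protection mechanisms (such as passwords), and if so, what should be done?| +|My URL is different: the URL for the name has several slashes highlighted in red; do all of them need to be escaped with backslashes?| +|How can a crawler collect information when the ranking list is dynamic? Also, what exactly is a dynamic ranking list?| +|Besides building websites, what else can Python frameworks do?| +|Many websites have anti-crawler measures; when we can't get past them, what else can be used to read their data?| +|How should page content be captured when not all of it follows a regular pattern?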
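How will the final exam be assessed?| +|Regarding the flow of pattern: is it that you supply pattern to crawl, and it is then passed on to extract via info = self.extract(content,pattern)?| +|Right-clicking on the Terminal page shows no refresh option; is that a browser issue? Do I have to reopen it every time?| +|Is Python the best choice for crawlers?| +|When initializing the MyCrawler class, I can't initialize it directly the way you wrote it: MyCrawler accepts no arguments, and I can only initialize it by calling _init_ separately. Why is that?| +|What does "cat is not recognized as an internal or external command, operable program or batch file" mean, and how do I fix it?| +|Some pages load new data automatically when scrolled to the bottom, or have a "load more" button; how can data be collected from such pages?| +|Is the later crawling of bilibili and Douban data built on the original class? Was the encapsulated class cell actually run?| +|When looking for the Baidu Translate URL I found many URLs; how can I efficiently find the one that actually does the work?| +|Can the videos and images on a web page be crawled?| +|Will the rest of the course all be about crawlers? Also, in Lesson 0, why can't I open a terminal in my jupyter notebook?| +|Where did you search for "跃入人海" in the bilibili page source?| +|Do crawlers carry legal risk?| +|Why try adding ¦ to prevent garbled text?| +|The "综合评分" (overall score) in the bilibili ranking isn't shown directly in the source; how can it be scraped?| +|How do you crawl a website that requires an account and password?| +|I don't quite understand how to apply and convert regular expressions.| +|By definition, can crawlers be written in other languages as well?| +|Without cookies, some sites such as Baidu seem inaccessible; it doesn't seem to be about being logged in, since a cookie is assigned at random after entering the home page.| +|How does one learn to design a framework? Is a framework's design closely tied to its business domain? Is the essence of a framework the high degree of repetition within a class of tasks? Is it because Web development is itself fairly uniform that general-purpose frameworks such as Flask and Django could emerge?| +|What should be done with websites that require login or dynamic verification?| \ No newline at end of file diff --git a/questions/question_013.md b/questions/question_013.md new file mode 100644 index 0000000..3e49f48 --- /dev/null +++ b/questions/question_013.md @@ -0,0 +1,20 @@ +| Lecture 13 Q&A: Question List | +|--------------| +|Following along with the video looks easy, but actual debugging is hard.| +|How can content be crawled from pages that are visible only after login?| +|How can stock information be collected and then turned into charts for analysis?| +|How can crawled data be visualized, for example as images plus text or as charts?| +|How should I understand that using the index subscript does not affect the object being iterated over?| +|What are some real-world examples of crawlers at work?| +|What does "customizing" a crawler look like? Is it usually done by overriding methods in a subclass?| +|Besides regular expressions, are there other ways to parse data in python web crawlers?| +|Some sites show more than one doc file in the network panel; which one should be chosen?| +|Can information extracted by a crawler be output as charts, such as bar charts or pie charts? Does Python have functions for this?| +|What does the architecture of a crawler for batch crawling look like, for example crawling the reviews under multiple Taobao sellers?| +|After N lectures the crawler code has become richer in features and better encapsulated, but how far is it from commercial-grade crawler code?| +|How can split-up text (such as Baidu Wenku or papers on foreign sites) be crawled more quickly?| +|re.findall() returns a list, so why do tuples appear for the ratings in the Douban crawler part?| +|How can a crawler get content on one specific topic? By adding keywords directly, or by fetching everything and then filtering?| +|Are headers a kind of http request?| +|Why does garbled text appear?| +|I don't quite understand the step after the last error in the video, where items was turned into one big list and the argument to extend became items[0].| \ No newline at end of file diff --git a/questions/question_014.md b/questions/question_014.md new file mode 100644 index 0000000..c888cd0 --- /dev/null +++ b/questions/question_014.md @@ -0,0 +1,36 @@ +| Lecture 14 Q&A: Question List | +|--------------| +|Could you explain path access in more detail?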
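Beyond what the video covered, could you introduce some other operations?| +|Can xpath completely replace regular expressions?| +|How far can information extraction in a "big crawler" be optimized? Surely one doesn't have to write a separate expression for every URL?| +|After collecting an image's URL, how do I display the image? Clicking it directly says the server refused the request.| +|When finding the url in XPath helper at the end (//h2/a/@href), why is the @ needed?| +|What is the difference between absolute and relative paths in Xpath, or their pros and cons?| +|Does this dom tree correspond to a tree in c++? What is the difference between them?| +|How do I decide when to use xpath and when to use regular expressions for locating content?| +|How should an Xpath path be chosen?| +|Is there a reasonably reliable community where I can ask about code I don't understand or learn usage? (Results on Baidu sometimes feel messy and not necessarily correct...)| +|The official download channel for xpath helper requires a vpn, and installs from other channels fail.| +|How many node types does XPath have in total?| +|I don't quite understand why utf-8 decoding is needed before using lxml.| +|I'm still unclear on the difference between two slashes and one slash after the book.info function.| +|What is the difference between absolute and relative paths in Xpath, and how should we use them?| +|Which is better, XPath or regular expressions? Or what are the pros and cons of each? Personally I feel XPath is more efficient at getting information.| +|Is it possible to crawl the contents of, say, an Excel spreadsheet? (It holds a lot of data and I'd like to extract what I need with something like regex.) But the spreadsheet has no URL, so I don't know how to proceed.| +|What is the relationship between xmldom and xpath?| +|On the Douban book tags page, why does holding shift and moving the mouse around show the XPATH of that node?| +|How do I install that package in pycharm?| +|In what areas of everyday life can crawlers be applied?| +|I don't quite understand why [0] is needed after book_info.xpath(...).| +|I can't quite figure out how to use xpath helper...| +|Can dom and xpath be used together?| +|How do I decide whether a crawler should use DOM tree operations or regex? What clear advantages does each have over the other?| +|For book titles in two parts, the second part was never captured; and if a page has a document preview, how can the text inside be crawled? (Any plugin recommendations?)| +|When extracting page data with xpath, what is the difference between //text() and /text()?| +|xpath is convenient on pages already rendered by the browser; can xpath expressions also be written easily for searching ordinary strings?| +|What is the difference between the tree structures that beautifulsoup and lxml build for a web page?| +|Can python produce a polished web page the way html and css can?| +|I feel I have vague doubts everywhere but can't put them into words.| +|What exactly is a model? (I've read many explanations and still don't quite get it.) How does one learn to distill a model? What is the relationship between models and abstraction? Mathematics has mathematical models and computer graphics has 3d models; what kind of model is the Document Object Model?| +|Does the DOM tree serve any special purpose?| +|Can the xpath be read off easily from the source, for example from its div, li and so on? I ask mainly because I can't download xpath helper.| \ No newline at end of file diff --git a/questions/question_015.md b/questions/question_015.md new file mode 100644 index 0000000..4233519 --- /dev/null +++ b/questions/question_015.md @@ -0,0 +1,29 @@ +| Lecture 15 Q&A: Question List | +|--------------| +|Can a program visit a music website by itself and play the music there?| +|Does "url" have to be added to "encoding" for the search to find it?| +|During crawling, could a book be made into a class, with each piece of collected information stored in a book object so that books can then be classified, sorted and so on?| +|Can pages be traversed from the last page back to the first?| +|Why doesn't the urllib library provide a urldecode function?| +|What is a good way to store crawled data? As txt or some other file type?| +|Why does the page parameter in the URL follow the pattern start=0, 20, 40, i.e. start=20(page_id-1), rather than 0, 1, 2? Does it have a special meaning?| +|Are search engines also based on crawlers? Why is my crawler slow while search engines return results in an instant?| +|How can collected numeric data be turned into a table or a chart?| +|Are all strings encoded from Chinese identical to the strings in website urls?| +|Can regular expressions also be used to collect information across pages?| +|When reading all books under the neural network tag and finding the information incomplete, how did you think of another way to get the data?| +|Can a crawler extract content when the page redirects?|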
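+|I've collected a lot of data but it is messy; is there a quick way to organize it into Excel?| +|Does las have other uses?| +|A crawler essentially performs searching plus copy-and-paste; could one simply mirror the site directly to improve speed?| +|When paging, can data from several pages be fetched at once? That is, enter several page numbers and have the data for those pages appear together?| +|In current police practice, where is the line for a crawler being illegal? A crawler is essentially just faster hands, so why would it be illegal?| +|How do you crawl dynamically changing pages?| +|If the redirected page contains further links, can the crawler read the content behind them?| +|How do you crawl across pages on a site whose url doesn't change?| +|How can a simple search engine be built with a crawler?| +|Other uses of encode.| +|Can a url be connected to other backend links, as in mini programs?| +|Do all paginated pages have start=?| +|What if the site doesn't give a last page when the crawler is paging? Then there's no way to decide when to stop.| +|Is this the last lecture? Will your python elective be offered next semester?| \ No newline at end of file diff --git a/questions/question_016.md b/questions/question_016.md new file mode 100644 index 0000000..1656366 --- /dev/null +++ b/questions/question_016.md @@ -0,0 +1,13 @@ +| Lecture 16 Q&A: Question List | +|--------------| +|What is the python GIL, which makes thread pools suited to IO-bound workloads and process pools suited to CPU-bound workloads?| +|What are the practical applications of multiprocessing?| +|My crawler could fetch data earlier today, but after a while it stopped collecting anything; have I been detected?| +|I still don't quite understand why multithreading speeds things up; doesn't the CPU execute only one instruction at a time?| +|How should collection rules be set when a second-level list has many pages?| +|It is used on demand; so once it is used up and empty, does it still occupy memory? Is it like dynamic allocation?| +|Does Douban limit the amount of information a crawler can fetch?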
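I get at most 400 items.| +|When taking the union of two sets the contents were indeed deduplicated, so why does the length check still show the two lengths simply added together?| +|Compared with the original approach, can multithreading be understood as trading memory for time?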
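What determines the upper limit of multithreaded collection?| +|Is the number of processes in a process pool limited? How can the optimal number of processes for a crawler be determined, or what factors does it depend on, and how can it be tested?| +|One concern: tags are independent of each other, so multithreaded fetch-extract-store is fine across tags. Within a single tag, however, books are ranked, and with multithreading content from later pages may be written to the file before content from earlier pages, disrupting the original order.| \ No newline at end of file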
