Scrape the questions and answers under Zhihu topics with Python

hectorhua/zhihu_topic

Zhihu Topic Scraper

Starting from a topic's entry page, scrape all questions and answers under that topic.

Data storage

MySQL, with three tables: topic, question, answer

Data scraping

requests/xpath/re

Configuration

macOS

cookie

Decrypt the Chrome cookie file
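
The README does not say how the repository decrypts the cookie file. As a rough sketch, the scheme commonly documented for Chrome on macOS derives a 16-byte AES key from the "Chrome Safe Storage" Keychain password using PBKDF2-HMAC-SHA1 with the salt `saltysalt` and 1003 iterations. The function names below are hypothetical and are not taken from this repository. Only the key derivation and prefix handling are shown, because the final AES-128-CBC step needs a third-party crypto library.

```python
import hashlib
import subprocess

# Parameters commonly documented for Chrome cookie encryption on macOS.
SALT = b"saltysalt"
ITERATIONS = 1003
KEY_LENGTH = 16  # bytes, i.e. AES-128


def read_safe_storage_password():
    """Ask the macOS Keychain for the "Chrome Safe Storage" password.

    Only works on macOS and may trigger a Keychain permission prompt.
    """
    out = subprocess.check_output(
        ["security", "find-generic-password", "-w", "-s", "Chrome Safe Storage"]
    )
    return out.decode("utf-8").strip()


def derive_cookie_key(password):
    """Derive the AES key from the Keychain password (hypothetical helper)."""
    return hashlib.pbkdf2_hmac(
        "sha1", password.encode("utf-8"), SALT, ITERATIONS, KEY_LENGTH
    )


def strip_version_prefix(encrypted_value):
    """Drop the 3-byte b"v10" tag that precedes the ciphertext, if present."""
    if encrypted_value[:3] == b"v10":
        return encrypted_value[3:]
    return encrypted_value


# Remaining step, not shown: decrypt the stripped value with AES-128-CBC,
# using the derived key and an IV of 16 space bytes (b" " * 16).
```

On a Mac, `derive_cookie_key(read_safe_storage_password())` would yield the key; the encrypted values themselves live in the `encrypted_value` column of Chrome's `Cookies` SQLite file.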

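Changes

The Zhihu API for fetching answers has changed. Previously it was a GET request that could be accessed freely:
https://www.zhihu.com/api/v4/questions/{}/answers?sort_by=default&include={}&limit=20&offset={}
Now it is a POST request:
https://www.zhihu.com/node/QuestionAnswerListV2
Form Data: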

  • method:next
  • params:{"url_token":36535039,"pagesize":10,"offset":30}
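
The two request shapes above can be sketched as follows. This is a minimal illustration, not the repository's code: the helper names are hypothetical, and reading the three `{}` placeholders of the old URL as question ID, `include` value, and offset is an assumption based on their position in the URL.

```python
import json

OLD_ANSWERS_URL = (
    "https://www.zhihu.com/api/v4/questions/{}/answers"
    "?sort_by=default&include={}&limit=20&offset={}"
)
NEW_ANSWERS_URL = "https://www.zhihu.com/node/QuestionAnswerListV2"


def old_get_url(question_id, include, offset):
    """Fill the three placeholders of the old GET endpoint (assumed meaning)."""
    return OLD_ANSWERS_URL.format(question_id, include, offset)


def new_post_form(url_token, offset, pagesize=10):
    """Build the form data for the new POST endpoint.

    `params` is sent as a compact JSON string, matching the form data above.
    """
    params = {"url_token": url_token, "pagesize": pagesize, "offset": offset}
    return {
        "method": "next",
        "params": json.dumps(params, separators=(",", ":")),
    }


form = new_post_form(36535039, 30)
print(form["params"])  # {"url_token":36535039,"pagesize":10,"offset":30}
```

With the `requests` library, the form could then be sent as `requests.post(NEW_ANSWERS_URL, data=form, cookies=...)`, passing the logged-in cookies obtained during configuration.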

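The response format changed from JSON to HTML, so the results now need further XPath parsing.
Data for several topics was fully scraped before the change and will be uploaded to Baidu Cloud later.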
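
To illustrate the extra parsing step that an HTML response requires, here is a small sketch using `lxml` XPath queries plus `re` for whitespace cleanup. The sample fragment, its class names, and the `parse_answers` helper are all hypothetical: the README does not document the markup the endpoint returns, so the XPath expressions would need to be adapted to the real response.

```python
import re

from lxml import html

# Hypothetical fragment; the real markup is not documented in this README.
SAMPLE = """
<div>
  <div class="answer-item" data-id="101">
    <a class="author">alice</a>
    <div class="content">First answer</div>
  </div>
  <div class="answer-item" data-id="102">
    <a class="author">bob</a>
    <div class="content">Another   answer</div>
  </div>
</div>
"""


def clean(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def parse_answers(fragment):
    """Extract id, author, and text from each (hypothetical) answer block."""
    root = html.fromstring(fragment)
    answers = []
    for item in root.xpath('.//div[@class="answer-item"]'):
        answers.append({
            "id": item.get("data-id"),
            "author": clean(item.xpath('string(.//a[@class="author"])')),
            "content": clean(item.xpath('string(.//div[@class="content"])')),
        })
    return answers


for answer in parse_answers(SAMPLE):
    print(answer)
```

Using `string(...)` in the XPath returns the concatenated text of an element, which keeps the extraction short when answer bodies contain nested tags.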

Scraping results

topic records: 30
question records: 8868
answer records: 3145338
Link: https://pan.baidu.com/s/1slW6cSt Password: 5fs4

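Recommendations

Zhihu bans accounts that it detects scraping, so use a secondary (throwaway) account rather than your main one.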

Examples

topic save-to-screen

question save-to-screen

answer save-to-screen
