Writing a Web Crawler in Python

How to write a Python crawler that accepts a keyword and crawls web pages containing content related to that keyword.

Tools/Materials

  • ThinkBook 15
  • Microsoft Windows 10.0.19043.1083
  • PyCharm 2019.2.3

Method/Steps

  1. Create the project and set the project's storage location.
  2. Install the requests module.
  3. Create a .py file.
  4. Write the basic crawler framework code.
  5. Open Baidu in the Microsoft Edge browser and search for a keyword.
  6. Right-click on the results page and choose "Inspect" from the menu to open the browser's built-in developer tools.
  7. In the developer tools, select the "Network" tab.
  8. Press Ctrl+R to refresh the page.
  9. Find the captured request whose name matches the request domain.
  10. In that request's "Headers" tab, find "Query String Parameters" and copy its contents.
  11. Wrap the copied query-string parameters into a dictionary in the code, and pass it to get() via the params argument.
  12. Adjust the target URL. Looking at the request the browser captured, the latter part of the request URL is exactly the query-string parameters found earlier, so the URL in the code can stop at the path and let params supply the rest.
  13. Run the code. It runs successfully and generates a new file.
  14. Open the file and check it: it matches the page found earlier via the browser search, which shows the crawl succeeded.
  15. Notice that in the query-string parameters, the value after wd is the keyword that was entered, so make that parameter dynamic in the code.
  16. Run the code, type in a keyword, and it completes successfully.
  17. Open the baidu.html file: the search content related to the entered keyword has been crawled successfully.
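The steps above converge on one short script (install the dependency first with `pip install requests`). A minimal sketch, assuming Baidu's search endpoint `https://www.baidu.com/s` with the `wd` query parameter (the parameter captured in the Network tab); the browser-style User-Agent header is an addition not shown in the steps, since Baidu may otherwise return a verification page instead of results:

```python
import requests


def fetch_baidu(keyword: str) -> str:
    """Fetch the Baidu results page for the given keyword."""
    # The path-only URL; requests appends the query string from `params`.
    url = "https://www.baidu.com/s"
    # Browser-like User-Agent (assumption: without it, Baidu may block
    # the request or serve a verification page).
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    # The query-string parameters copied from the Headers tab, as a dict;
    # "wd" is the dynamic keyword parameter identified in step 15.
    params = {"wd": keyword}
    resp = requests.get(url, params=params, headers=headers, timeout=10)
    resp.encoding = "utf-8"
    return resp.text


if __name__ == "__main__":
    kw = input("Enter a keyword: ")
    html = fetch_baidu(kw)
    # Save the response so it can be opened in a browser (steps 13-14).
    with open("baidu.html", "w", encoding="utf-8") as f:
        f.write(html)
    print(f"Saved {len(html)} characters to baidu.html")
```

Passing the parameters through `params` (rather than hand-building the URL) lets requests handle URL-encoding of non-ASCII keywords automatically.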

Notes

  • Different browsers' developer tools have different interfaces; adapt the steps to your specific browser.
  • Published 2021-08-28 13:01
