Toutiao Q&A scraping script (Python)
Collecting the list of questions for a keyword
import json
import random
import urllib.parse

import requests
from bs4 import BeautifulSoup  # imported by the original script; not used in this snippet

# Mobile browser User-Agent strings; one is picked at random per run.
agents = [
    'Mozilla/5.0 (iPhone; CPU iPhone OS 12_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 EdgiOS/44.5.0.10 Mobile/15E148 Safari/604.1',
    'Mozilla/5.0 (Linux; Android 9; MI 8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Mobile Safari/537.36 EdgA/45.9.4.5122'
]
agent = random.choice(agents)
print(agent)

headers = {
    'Host': 'so.toutiao.com',
    'User-Agent': agent
}


def search_questions(keyword):
    # Percent-encode the keyword so that spaces, '&' and non-ASCII text are safe in the URL.
    keyword = urllib.parse.quote(keyword)
    url = 'https://so.toutiao.com/search?keyword={0}&pd=question'.format(keyword)
    # The timeout keeps the call from hanging indefinitely if the server does not respond.
    rsp = requests.get(url=url, headers=headers, timeout=10)
    return rsp.text
Usage:
search_questions('网赚')
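Collecting answers:
Starting from the list of questions obtained above, fetch each question page in turn; the principle is the same.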
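As a rough sketch of that second step, the code below pulls candidate question links out of the search-result HTML and fetches each one. The original post does not describe the page structure, so several things here are assumptions: that question links appear as ordinary `<a href>` elements, that their URLs contain the substring `question`, and the helper names `LinkCollector`, `extract_question_links`, and `collect_answers`, which are all hypothetical. If the result page is rendered by JavaScript, the links may not be present in the raw HTML at all.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)


def extract_question_links(html, base_url='https://so.toutiao.com/', marker='question'):
    # ASSUMPTION: question links are plain <a href> elements whose URL
    # contains `marker`; adjust this filter to the real page structure.
    parser = LinkCollector()
    parser.feed(html)
    seen, result = set(), []
    for href in parser.links:
        url = urljoin(base_url, href)  # resolve relative links
        if marker in url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def collect_answers(html, fetch):
    """Fetch every question page; `fetch` is any callable mapping a URL to text."""
    return {url: fetch(url) for url in extract_question_links(html)}
```

One possible way to wire it up is `collect_answers(search_questions('网赚'), lambda u: requests.get(u, headers={'User-Agent': agent}, timeout=10).text)`. The fixed `Host` header is left out of that call because question pages may be served from a different host than `so.toutiao.com`.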