requests

Official documentation

Fetching a page's source

import requests

url = "https://docs.scrapy.org/en/latest/"
response = requests.get(url)
html = response.text  # response body decoded to str
print(html)
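The plain call above has no timeout and does not check the status code, so a slow server can hang the script and an error page would be printed as if it were the real content. A minimal sketch of a more defensive fetch follows. The name `fetch_html` and its `get` parameter are hypothetical and exist only for this illustration; `get` defaults to `requests.get` and is there so the function can be tried without network access.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and `get` are hypothetical names used for this sketch only.
    """
    # By default requests does not time out, so pass a timeout explicitly.
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of returning an error page.
    response.raise_for_status()
    return response.text
```

Under these assumptions, `fetch_html("https://docs.scrapy.org/en/latest/")` would return the same text as `response.text` above when the request succeeds.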

Extracting information with regular expressions

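Suppose we want to extract the tag links shown on the home page. The browser's developer tools show that each tag is rendered with markup like the following.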

<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>
import re

import requests

url = "https://quotes.toscrape.com/"
response = requests.get(url)
html_text = response.text
# print(html_text)
# Target markup: <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>
# [^>]*? skips any attributes before href without running past the opening tag.
pattern = r'<a class="tag"[^>]*?\shref="(?P<href>[^"]*)">(?P<text>.*?)</a>'
for match in re.finditer(pattern, html_text):
    href = match.group('href')
    text = match.group('text')
    print(f"href: {href}, text: {text}")
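To check the pattern without hitting the network, it can be run against a literal string. The sketch below assumes a hypothetical helper `extract_tags` and a made-up `sample` line holding two tag links. The pattern uses `[^>]*?` before `href`, so a match cannot run past the opening tag; a greedy `.*` in that position would merge the two links on this line into a single match.

```python
import re

# [^>]*? lazily skips attributes such as style="..." but never crosses a ">".
TAG_PATTERN = re.compile(
    r'<a class="tag"[^>]*?\shref="(?P<href>[^"]*)">(?P<text>.*?)</a>'
)


def extract_tags(html_text):
    """Return a list of (href, text) pairs, one per tag link found."""
    return [
        (m.group("href"), m.group("text"))
        for m in TAG_PATTERN.finditer(html_text)
    ]


# Hypothetical sample: two tag links on one line, one with a style attribute.
sample = (
    '<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a> '
    '<a class="tag" href="/tag/love/">love</a>'
)
print(extract_tags(sample))
# prints [('/tag/simile/', 'simile'), ('/tag/love/', 'love')]
```

Regular expressions work here because the markup is short and regular; for nested or irregular HTML a real parser is usually the safer choice.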