requests
Fetching a web page's source
import requests

url = "https://docs.scrapy.org/en/latest/"
response = requests.get(url)
html = response.text  # response body decoded to a str
print(html)
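The call above assumes the request always succeeds. As a minimal sketch of a more defensive version, the fetch could be wrapped in a small helper that sets a timeout and raises on HTTP error codes. The name `fetch_html` and its `timeout` and `session` parameters are assumptions made for illustration; they are not part of requests.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the decoded body of `url`, raising on network or HTTP errors."""
    # Hypothetical helper: `session` lets a caller reuse a requests.Session
    # (or pass a stub when testing); by default the module-level API is used.
    http = session if session is not None else requests
    # Without a timeout, requests may wait indefinitely on a stalled server.
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of returning an error page.
    response.raise_for_status()
    return response.text
```

A caller would typically wrap `fetch_html("https://docs.scrapy.org/en/latest/")` in `try` / `except requests.RequestException`, which is the base class for the timeout, connection, and HTTP errors that requests raises.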
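Extracting information with regular expressions
Suppose we want to extract the tag information for the quotes on the home page. Inspecting the page with the browser's developer tools shows that each tag is rendered as a link like the following.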
<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>
import re
import requests

url = "https://quotes.toscrape.com/"
response = requests.get(url)
html_text = response.text
# print(html_text)
# Target markup:
# <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>
# [^>]*? skips other attributes without running past the end of the opening tag
pattern = r'<a class="tag" [^>]*?href="(?P<href>[^"]*)">(?P<text>.*?)</a>'
for match in re.finditer(pattern, html_text):
    href = match.group('href')
    text = match.group('text')
    print(f"href: {href}, text: {text}")
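To experiment with the expression without hitting the network, it can be run against a small sample string. The sketch below is only an illustration: `SAMPLE_HTML`, `TAG_LINK`, and `extract_tags` are made-up names, and the second and third sample links are assumed markup, not copied from the site. The pattern uses `[^>]*?` to skip over other attributes so that a match cannot extend beyond the opening tag.

```python
import re

# Sample markup: a tag link with an inline style, one without (assumed),
# and an unrelated link that must not match (assumed).
SAMPLE_HTML = '''
<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>
<a class="tag" href="/tag/love/">love</a>
<a class="other" href="/login">Login</a>
'''

# Named groups capture the link target and the visible tag text.
TAG_LINK = re.compile(
    r'<a class="tag" [^>]*?href="(?P<href>[^"]*)">(?P<text>.*?)</a>'
)


def extract_tags(html_text):
    """Return a list of (href, text) pairs, one per tag link found."""
    return [(m.group("href"), m.group("text")) for m in TAG_LINK.finditer(html_text)]


print(extract_tags(SAMPLE_HTML))
# → [('/tag/simile/', 'simile'), ('/tag/love/', 'love')]
```

Restricting the attribute part to `[^>]*?` matters when several links share one line: a greedy `.*` would jump to the last `href` on that line and report a single match. For anything beyond simple, regular markup, an HTML parser is usually more robust than a regular expression.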