requests

Fetching a page's HTML source

import requests

# Page to download
url = "https://docs.scrapy.org/en/latest/"
response = requests.get(url)
# response.text is the response body decoded to a str
html = response.text
print(html)
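
The snippet above assumes the request always succeeds. A slightly more defensive version might look like the sketch below. The helper name `fetch_html` and the 10-second default timeout are assumptions made for illustration, not part of the original notes; `timeout`, `raise_for_status()`, and `apparent_encoding` are standard requests features.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the decoded body of `url`, raising on network or HTTP errors.

    `fetch_html` is a hypothetical helper name used only in this sketch.
    """
    # A timeout keeps the call from waiting indefinitely on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx status codes into requests.HTTPError instead of
    # quietly returning the error page's HTML.
    response.raise_for_status()
    # If requests could not determine an encoding from the headers,
    # fall back to the one it guesses from the body.
    if response.encoding is None:
        response.encoding = response.apparent_encoding
    return response.text
```

Calling `fetch_html("https://docs.scrapy.org/en/latest/")` should return the same text as `response.text` in the snippet above, but a failed request now raises an exception instead of silently yielding an error page.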

Extracting information with regular expressions

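Suppose we want to extract the tag links from the home page. Inspecting the page with the browser's developer tools shows that the relevant markup looks like this: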

<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>

import requests
import re
url = "https://quotes.toscrape.com/"
response = requests.get(url)
html_text = response.text
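# print(html_text)
# Target markup, for example:
#   <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>
# [^>]*? skips any other attributes without running past the end of the opening tag.
pattern = r'<a class="tag"[^>]*?href="(?P<href>[^"]*)">(?P<text>.*?)</a>'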
matches_iter = re.finditer(pattern, html_text)
for match in matches_iter:
    href = match.group('href')
    text = match.group('text')
    print(f"href: {href}, text: {text}")
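
To try the extraction without touching the network, the same idea can be run against a fixed string. The sketch below is a minimal illustration. `SAMPLE_HTML` is a made-up snippet modeled on the tag link shown earlier, and `extract_tags` is a hypothetical helper name. The pattern uses `[^>]*?` rather than `.*` so that a match cannot spill over from one `<a>` element into the next.

```python
import re

# Hypothetical sample markup modeled on the tag links shown earlier;
# it is not fetched from the live site.
SAMPLE_HTML = '''
<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>
<a class="tag" href="/tag/love/">love</a>
<a class="other" href="/login">Login</a>
'''

# Named groups capture the link target and the visible text.
# [^>]* keeps each attribute scan inside a single opening tag.
TAG_PATTERN = re.compile(
    r'<a class="tag"[^>]*?href="(?P<href>[^"]*)"[^>]*>(?P<text>.*?)</a>'
)


def extract_tags(html_text):
    """Return a list of (href, text) pairs for every tag link found."""
    return [
        (m.group("href"), m.group("text"))
        for m in TAG_PATTERN.finditer(html_text)
    ]


tags = extract_tags(SAMPLE_HTML)
print(tags)  # [('/tag/simile/', 'simile'), ('/tag/love/', 'love')]
```

The link with `class="other"` is skipped because the pattern only matches anchors whose opening tag starts with `<a class="tag"`. Regular expressions work for markup this regular, but they are sensitive to attribute order and whitespace, so a pattern written for one page layout may need adjusting for another.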