Python Is a Cute Crawler (Part 3)

Parsing Web Pages with Regular Expressions

re.match

import re
m = re.match('www', 'www.santostang')
print(m)
print("Match span (start, end): ", m.span())
print("Match start: ", m.start())
print("Match end: ", m.end())

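re.match(pattern, string, flags=0): pattern is the regular expression, string is the text to match against, and flags holds optional modifiers such as re.IGNORECASE. It returns a match object if the pattern matches at the start of the string, and None otherwise.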
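
To make the flags argument and the no-match case concrete, here is a minimal sketch. The helper name match_prefix and the sample strings are made up for this illustration; they are not part of the original example.

```python
import re

def match_prefix(pattern, text, flags=0):
    """Return the matched text if `pattern` matches at the start of `text`, else None."""
    m = re.match(pattern, text, flags)
    # re.match returns None when there is no match, so check before calling .group()
    return m.group() if m else None

# Without flags the match is case-sensitive
print(match_prefix('www', 'WWW.santostang'))                 # None
# re.IGNORECASE (short form: re.I) makes the match case-insensitive
print(match_prefix('www', 'WWW.santostang', re.IGNORECASE))  # WWW
# re.match only looks at the start of the string
print(match_prefix('santostang', 'www.santostang'))          # None
```

Checking for None before calling .group() matters in practice: calling .group() on a failed match raises AttributeError.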

re.search

import re
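# re.match only tries the start of the string, so 'cc' is not found
re_match = re.match('cc','www.ithou.cc')
# re.search scans the whole string and returns the first match
re_search = re.search('cc','www.ithou.cc')
print(re_match)   # None
print(re_search)  # a match object with span (10, 12)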

re.findall

# -*- coding: utf-8 -*-

import re
line = '123456 is the first number, 789 is the second'
re_match = re.match(r'[0-9]+', line)
re_search = re.search(r'[0-9]+', line)
re_findall = re.findall(r'[0-9]+', line)
print(re_match.group())
print(re_search.group())
print(re_findall)  # returns all matches as a list

# Output
123456
123456
['123456', '789']

Summary

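Method	What it does	Matches returned
`re.match`	Matches only at the start of the string	First
`re.search`	Scans the whole text and returns the first successful match	First
`re.findall`	Scans the whole text and returns every successful match	All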

Example: Scraping Blog Post Titles

Using my own blog, https://ithou.cc, as the example, let's scrape the post titles on the first page.

#coding: utf-8
import re
import requests

link = 'https://ithou.cc/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'}
r = requests.get(link, headers = headers)
print('Status code:', r.status_code)

html = r.text
pat = '<h2 class="post-title" itemprop="name headline"> <a class="post-title-link" href=.*?>(.*?)</a></h2>'
title_list = re.findall(pat, html)
print(title_list)

# Output (the blog keeps changing, so yours may differ from what is shown below)
Status code: 200
['正则表达式学习笔记', '今天我去见了七堇年', 'GitHub Pages搭建博客之路（一）', 'Python是一只可爱的爬虫（二）', 'Python是一只可爱的爬虫（一）', '高考是一场修行', '写给小欣 & 小容', '东德：枪口抬高一厘米', 'Lesson 5 No wrong numbers', '微不足道']
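
Why does findall return only the titles rather than the whole tag? The sketch below illustrates the two features the pattern relies on: a capturing group and non-greedy matching. The HTML fragment is a made-up imitation of the blog's markup, not the real page source.

```python
import re

# A made-up fragment that imitates the blog's markup (assumption, not real page source)
html = (
    '<h2 class="post-title"> <a class="post-title-link" href="/a/">First post</a></h2>'
    '<h2 class="post-title"> <a class="post-title-link" href="/b/">Second post</a></h2>'
)

# One capturing group: findall returns only the text captured by (.*?)
pat = r'<a class="post-title-link" href=.*?>(.*?)</a>'
titles = re.findall(pat, html)
print(titles)   # ['First post', 'Second post']

# Two capturing groups: findall returns one tuple per match
pat_with_link = r'<a class="post-title-link" href="(.*?)">(.*?)</a>'
pairs = re.findall(pat_with_link, html)
print(pairs)    # [('/a/', 'First post'), ('/b/', 'Second post')]

# Greedy .* instead of non-greedy .*? swallows everything up to the last tag,
# so the two links collapse into a single match and the first title is lost
greedy = re.findall(r'<a class="post-title-link" href=.*>(.*)</a>', html)
print(greedy)   # ['Second post']
```

The non-greedy .*? stops at the first position where the rest of the pattern can match, which is what keeps each title separate.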

Parsing Web Pages with BeautifulSoup

import requests
from bs4 import BeautifulSoup

link = 'https://ithou.cc'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'}
r = requests.get(link, headers = headers)

# Parse the page with BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')  # turn the response body text into a soup object
# find returns only the first result, much like match and search;
# find_all returns every result as a list-like object
title_list = soup.find_all("h2", class_="post-title")

for h2 in title_list:
    title = h2.a.text.strip()
    print(title)

# Output
正则表达式学习笔记
今天我去见了七堇年
GitHub Pages搭建博客之路（一）
Python是一只可爱的爬虫（二）
Python是一只可爱的爬虫（一）
高考是一场修行
写给小欣 & 小容
东德：枪口抬高一厘米
Lesson 5 No wrong numbers
微不足道
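
The difference between find and find_all is easier to see on a small offline document. This is a minimal sketch: the HTML fragment is a made-up imitation of the blog's markup, and it needs the bs4 package installed just like the example above.

```python
from bs4 import BeautifulSoup

# A made-up fragment that imitates the blog's markup (assumption, not real page source)
html = """
<h2 class="post-title"><a class="post-title-link" href="/a/"> First post </a></h2>
<h2 class="post-title"><a class="post-title-link" href="/b/">Second post</a></h2>
"""

soup = BeautifulSoup(html, 'html.parser')

# find returns the first matching tag, or None if nothing matches
first = soup.find('h2', class_='post-title')
print(first.a.text.strip())

# find_all returns every matching tag in a list-like object
titles = [h2.a.text.strip() for h2 in soup.find_all('h2', class_='post-title')]
print(titles)

# Tag attributes are read with dictionary-style access
links = [h2.a['href'] for h2 in soup.find_all('h2', class_='post-title')]
print(links)
```

Because find can return None, scripts that run against live pages should check its result before accessing .a or .text.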

XPath

XPath is a language for locating information in XML documents.

import requests
from lxml import etree

link = 'https://ithou.cc'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'}
r = requests.get(link, headers = headers)

html = etree.HTML(r.text)
title_list = html.xpath('//h2[@class="post-title"]/a/text()')
print(title_list)

# Output
['正则表达式学习笔记', '今天我去见了七堇年', 'GitHub Pages搭建博客之路（一）', 'Python是一只可爱的爬虫（二）', 'Python是一只可爱的爬虫（一）', '高考是一场修行', '写给小欣 & 小容', '东德：枪口抬高一厘米', 'Lesson 5 No wrong numbers', '微不足道']
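
To see how the path expression is put together, here is a minimal offline sketch. Note the swap: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. ElementTree supports only a limited subset of XPath (no text(), for instance) and, unlike etree.HTML, requires well-formed XML. The fragment is a made-up imitation of the blog's markup.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment that imitates the blog's markup (assumption)
doc = """
<div>
  <h2 class="post-title"><a href="/a/">First post</a></h2>
  <h2 class="other"><a href="/x/">Not a post title</a></h2>
  <h2 class="post-title"><a href="/b/">Second post</a></h2>
</div>
"""

root = ET.fromstring(doc)

# .//h2            every h2 element below the root
# [@class='...']   keep only those whose class attribute equals the value
# /a               then step into the child a element
anchors = root.findall(".//h2[@class='post-title']/a")

# ElementTree has no text() step, so read .text from each element instead
titles = [a.text for a in anchors]
links = [a.get('href') for a in anchors]
print(titles)   # ['First post', 'Second post']
print(links)    # ['/a/', '/b/']
```

With lxml, the same selection can end in /text() or /@href to return the strings directly, as in the example above.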

Summary

There are three ways here to extract data from page source: BeautifulSoup, regular expressions, and lxml.

HTML parser	Speed	Extraction method	Notes
Regular expressions	Fast	Regular expressions	Study notes
BeautifulSoup	Fast (when using lxml)	find / find_all
lxml	Fast	XPath

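Regular expressions are the hardest of the three to learn: the name is short, yet entire books have been written about them. By comparison, BeautifulSoup and lxml are easier to pick up.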