Python Is a Cute Little Crawler (Part 4)

Saving to TXT or CSV

CSV (Comma-Separated Values) is a plain-text file format for tabular data. Rows are separated by line breaks and columns by commas. CSV files can be opened in Notepad or Excel.
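To make the format concrete, here is a minimal sketch that writes a few rows to an in-memory buffer and prints the raw text. The sample rows are made up for illustration; one field contains a comma on purpose to show that the csv module wraps it in quotes.

```python
import csv
import io

# Hypothetical sample rows; the second field of row 2 contains a comma.
rows = [["id", "title"], ["1", "Hello, world"], ["2", "plain text"]]

buffer = io.StringIO()  # in-memory stand-in for a real file
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(repr(text))  # shows the commas, the quoting, and the line endings

# Reading the text back recovers the original rows.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed == rows)
```

The quoting is why it is safer to go through the csv module than to split lines on commas by hand.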

Reading a CSV file

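import csv

# newline='' lets the csv module handle line endings itself
with open('test.csv', 'r', encoding='utf-8', newline='') as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in csv_reader:
        print(row)     # the whole row as a list of strings
        print(row[0])  # the first column of that row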
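If the file has a header row, csv.DictReader may be more convenient than indexing by position. The sketch below is only an illustration: it first creates a small sample file in a temporary directory (the file contents and column names are assumptions, not the author's data) and then reads it back by column name.

```python
import csv
import os
import tempfile

# Create a small sample file so the sketch runs on its own (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "test.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows([["url", "content"], ["www.ithou.cc", "blog"]])

# DictReader takes the first row as field names and yields one dict per data row.
with open(path, "r", encoding="utf-8", newline="") as f:
    records = list(csv.DictReader(f))

for record in records:
    print(record["url"], record["content"])
```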

Writing a CSV file

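import csv

output = ['1', '2', '3']
# 'a+' appends to the file; the ./csv directory must already exist.
# newline='' prevents blank lines between rows on Windows.
with open(r'./csv/test2.csv', 'a+', encoding='UTF-8', newline='') as csvfile:
    w = csv.writer(csvfile)
    w.writerow(output)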
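Because the file is opened in append mode, running a script twice keeps adding rows. One possible way to handle a header row in that situation is sketched below; append_rows is a hypothetical helper, and the demo writes to a temporary directory rather than ./csv.

```python
import csv
import os
import tempfile

def append_rows(path, header, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)

# Demo in a temporary directory (hypothetical data).
demo_path = os.path.join(tempfile.mkdtemp(), "csv", "test2.csv")
append_rows(demo_path, ["a", "b", "c"], [["1", "2", "3"]])
append_rows(demo_path, ["a", "b", "c"], [["4", "5", "6"]])

with open(demo_path, encoding="utf-8", newline="") as f:
    saved = list(csv.reader(f))
print(saved)
```

The second call finds a non-empty file, so it skips the header and only adds the data row.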

Basic MySQL commands

I don't have MySQL installed on Windows, so I ran everything directly on a Tencent Cloud host.

> use urls;
> create table urls (
    id INT NOT NULL auto_increment,
    url VARCHAR(1000) NOT NULL,
    content VARCHAR(4000) NOT NULL,
    PRIMARY KEY(id));
> describe urls;

Result:
[Screenshot: output of describe urls;]

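Notes:

auto_increment: the column value increments automatically for each new row
PRIMARY KEY(id): makes id the primary key
describe urls; shows the structure of the urls table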
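The three points above can be tried out without a MySQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in, which is an assumption made for runnability and not what this post uses. SQLite's syntax differs slightly: the auto-incrementing key is written INTEGER PRIMARY KEY AUTOINCREMENT, and PRAGMA table_info plays the role of describe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE urls ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "url VARCHAR(1000) NOT NULL, "
    "content VARCHAR(4000) NOT NULL)"
)

# No id is supplied, so the database assigns it.
conn.execute("INSERT INTO urls (url, content) VALUES (?, ?)", ("www.ithou.cc", "blog"))
conn.execute("INSERT INTO urls (url, content) VALUES (?, ?)", ("ithou.github.io", "GitHub Pages"))
ids = [row[0] for row in conn.execute("SELECT id FROM urls ORDER BY id")]
print(ids)

# The primary key must be unique: reusing id 1 is rejected.
try:
    conn.execute("INSERT INTO urls (id, url, content) VALUES (?, ?, ?)", (1, "dup.example", "duplicate id"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)

# SQLite's rough equivalent of "describe urls;": (name, declared type, is primary key)
columns = [(row[1], row[2], row[5]) for row in conn.execute("PRAGMA table_info(urls)")]
print(columns)
conn.close()
```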

Inserting data

insert into urls(url, content) values('www.ithou.cc', '我的个人博客-吾尤爱汝');
insert into urls(url, content) values('ithou.github.io', 'GitHub Pages');


Modifying data with UPDATE

Update the row with id=2:

UPDATE urls SET url='google.com', content='google' where id=2;

After the update, the row with id=2 contains url 'google.com' and content 'google'.
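The same UPDATE can be issued from Python with parameters instead of values pasted into the SQL string. This is a sketch only: it runs against an in-memory SQLite database as a stand-in for MySQL (an assumption, so that it runs anywhere), and SQLite uses ? as its placeholder where mysqlclient uses %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "url TEXT NOT NULL, content TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO urls (url, content) VALUES (?, ?)",
    [("www.ithou.cc", "blog"), ("ithou.github.io", "GitHub Pages")],
)

# The WHERE clause limits the change to one row; rowcount reports how many rows matched.
cur = conn.execute(
    "UPDATE urls SET url = ?, content = ? WHERE id = ?",
    ("google.com", "google", 2),
)
updated = cur.rowcount

# An id that does not exist changes nothing.
missing = conn.execute("UPDATE urls SET url = ? WHERE id = ?", ("nowhere.example", 99)).rowcount
conn.commit()

rows = conn.execute("SELECT id, url, content FROM urls ORDER BY id").fetchall()
print(updated, missing, rows)
conn.close()
```

Checking rowcount is a cheap way to notice an UPDATE whose WHERE clause matched nothing.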

Working with the database from Python

Installing mysqlclient

pip install mysqlclient

Connecting to MySQL from Python

# coding: utf-8
import MySQLdb

# Connect to the target database
conn = MySQLdb.connect(host='localhost', user='root', passwd='YOUR_PASSWORD', db='urls')
cur = conn.cursor()  # create a cursor from the connection conn
# Execute a MySQL statement through the cursor cur
cur.execute("insert into urls (url, content) values ('www.bing.com', 'biying')")
cur.close()
conn.commit()  # commit the transaction
conn.close()

# Safely close the cursor cur and the connection conn
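The connect, execute, commit, close sequence above can be wrapped so that a failed statement rolls the transaction back and the cursor is always closed. The following is a rough sketch, not part of the original script: insert_rows is a hypothetical helper that accepts any DB-API connection, and the demo uses an in-memory sqlite3 database (with ? placeholders) so it runs without a MySQL server. With a mysqlclient connection the default %s placeholder would apply.

```python
import sqlite3

def insert_rows(conn, rows, placeholder="%s"):
    """Insert (url, content) pairs in one transaction; roll back if any row fails."""
    sql = "INSERT INTO urls (url, content) VALUES ({0}, {0})".format(placeholder)
    cur = conn.cursor()
    try:
        cur.executemany(sql, rows)
        conn.commit()
    except Exception:
        conn.rollback()  # undo every row from this batch
        raise
    finally:
        cur.close()      # the cursor is closed on success and on failure

# Demo with SQLite standing in for MySQL (assumption for runnability).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT NOT NULL, content TEXT NOT NULL)")
insert_rows(demo, [("www.bing.com", "biying"), ("www.ithou.cc", "blog")], placeholder="?")
count = demo.execute("SELECT COUNT(*) FROM urls").fetchone()[0]

# The second row violates NOT NULL, so the whole batch is rolled back.
try:
    insert_rows(demo, [("ok.example", "fine"), (None, "missing url")], placeholder="?")
except sqlite3.IntegrityError:
    pass
count_after_failure = demo.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
print(count, count_after_failure)
demo.close()
```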
The script below scrapes the post titles from the blog and saves each link and title to the urls table:

# coding: utf-8
import requests
import MySQLdb
from bs4 import BeautifulSoup

conn = MySQLdb.connect(host='localhost', user='root', passwd='YOUR_PASSWORD', db='urls', charset="utf8")
cur = conn.cursor()

link = 'https://ithou.cc'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, 'html5lib')
title_list = soup.find_all('h2', class_='post-title')

for each in title_list:
    url = "https://" + each.a['href']  # prepend the scheme to the link found in the page
    title = each.a.text.strip()
    # Parameterized query: the driver escapes url and title
    cur.execute("INSERT INTO urls (url, content) VALUES (%s, %s)", (url, title))

cur.close()
conn.commit()
conn.close()
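The script builds each URL by prepending "https://" to the href. Whether that produces a valid address depends on what the page's href values look like, which the post does not show. As a possible alternative, urljoin from the standard library resolves an href against the page address. The href values below are hypothetical examples, not taken from the blog.

```python
from urllib.parse import urljoin

base = "https://ithou.cc"

# Hypothetical href values: a site-relative path and an absolute URL.
hrefs = ["/2018/06/post.html", "https://ithou.github.io/about/"]

resolved = [urljoin(base, href) for href in hrefs]
naive = ["https://" + href for href in hrefs]

for href, good, bad in zip(hrefs, resolved, naive):
    print(href, "->", good, "| naive:", bad)
```

With these inputs urljoin attaches the host to the relative path and leaves the absolute URL unchanged, while plain concatenation produces a malformed address in both cases.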

Result:

[Screenshot: mysql-python]

———— The End ————