1. Workflow for using the csv module
    1. Import the module
        import csv
    2. Open the file (xxx.csv)
        with open('xxx.csv', 'a', encoding='utf-8') as f:
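        1. a and a+: the "append" modes
            a   append-only writing
            a+  append and read; writes always go to the end of the file
        2. r and r+
            r   read-only
            r+  read and write; the file must already exist and is not truncated
        3. w and w+
            w   write-only; the file is created or truncated
            w+  write and read; the file is created or truncated first, so there is nothing to read until something has been written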
    3. Initialize the writer object
        writer = csv.writer(f)
    4. Write the data
        writer.writerow(['孙悟空', '兰陵王'])
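The four steps above, together with the difference between the 'w' and 'a' modes, can be put together in a minimal sketch. The file name `heroes_demo.csv` and the helper names `write_rows` and `read_rows` are hypothetical, chosen only for illustration. Passing `newline=''` is an addition that the csv module documentation recommends, so that no blank lines appear between rows on Windows.

```python
import csv
import os
import tempfile


def write_rows(path, rows, mode="a"):
    """Write rows to a CSV file; mode 'a' appends, mode 'w' overwrites."""
    # newline='' lets the csv module control line endings itself
    with open(path, mode, newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)


def read_rows(path):
    """Read every row of a CSV file back as a list of lists."""
    with open(path, "r", newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


if __name__ == "__main__":
    # Hypothetical demo file inside a temporary directory
    path = os.path.join(tempfile.mkdtemp(), "heroes_demo.csv")
    write_rows(path, [["孙悟空", "兰陵王"]], mode="w")    # start a fresh file
    write_rows(path, [["霸王别姬", "张国荣"]], mode="a")  # append a second row
    print(read_rows(path))
```

Calling `write_rows` again with `mode="w"` would discard both rows, which is why a script that adds one page of results at a time normally opens its output file with 'a'.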
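Example:
Scraping the Maoyan movie top10 chart
    1. URL: the address of the chart page (url)
    2. Goal: scrape the data you want
    3. Save locally: a CSV file
    4. Steps
        1. Find the URL pattern
            Page 1: https://maoyan.com/board/4?offset=0
            Page 4: https://maoyan.com/board/4?offset=30
            Page n: offset=(n-1)*10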
        2. Write the regular expression
            '<div class="movie-item-info">.*?title="(.*?)".*?class="star">(.*?)</p>.*?class="releasetime">(.*?)</p>', re.S
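The URL rule and the regular expression above can be tried offline before any request is sent. The sketch below is only an illustration: the helper names `page_url` and `parse_movies` and the `SAMPLE_HTML` fragment are hypothetical. The fragment is written to match the shape the pattern expects and is not copied from the live site, whose markup may differ or change.

```python
import re

BASE_URL = "https://maoyan.com/board/4?offset={}"

# Three lazy groups: title, star line, release-time line.
# re.S lets '.' match newlines, since one movie block spans several lines.
PATTERN = re.compile(
    r'<div class="movie-item-info">.*?title="(.*?)".*?'
    r'class="star">(.*?)</p>.*?class="releasetime">(.*?)</p>',
    re.S,
)

# Hypothetical fragment for illustration only (not real page source)
SAMPLE_HTML = """
<div class="movie-item-info">
  <p class="name"><a href="#" title="霸王别姬">霸王别姬</a></p>
  <p class="star">
      主演:张国荣,张丰毅,巩俐
  </p>
  <p class="releasetime">上映时间:1993-01-01</p>
</div>
"""


def page_url(n):
    """Return the URL of page n (1-based), using offset = (n - 1) * 10."""
    return BASE_URL.format((n - 1) * 10)


def parse_movies(html):
    """Return one (title, star, releasetime) tuple per movie block found."""
    return [tuple(field.strip() for field in m) for m in PATTERN.findall(html)]


if __name__ == "__main__":
    print(page_url(1))
    print(page_url(4))
    print(parse_movies(SAMPLE_HTML))
```

Because every `.*?` is lazy, each group stops at the first closing quote or `</p>` it meets, so several movie blocks on one page produce several separate tuples.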
Exercise: scrape the Maoyan top10 information
from urllib import request
import re
import time
import csv


class MaoyanSpider(object):
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"}
        self.page = 1
        # Used as a page counter

    # Fetch one page and pass its HTML to the parser
    def get_page(self, url):
        req = request.Request(url, headers=self.headers)
        res = request.urlopen(req)
        html = res.read().decode('utf-8')
        # Call the parsing function directly
        self.parse_page(html)

    def parse_page(self, html):
        p = re.compile('<div class="movie-item-info">.*?title="(.*?)".*?class="star">(.*?)</p>.*?class="releasetime">(.*?)</p>', re.S)
        r_list = p.findall(html)
        # Call the saving function directly
        # r_list: [('霸王别姬', '张国荣', '1993'), (), ()]
        self.write_csv(r_list)

    # Save the data
    def write_csv(self, r_list):
        # newline='' avoids blank lines between rows on Windows;
        # encoding='utf-8' keeps the Chinese text intact on any platform
        with open('猫眼电影top10.csv', 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            # Write each movie's information in turn
            for r_t in r_list:
                film = [
                    r_t[0].strip(),
                    r_t[1].strip(),
                    r_t[2].strip()
                ]
                writer.writerow(film)

    # Main function
    def work_om(self):
        for pn in range(0, 41, 10):
            url = 'https://maoyan.com/board/4?offset=%s' % str(pn)
            self.get_page(url)
            print('Page %d scraped successfully' % self.page)
            self.page += 1
            time.sleep(4)


if __name__ == '__main__':
    begin = time.time()
    spider = MaoyanSpider()
    spider.work_om()
    end = time.time()
    print("Elapsed time: %.2f" % (end - begin))

