Crawler in Practice: A Movie Heaven (dytt8.net) Crawler

Preface

I've been learning web scraping for a while now, and yesterday, with some free time on my hands, I wrote a small crawler that scrapes the first seven pages of popular movies on Movie Heaven (dytt8.net). It uses the requests library to fetch pages and the lxml library to parse the returned HTML. BeautifulSoup would work just as well for the parsing, but I prefer combining lxml with XPath syntax.
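To give a feel for that style before the full program, here is a minimal sketch of requests + lxml + XPath; the URL and the XPath expression are placeholders, not part of this crawler:

import requests
from lxml import etree

# Placeholder example: fetch a page and extract link targets with XPath.
resp = requests.get("https://example.com")
html = etree.HTML(resp.text)       # parse the HTML string into an element tree
links = html.xpath("//a/@href")    # XPath: the href attribute of every <a> element
print(links)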


The project scrapes the first seven pages of Movie Heaven's popular movies in three steps: 1. collect the detail-page URL of every movie on a listing page; 2. parse each movie's detail page; 3. combine steps 1 and 2 to crawl the first seven pages.


The full code is as follows:

import requests
from lxml import etree

BASE_DOMAIN = "https://www.dytt8.net"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}

def get_detail_url(url):
    """Fetch one listing page and return the absolute URL of every movie detail page on it."""
    resp = requests.get(url, headers=HEADERS)
    html = etree.HTML(resp.text)
    detail_urls = html.xpath("//ul//table[@class='tbspan']//a/@href")
    # The hrefs are relative, so prefix the site domain to make them absolute.
    return map(lambda u: BASE_DOMAIN + u, detail_urls)

def parse_field(info, rule):
    """Strip a '◎...' field label from one line of the info block."""
    return info.replace(rule, "").strip()

def parse_detail_page(url):
    """Parse a single movie detail page into a dict of fields."""
    resp = requests.get(url, headers=HEADERS)
    # The site serves GBK-encoded pages; ignore the occasional byte GBK cannot decode.
    text = resp.content.decode("gbk", errors="ignore")
    html = etree.HTML(text)
    movie = {}
    # Movie title
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']//text()")[0]
    movie['title'] = title
    # Movie poster (xpath returns a list; keep the first image if there is one)
    imgs = html.xpath("//div[@id='Zoom']//img/@src")
    movie['cover'] = imgs[0] if imgs else None
    # Every text node under the Zoom div: one entry per '◎' field line
    infos = html.xpath("//div[@id='Zoom']//text()")
    for index, info in enumerate(infos):
        if info.startswith("◎年  代"):
            movie['year'] = parse_field(info, "◎年  代")
        elif info.startswith("◎产  地"):
            movie['country'] = parse_field(info, "◎产  地")
        elif info.startswith("◎类  别"):
            movie['category'] = parse_field(info, "◎类  别")
        elif info.startswith("◎豆瓣评分"):
            movie['douban_score'] = parse_field(info, "◎豆瓣评分")
        elif info.startswith("◎导  演"):
            # Directors can span several lines, up to the next field label.
            directors = [parse_field(info, "◎导  演")]
            for x in range(index + 1, len(infos)):
                director = infos[x].strip()
                if director.startswith("◎编  剧"):
                    break
                directors.append(director)
            movie['directors'] = directors
        elif info.startswith("◎编  剧"):
            movie['screenwriter'] = parse_field(info, "◎编  剧")
        elif info.startswith("◎主  演"):
            # Actors also span several lines, up to the next field label.
            actors = [parse_field(info, "◎主  演")]
            for x in range(index + 1, len(infos)):
                actor = infos[x].strip()
                if actor.startswith("◎标  签"):
                    break
                actors.append(actor)
            movie['actors'] = actors
        elif info.startswith("◎简  介"):
            # Collect every synopsis line until the awards section starts.
            details = []
            for x in range(index + 1, len(infos)):
                detail = infos[x].strip()
                if detail.startswith("◎获奖情况"):
                    break
                details.append(detail)
            movie['detail'] = "".join(details)
    # Download links
    movie['download_url'] = html.xpath("//td[@bgcolor='#fdfddf']//a/text()")
    return movie

def spider():
    # Crawl the first 7 listing pages
    base_url = "https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html"
    movies = []
    for x in range(1, 8):
        # Outer loop: fetch one listing page per iteration
        url = base_url.format(x)
        detail_urls = get_detail_url(url)
        for detail_url in detail_urls:
            # Inner loop: parse each movie's detail page
            movie = parse_detail_page(detail_url)
            movies.append(movie)
    print(movies)

if __name__ == '__main__':
    spider()


Step 1

Collect the movie detail-page URLs from each listing page:
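Using the get_detail_url function from the listing above, a quick check of the first page could look like this (a sketch; it assumes dytt8.net is reachable and its markup is unchanged):

# Print every detail-page URL found on listing page 1.
url = "https://www.dytt8.net/html/gndy/dyzz/list_23_1.html"
for detail_url in get_detail_url(url):
    print(detail_url)   # absolute URLs, i.e. BASE_DOMAIN + the relative href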


Step 2

Parse each movie detail page:
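parse_detail_page can also be tried on a single movie page by itself; the URL below is a hypothetical placeholder, so substitute one printed by step 1:

# Parse one detail page and inspect a few of the extracted fields.
movie = parse_detail_page("https://www.dytt8.net/html/gndy/dyzz/20190101/12345.html")  # placeholder URL
print(movie['title'])
print(movie['douban_score'])
print(movie['download_url'])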


Step 3

Crawl the first seven pages:
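Step 3 only wires the first two functions together. If you ever want a different page count, the only thing that changes is the range in spider; a parameterized variant might look like this (a sketch, same logic as spider above):

def spider_pages(pages=7):
    # Same crawl as spider(), but the number of listing pages is a parameter.
    base_url = "https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html"
    movies = []
    for x in range(1, pages + 1):
        for detail_url in get_detail_url(base_url.format(x)):
            movies.append(parse_detail_page(detail_url))
    return movies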
