Python scrapy crawler project (2)
Crawl target: Fang.com nationwide rental listing site (start URL: http://zu.fang.com/cities.aspx)
Fields to crawl: city; listing name; rental type; price; floor plan; area; address; transit
Anti-anti-crawling measures: set a random user-agent and a request download delay
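The write-up names the anti-crawling measures but never shows them. A minimal sketch of a random user-agent downloader middleware, assuming any realistic pool of UA strings (the `USER_AGENTS` list and the middleware name are illustrative, not from the original project):

```python
import random

# Illustrative user-agent pool; swap in whatever UA strings you prefer
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware: pick a fresh User-Agent for every outgoing request."""

    def process_request(self, request, spider):
        # Scrapy calls this hook for each request before it is downloaded
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
```

To activate it, register the class in settings.py under `DOWNLOADER_MIDDLEWARES` (the module path below is an assumption based on the project name) and add `DOWNLOAD_DELAY = 2` for the request-delay measure:

```python
DOWNLOADER_MIDDLEWARES = {
    'fang.middlewares.RandomUserAgentMiddleware': 543,
}
DOWNLOAD_DELAY = 2
```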
1. Create the project
scrapy startproject fang
2. Enter the fang directory and run the command below to generate the spider file, then write the spider code.
scrapy genspider zufang "zu.fang.com"
Once the command finishes, open the project directory in an IDE such as PyCharm.
3. Edit items.py in the project directory to define the fields you want to crawl.
import scrapy


class HomeproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    city = scrapy.Field()       # city
    title = scrapy.Field()      # listing name
    rentway = scrapy.Field()    # rental type
    price = scrapy.Field()      # price
    housetype = scrapy.Field()  # floor plan
    area = scrapy.Field()       # area
    address = scrapy.Field()    # address
    traffic = scrapy.Field()    # transit
4. Enter the spiders directory, open the generated spider file, and write the crawling logic.
# -*- coding: utf-8 -*-
import scrapy
from homepro.items import HomeproItem
from scrapy_redis.spiders import RedisCrawlSpider


# scrapy.Spider
class HomeSpider(RedisCrawlSpider):
    name = 'home'
    allowed_domains = ['zu.fang.com']
    # start_urls = ['http://zu.fang.com/cities.aspx']

    # start URLs are read from this redis list instead of start_urls
    redis_key = 'homespider:start_urls'

    def parse(self, response):
        # collect the per-city listing links from the city index page
        hrefs = response.xpath('//div[@class="onCont"]/ul/li/a/@href').extract()
        for href in hrefs:
            href = 'http:' + href
            yield scrapy.Request(url=href, callback=self.parse_city, dont_filter=True)

    def parse_city(self, response):
        # the pager text looks like "共N頁"; strip the surrounding characters to get N
        page_num = response.xpath('//div[@id="rentid_D10_01"]/span[@class="txt"]/text()').extract()[0].strip('共頁')

        for page in range(1, int(page_num) + 1):  # +1 so the last page is included
            if page == 1:
                url = response.url
            else:
                url = response.url + 'house/i%d' % (page + 30)
            yield scrapy.Request(url=url, callback=self.parse_houseinfo, dont_filter=True)

    def parse_houseinfo(self, response):
        divs = response.xpath('//dd[@class="info rel"]')
        for info in divs:
            city = info.xpath('//div[@class="guide rel"]/a[2]/text()').extract()[0].rstrip("租房")
            title = info.xpath('.//p[@class="title"]/a/text()').extract()[0]
            rentway = info.xpath('.//p[@class="font15 mt12 bold"]/text()')[0].extract().replace(" ", '').lstrip('\r\n')
            housetype = info.xpath('.//p[@class="font15 mt12 bold"]/text()')[1].extract().replace(" ", '')
            area = info.xpath('.//p[@class="font15 mt12 bold"]/text()')[2].extract().replace(" ", '')
            addresses = info.xpath('.//p[@class="gray6 mt12"]//span/text()').extract()
            address = '-'.join(addresses)
            try:
                des = info.xpath('.//p[@class="mt12"]//span/text()').extract()
                traffic = '-'.join(des)
            except Exception:
                traffic = "暫無詳細信息"  # "no detailed information yet"

            p_name = info.xpath('.//div[@class="moreInfo"]/p/text()').extract()[0]
            p_price = info.xpath('.//div[@class="moreInfo"]/p/span/text()').extract()[0]
            price = p_price + p_name

            item = HomeproItem()
            item['city'] = city
            item['title'] = title
            item['rentway'] = rentway
            item['price'] = price
            item['housetype'] = housetype
            item['area'] = area
            item['address'] = address
            item['traffic'] = traffic
            yield item
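Two small details in the spider are worth checking in isolation: `strip('共頁')` works because the pager text has the form "共N頁" and `str.strip` removes those characters from both ends, and every page after the first appends `house/i%d` with an offset of 30. A quick sketch (the helper names are mine, not the project's):

```python
def parse_page_count(pager_text):
    # "共12頁" -> 12: strip removes the leading 共 and trailing 頁
    return int(pager_text.strip('共頁'))


def page_url(base_url, page):
    # Pagination scheme used by the spider: page 1 is the city root URL,
    # later pages append 'house/i%d' with a +30 offset (so page 2 -> i32)
    if page == 1:
        return base_url
    return base_url + 'house/i%d' % (page + 30)
```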
5. Configure settings.py with the scrapy-redis options the crawl needs.
# Use the scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Use the scrapy-redis duplicate filter
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# Queue class used to order pending requests:
# the default priority queue (neither FIFO nor LIFO), backed by a redis sorted set
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'

REDIS_HOST = '10.8.153.73'
REDIS_PORT = 6379

# Whether to keep the scheduler queue and dupefilter records on close
# (True = keep, False = clear)
SCHEDULER_PERSIST = True
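The settings above wire up the shared scheduler and dupefilter but no item pipeline, so the yielded items are not stored anywhere central. If you also want the scraped items collected in redis, scrapy-redis ships a ready-made pipeline that can be enabled alongside the settings above (the priority value 300 is an arbitrary choice; by default it pushes serialized items to the redis list `<spider name>:items`, here `home:items`):

```python
# settings.py (addition): have scrapy-redis store serialized items in redis
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
```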
6. Copy the code to the other worker machines and start them; on each worker, connect redis-cli to the master's redis server.
redis-cli -h <master server IP>
7. On the master, start redis-server first, then redis-cli, and push the start URL onto the list the spider reads from:
lpush homespider:start_urls http://zu.fang.com/cities.aspx