Crawler in Action (2): Crawling NetEase Cloud Music Playlists with Python

Recently, the blogger has fallen in love with listening to music, but has been struggling to find good songs, so he decided to browse the playlists on NetEase Cloud Music.

In line with the idea of "using technology to change life", I thought of writing a crawler to crawl NetEase Cloud's playlists and sort them automatically by play count.

In this article, we will talk about how to crawl NetEase Cloud Music playlists and sort them by play count; the final result is shown at the end of the article.

1. Use requests to crawl the NetEase Cloud playlist

Open the NetEase Cloud Music playlist home page. At first glance it looks like a static page with a very regular format, so crawling it should be quite simple.

Following the usual routine, the code can be written quickly; it consists of nothing more than the following parts:

(1) Get the source code of the webpage

Here we use requests to send the request and receive the response; the core code is as follows:

```python
import requests

def get_page(url):
    # Construct the request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    # Send the request and get the response
    response = requests.get(url=url, headers=headers)
    # Get the response content
    html = response.text
    return html
```
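As a quick sanity check, you can call the function on the playlist page directly (a minimal sketch; the printed values are just for inspection):

```python
# Sketch: fetch the playlist page and peek at what comes back
html = get_page('https://music.163.com/#/discover/playlist')
print(len(html))    # size of the returned HTML
print(html[:200])   # the beginning of the page source
```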

(2) Analyze the source code of the webpage

For the data-parsing part, we use XPath (if you are not familiar with XPath syntax, take a look at the blogger's earlier article).

The core code is as follows:

```python
from lxml import etree

# Parse the page source and extract the data
def parse4data(self, html):
    html_elem = etree.HTML(html)
    # Play count
    play_num = html_elem.xpath('//ul[@id="m-pl-container"]/li/div/div/span[@class="nb"]/text()')
    # Playlist name
    song_title = html_elem.xpath('//ul[@id="m-pl-container"]/li/p[1]/a/@title')
    # Playlist link
    song_href = html_elem.xpath('//ul[@id="m-pl-container"]/li/p[1]/a/@href')
    song_link = ['https://music.163.com/#' + item for item in song_href]
    # User name
    user_title = html_elem.xpath('//ul[@id="m-pl-container"]/li/p[2]/a/@title')
    # User link
    user_href = html_elem.xpath('//ul[@id="m-pl-container"]/li/p[2]/a/@href')
    user_link = ['https://music.163.com/#' + item for item in user_href]
    # Pack the data into a list of dictionaries, one per playlist
    data = list(map(
        lambda a, b, c, d, e: {
            'play_count': a,
            'playlist_name': b,
            'playlist_link': c,
            'user_name': d,
            'user_link': e
        },
        play_num, song_title, song_link, user_title, user_link
    ))
    return data

# Parse the page source and extract the link to the next page
def parse4link(self, html):
    html_elem = etree.HTML(html)
    # "Next page" link
    href = html_elem.xpath('//div[@id="m-pl-pager"]/div[@class="u-page"]/a[@class="zbtn znxt"]/@href')
    # Return None if there is no next page, otherwise the full link
    if not href:
        return None
    return 'https://music.163.com/#' + href[0]
```
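To see how these XPath expressions behave, here is a self-contained sketch that runs them against a made-up HTML fragment mimicking the playlist markup (the fragment is illustrative, not the real page source):

```python
from lxml import etree

# A made-up fragment mimicking the structure targeted above (illustrative only)
snippet = '''
<ul id="m-pl-container">
  <li>
    <div><div><span class="nb">1083万</span></div></div>
    <p><a title="Example playlist" href="/playlist?id=1">Example playlist</a></p>
    <p><a title="Example user" href="/user/home?id=2">Example user</a></p>
  </li>
</ul>
'''
elem = etree.HTML(snippet)
print(elem.xpath('//ul[@id="m-pl-container"]/li/div/div/span[@class="nb"]/text()'))  # ['1083万']
print(elem.xpath('//ul[@id="m-pl-container"]/li/p[1]/a/@title'))                      # ['Example playlist']
print(elem.xpath('//ul[@id="m-pl-container"]/li/p[2]/a/@title'))                      # ['Example user']
```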

(3) Complete code

```python
import requests
from lxml import etree
import json
import time
import random

class Netease_spider:

    # Initialize data
    def __init__(self):
        self.originURL = 'https://music.163.com/#/discover/playlist'
        self.data = list()

    # Get the page source
    def get_page(self, url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
        }
        response = requests.get(url=url, headers=headers)
        html = response.text
        return html

    # Parse the page source and extract the data
    def parse4data(self, html):
        html_elem = etree.HTML(html)
        play_num = html_elem.xpath('//ul[@id="m-pl-container"]/li/div/div/span[@class="nb"]/text()')
        song_title = html_elem.xpath('//ul[@id="m-pl-container"]/li/p[1]/a/@title')
        song_href = html_elem.xpath('//ul[@id="m-pl-container"]/li/p[1]/a/@href')
        song_link = ['https://music.163.com/#' + item for item in song_href]
        user_title = html_elem.xpath('//ul[@id="m-pl-container"]/li/p[2]/a/@title')
        user_href = html_elem.xpath('//ul[@id="m-pl-container"]/li/p[2]/a/@href')
        user_link = ['https://music.163.com/#' + item for item in user_href]
        data = list(map(
            lambda a, b, c, d, e: {
                'play_count': a,
                'playlist_name': b,
                'playlist_link': c,
                'user_name': d,
                'user_link': e
            },
            play_num, song_title, song_link, user_title, user_link
        ))
        return data

    # Parse the page source and extract the link to the next page
    def parse4link(self, html):
        html_elem = etree.HTML(html)
        href = html_elem.xpath('//div[@id="m-pl-pager"]/div[@class="u-page"]/a[@class="zbtn znxt"]/@href')
        if not href:
            return None
        return 'https://music.163.com/#' + href[0]

    # Start crawling
    def crawl(self):
        # Crawl the data
        print('Crawling data')
        html = self.get_page(self.originURL)
        data = self.parse4data(html)
        self.data.extend(data)
        link = self.parse4link(html)
        while link:
            html = self.get_page(link)
            data = self.parse4data(html)
            self.data.extend(data)
            link = self.parse4link(html)
            time.sleep(random.random())
        # Process the data: sort by play count (万 means 10,000)
        print('Processing data')
        data_after_sort = sorted(
            self.data,
            key=lambda item: int(item['play_count'].replace('万', '0000')),
            reverse=True
        )
        # Write to file, one JSON object per line
        print('Writing file')
        with open('netease.json', 'w', encoding='utf-8') as f:
            for item in data_after_sort:
                json.dump(item, f, ensure_ascii=False)
                f.write('\n')

if __name__ == '__main__':
    spider = Netease_spider()
    spider.crawl()
    print('Finished')
```
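One detail worth noting is the sort key: NetEase displays large play counts with the Chinese character 万 (10,000), e.g. '1083万'. Replacing 万 with '0000' turns the displayed string into a plain integer string that can be compared numerically:

```python
# How the sort key turns a displayed play count into an integer
raw = '1083万'                           # as shown on the page: 1083 × 10,000 plays
count = int(raw.replace('万', '0000'))   # '1083万' -> '10830000'
print(count)                             # 10830000
```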

2. Use selenium to crawl the NetEase Cloud playlist

However, is it really that simple?

When we run the code above, however, we find that the parsing step returns an empty list!

Why is that? Pay attention and take notes: this is a classic pitfall!

Reopen the browser and take a careful look at the page source.

It turns out that the elements we want to extract are contained inside an <iframe> tag, so we cannot locate them directly.

An iframe loads a separate page inside the original page, so when we need the elements of that embedded page, we must switch into the iframe first.
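You can confirm this quickly: the HTML that requests returns does not contain the playlist container at all (a minimal sketch, reusing get_page from above):

```python
# Sketch: the container id is absent from the page requests fetches,
# because the playlist markup lives inside the embedded iframe
html = get_page('https://music.163.com/#/discover/playlist')
print('m-pl-container' in html)   # False
```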

With the principle understood, let's modify the code above accordingly.

The idea is to use selenium to load the original page, and then call the switch_to.frame() method to switch into the iframe and get the source of the embedded page.
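Here is a minimal sketch of that switch (the frame id 'g_iframe' is the one used in the full code below):

```python
from selenium import webdriver

# Minimal sketch: load the page, then switch into the embedded iframe
browser = webdriver.Chrome()
browser.get('https://music.163.com/#/discover/playlist')
browser.switch_to.frame('g_iframe')   # enter the embedded page
html = browser.page_source            # now contains the playlist markup
browser.quit()
```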

What needs to change is the function that gets the page source; in addition, the webdriver must be instantiated in the initializer. The complete code is as follows:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from lxml import etree
import json
import time
import random

class Netease_spider:

    # Initialize data (modified: instantiate a headless webdriver)
    def __init__(self):
        # Start selenium in headless mode
        opt = Options()
        opt.add_argument('--headless')
        self.browser = webdriver.Chrome(options=opt)
        self.originURL = 'https://music.163.com/#/discover/playlist'
        self.data = list()

    # Get the page source (modified: switch into the iframe first)
    def get_page(self, url):
        self.browser.get(url)
        self.browser.switch_to.frame('g_iframe')
        html = self.browser.page_source
        return html

    # Parse the page source and extract the data
    def parse4data(self, html):
        html_elem = etree.HTML(html)
        play_num = html_elem.xpath('//ul[@id="m-pl-container"]/li/div/div/span[@class="nb"]/text()')
        song_title = html_elem.xpath('//ul[@id="m-pl-container"]/li/p[1]/a/@title')
        song_href = html_elem.xpath('//ul[@id="m-pl-container"]/li/p[1]/a/@href')
        song_link = ['https://music.163.com/#' + item for item in song_href]
        user_title = html_elem.xpath('//ul[@id="m-pl-container"]/li/p[2]/a/@title')
        user_href = html_elem.xpath('//ul[@id="m-pl-container"]/li/p[2]/a/@href')
        user_link = ['https://music.163.com/#' + item for item in user_href]
        data = list(map(
            lambda a, b, c, d, e: {
                'play_count': a,
                'playlist_name': b,
                'playlist_link': c,
                'user_name': d,
                'user_link': e
            },
            play_num, song_title, song_link, user_title, user_link
        ))
        return data

    # Parse the page source and extract the link to the next page
    def parse4link(self, html):
        html_elem = etree.HTML(html)
        href = html_elem.xpath('//div[@id="m-pl-pager"]/div[@class="u-page"]/a[@class="zbtn znxt"]/@href')
        if not href:
            return None
        return 'https://music.163.com/#' + href[0]

    # Start crawling
    def crawl(self):
        # Crawl the data
        print('Crawling data')
        html = self.get_page(self.originURL)
        data = self.parse4data(html)
        self.data.extend(data)
        link = self.parse4link(html)
        while link:
            html = self.get_page(link)
            data = self.parse4data(html)
            self.data.extend(data)
            link = self.parse4link(html)
            time.sleep(random.random())
        # Process the data: sort by play count (万 means 10,000)
        print('Processing data')
        data_after_sort = sorted(
            self.data,
            key=lambda item: int(item['play_count'].replace('万', '0000')),
            reverse=True
        )
        # Write to file, one JSON object per line
        print('Writing file')
        with open('netease.json', 'w', encoding='utf-8') as f:
            for item in data_after_sort:
                json.dump(item, f, ensure_ascii=False)
                f.write('\n')

if __name__ == '__main__':
    spider = Netease_spider()
    spider.crawl()
    print('Finished')
```
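Because the crawler writes one JSON object per line, reading the results back is straightforward; a small sketch:

```python
import json

# Read the sorted results back (one JSON object per line, as written above)
with open('netease.json', encoding='utf-8') as f:
    playlists = [json.loads(line) for line in f]

# Show the top ten playlists by play count
for item in playlists[:10]:
    print(item['play_count'], item['playlist_name'])
```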

This gives us the current top ten playlists on NetEase Cloud Music (haha, now we can happily listen to music again):

  1. TOP100 Hottest New Songs of 2018
  2. I heard that you are also looking for good Chinese songs
  3. Featured | Internet hot song sharing
  4. The loneliness of the old fan next door
  5. Gentle crit | Indulge in the sweet town of boyfriend voice
  6. Who says cover songs are not good
  7. If you have nostalgic dreams, don't end up without disease
  8. KTV must-have: Is there a song, tears ran away as you sang
  9. Make up and take pictures BGM.
  10. The storytelling male voice sings lyrics too much like himself