Python Web Scraping Tutorial - How to Scrape Data from Any Website with Python

被風吹過灼思 2021-08-24 15:35:25


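Web scraping is the process of automatically extracting specific data from the internet. It has many use cases, such as getting data for a machine learning project, building a price comparison tool, or any other innovative idea that needs a large amount of data. While you could in theory extract data by hand, the sheer volume of content on the internet makes that impractical in many cases, so knowing how to build a web scraper can come in handy. The goal of this article is to teach you how to create a web scraper in Python. You will learn how to inspect a website to prepare for scraping, extract specific data with BeautifulSoup, wait for JavaScript rendering with Selenium, and save everything to a new JSON or CSV file.

But first, I should warn you about the legality of web scraping. While the act of scraping is legal, the way you use the extracted data may not be. Make sure you are not scraping:

Copyrighted content: since it is someone's intellectual property, it is protected by law and you cannot simply reuse it.
Personal data: if the information you gather can be used to identify a person, it counts as personal data, and for EU citizens it is protected by GDPR. Unless you have a lawful reason to store it, it is best to skip it entirely.

In general, you should always read a website's terms and conditions before scraping to make sure you are not going against its policies. If you are unsure how to proceed, contact the site owner and ask for consent.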

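What Does Your Scraper Need?

To start building your own web scraper, you first need Python installed on your machine. Ubuntu 20.04 and other Linux versions come with Python 3 pre-installed.

To check whether Python is already installed on your device, run the following command:

python3 --version

If Python is installed, you should see output similar to this: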

Python 3.8.2

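In addition, our web scraper will use the Python packages BeautifulSoup (for selecting specific data) and Selenium (for rendering dynamically loaded content). To install them, run the following commands:

pip3 install beautifulsoup4

and

pip3 install selenium

The first code example in this article also imports requests. If it is not already available on your machine, install it the same way with pip3 install requests.

The last step is to make sure Google Chrome and the Chrome driver (chromedriver) are installed on your machine. Both are needed if we want to use Selenium to scrape dynamically loaded content.

If you use Firefox or another browser, you will need the matching browser driver instead.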

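How to Inspect the Page

Now that you have everything installed, it is time to start our scraping project.

Choose the website you want to scrape based on your needs. Keep in mind that every website structures its content differently, so you will need to adapt what you learn here when you start scraping on your own. Each website will require small changes to the code.

For this article, I decided to scrape information about the first ten movies from IMDb's Top 250 list: https://www.imdb.com/chart/top/.

First we will get the titles, then dig deeper by extracting information from each movie's page. Some of the data will require JavaScript rendering.

To start understanding the structure of the content, right-click the first title in the list and choose "Inspect Element".

By pressing CTRL+F and searching the HTML code structure, you will see that there is only one <table> tag on the page. This is useful because it tells us how to access the data.

One selector that gives us all the titles on the page is table tbody tr td.titleColumn a. That is because every title sits in an anchor inside a table cell with the class "titleColumn".

Using this CSS selector and reading the innerText of each anchor gives us the titles we need. You can try it out in the browser console of the developer tools window you just opened with this line of JavaScript: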

document.querySelectorAll("table tbody tr td.titleColumn a")[0].innerText

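The console should return the title of the first movie in the list.

Now that we have this selector, we can start writing Python code and extract the information we need.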
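To make the selector's meaning concrete before touching the live site, here is a minimal offline sketch. It uses Python's standard library HTMLParser rather than BeautifulSoup so it runs without a network connection or extra packages. The SAMPLE_HTML markup and the TitleAnchorParser class are made up for illustration; they only mimic the table, cell, and anchor nesting described above.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified markup that mimics the structure described above.
SAMPLE_HTML = """
<table><tbody>
  <tr><td class="posterColumn"><a href="/title/tt0000001/">poster</a></td>
      <td class="titleColumn"><a href="/title/tt0000001/">First Movie</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt0000002/">Second Movie</a></td></tr>
</tbody></table>
"""


class TitleAnchorParser(HTMLParser):
    """Collect the text of <a> tags nested inside <td class="titleColumn">."""

    def __init__(self):
        super().__init__()
        self.in_title_cell = False
        self.in_anchor = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "td":
            # Only cells carrying the titleColumn class are of interest
            classes = (attr_map.get("class") or "").split()
            self.in_title_cell = "titleColumn" in classes
        elif tag == "a" and self.in_title_cell:
            self.in_anchor = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False
        elif tag == "td":
            self.in_title_cell = False

    def handle_data(self, data):
        # Text is kept only while we are inside a matching anchor
        if self.in_anchor:
            self.titles[-1] += data


parser = TitleAnchorParser()
parser.feed(SAMPLE_HTML)
titles = [t.strip() for t in parser.titles]
print(titles)
```

The anchor in the posterColumn cell is skipped, which is the reason the selector names the td.titleColumn cell instead of matching every anchor in the table.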

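How to Extract Statically Loaded Content with BeautifulSoup

The movie titles in our list are static content. You can tell because if you view the page source (CTRL+U on the page, or right-click and choose View Page Source), the titles are already there.

Static content is usually easier to scrape because it does not require JavaScript rendering. To extract the first ten titles in the list, we will fetch the page with requests, parse it with BeautifulSoup, and print the titles in our scraper's output.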

import requests
from bs4 import BeautifulSoup
 
page = requests.get('https://www.imdb.com/chart/top/') # Getting page HTML through request
soup = BeautifulSoup(page.content, 'html.parser') # Parsing content using beautifulsoup
 
links = soup.select("table tbody tr td.titleColumn a") # Selecting all of the anchors with titles
first10 = links[:10] # Keep only the first 10 anchors
for anchor in first10:
    print(anchor.text) # Display the innerText of each anchor

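The code above uses the selector we found in the first step to extract the movie title anchors from the page. It then loops over the first ten and prints the text of each one.

The output should be the first ten titles, one per line.

How to Extract Dynamically Loaded Content

As technology advanced, websites started loading their content dynamically. This improves page performance and the user experience, but it also puts an extra obstacle in the way of scrapers.

The complication is that the HTML returned by a simple request will not contain the dynamic content. Fortunately, with Selenium we can simulate a request in a browser and wait for the dynamic content to show up.

How to Make Requests with Selenium

You need to know where your chromedriver is located. The following code does the same thing as the BeautifulSoup example above, but this time we use Selenium to make the request. We still parse the page content with BeautifulSoup as before.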

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
 
option = webdriver.ChromeOptions()
# I use the following options because my machine runs Windows Subsystem for Linux.
# Of the three, I recommend using at least the headless option.
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
# Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location.
# Selenium 4 expects the driver path wrapped in a Service object.
driver = webdriver.Chrome(service=Service('YOUR-PATH-TO-CHROMEDRIVER'), options=option)
 
driver.get('https://www.imdb.com/chart/top/') # Getting page HTML through request
soup = BeautifulSoup(driver.page_source, 'html.parser') # Parsing content using beautifulsoup. Notice driver.page_source instead of page.content
 
links = soup.select("table tbody tr td.titleColumn a") # Selecting all of the anchors with titles
first10 = links[:10] # Keep only the first 10 anchors
for anchor in first10:
    print(anchor.text) # Display the innerText of each anchor

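Don't forget to replace "YOUR-PATH-TO-CHROMEDRIVER" with the location where you extracted chromedriver. Also notice that when we create the BeautifulSoup object we now pass driver.page_source instead of page.content; it holds the HTML content of the page.

How to Extract Statically Loaded Content with Selenium

Using the code above, we could now visit each movie page by calling the click method on each anchor.

from selenium.webdriver.common.by import By

first_link = driver.find_elements(By.CSS_SELECTOR, 'table tbody tr td.titleColumn a')[0]
first_link.click()

This simulates a click on the link of the first movie. In this case, however, I recommend that you keep using driver.get instead. Once click() has taken you to a different page, you can no longer use it for the rest of the list, because the new page does not contain the links to the other nine movies.

So after clicking the first title in the list you would have to go back to the first page, click the second title, and so on. That is a waste of performance and time. Instead, we will just take the extracted links and visit them one by one.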
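As a small sketch of turning extracted links into full addresses, the standard library's urljoin can combine the chart page URL with each href. The href values below are assumptions, written in the root-relative form that matches the movie URL used in this article; check the real attribute values in your browser's inspector.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com/chart/top/"

# Hypothetical href values, shaped like root-relative links on the chart page.
hrefs = ["/title/tt0111161/", "/title/tt0068646/"]

# urljoin resolves each href against the page it was found on
movie_urls = [urljoin(BASE_URL, href) for href in hrefs]
for url in movie_urls:
    print(url)
```

Compared with plain string concatenation, this avoids a doubled slash when the href already starts with "/".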

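For "The Shawshank Redemption", the movie page is https://www.imdb.com/title/tt0111161/. We will extract the movie's year and duration from the page, but this time, as an example, we will use Selenium's functions instead of BeautifulSoup. In practice you can use either one, so pick your favorite.

To retrieve the movie's year and duration, repeat on the movie page the inspection we did in the first step.

You will notice that all the information is in the first element with the class ipc-inline-list (the ".ipc-inline-list" selector), and that every item in that list has the role attribute set to presentation (the [role='presentation'] selector).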

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
 
option = webdriver.ChromeOptions()
# I use the following options because my machine runs Windows Subsystem for Linux.
# Of the three, I recommend using at least the headless option.
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
# Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location.
# Selenium 4 expects the driver path wrapped in a Service object.
driver = webdriver.Chrome(service=Service('YOUR-PATH-TO-CHROMEDRIVER'), options=option)
 
driver.get('https://www.imdb.com/chart/top/') # Load the chart page in the browser (driver.get returns nothing)
soup = BeautifulSoup(driver.page_source, 'html.parser') # Parsing content using beautifulsoup
 
totalScrapedInfo = [] # In this list we will save all the information we scrape
links = soup.select("table tbody tr td.titleColumn a") # Selecting all of the anchors with titles
first10 = links[:10] # Keep only the first 10 anchors
for anchor in first10:
    driver.get('https://www.imdb.com/' + anchor['href']) # Access the movie's page
    infolist = driver.find_elements(By.CSS_SELECTOR, '.ipc-inline-list')[0] # Find the first element with class 'ipc-inline-list'
    informations = infolist.find_elements(By.CSS_SELECTOR, "[role='presentation']") # Find all elements with role='presentation' inside that first list
    scrapedInfo = {
        "title": anchor.text,
        "year": informations[0].text,
        "duration": informations[2].text,
    } # Save all the scraped information in a dictionary
    totalScrapedInfo.append(scrapedInfo) # Append the dictionary to the totalScrapedInformation list
    
print(totalScrapedInfo) # Display the list with all the information we scraped

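How to Extract Dynamically Loaded Content with Selenium

The next important step in web scraping is extracting dynamically loaded content. You can find this kind of content in the Editorial Lists section of each movie page (for example https://www.imdb.com/title/tt0111161/).

If you inspect the page, you will find the section as an element whose data-testid attribute is set to firstListCardGroup-editorial. But if you look at the page source, you will not find this attribute value anywhere. That is because the Editorial Lists section is loaded dynamically by IMDb.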
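The same "is it in the page source?" check can be done in code before deciding whether a plain request is enough. The sketch below assumes you already hold the raw HTML as a string; the raw_html value and the appears_in_source helper are hypothetical stand-ins, not real IMDb markup.

```python
def appears_in_source(page_source: str, marker: str) -> bool:
    """Return True if `marker` occurs in the raw HTML returned by a plain request."""
    return marker in page_source


# Hypothetical raw HTML, standing in for what a simple request would return.
raw_html = '<table><tr><td class="titleColumn"><a href="/title/tt0111161/">...</a></td></tr></table>'

static_marker = "titleColumn"                    # present in the raw HTML
dynamic_marker = "firstListCardGroup-editorial"  # only added later by JavaScript

is_static = appears_in_source(raw_html, static_marker)
is_dynamic_missing = not appears_in_source(raw_html, dynamic_marker)
print(is_static, is_dynamic_missing)
```

If a marker you can see in the inspector is missing from the raw HTML, that content is most likely rendered by JavaScript and needs a browser-driven approach such as Selenium.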

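In the following example, we will scrape the editorial lists of each movie and add them to our current total scraped info results.

To do that, we will import a few more packages so that we can wait for our dynamic content to load.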

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
 
option = webdriver.ChromeOptions()
# I use the following options because my machine runs Windows Subsystem for Linux.
# Of the three, I recommend using at least the headless option.
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
# Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location.
# Selenium 4 expects the driver path wrapped in a Service object.
driver = webdriver.Chrome(service=Service('YOUR-PATH-TO-CHROMEDRIVER'), options=option)
 
driver.get('https://www.imdb.com/chart/top/') # Load the chart page in the browser (driver.get returns nothing)
soup = BeautifulSoup(driver.page_source, 'html.parser') # Parsing content using beautifulsoup
 
totalScrapedInfo = [] # In this list we will save all the information we scrape
links = soup.select("table tbody tr td.titleColumn a") # Selecting all of the anchors with titles
first10 = links[:10] # Keep only the first 10 anchors
for anchor in first10:
    driver.get('https://www.imdb.com/' + anchor['href']) # Access the movie's page
    infolist = driver.find_elements(By.CSS_SELECTOR, '.ipc-inline-list')[0] # Find the first element with class 'ipc-inline-list'
    informations = infolist.find_elements(By.CSS_SELECTOR, "[role='presentation']") # Find all elements with role='presentation' inside that first list
    scrapedInfo = {
        "title": anchor.text,
        "year": informations[0].text,
        "duration": informations[2].text,
    } # Save all the scraped information in a dictionary
    WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[data-testid='firstListCardGroup-editorial']")))  # Wait up to 5 seconds for the element with data-testid 'firstListCardGroup-editorial' to become visible
    listElements = driver.find_elements(By.CSS_SELECTOR, "[data-testid='firstListCardGroup-editorial'] .listName") # Extracting the editorial lists elements
    listNames = [] # Creating an empty list and then appending only the elements texts
    for el in listElements:
        listNames.append(el.text)
    scrapedInfo['editorial-list'] = listNames # Adding the editorial list names to our scrapedInfo dictionary
    totalScrapedInfo.append(scrapedInfo) # Append the dictionary to the totalScrapedInformation list
    
print(totalScrapedInfo) # Display the list with all the information we scraped

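For the previous example, the output should be a list of dictionaries, one per movie, each holding the title, year, duration, and editorial list names.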
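To build intuition for what an explicit wait does, here is a rough, library-free sketch of the idea: call a condition repeatedly until it succeeds or a deadline passes. This is not Selenium's implementation. The wait_until function, its default values, and the fake content_is_loaded condition are all assumptions made for illustration.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose content only shows up on the third check.
checks = {"count": 0}


def content_is_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(content_is_loaded, timeout=2.0, poll_interval=0.01)
print(checks["count"])
```

The key design point is that the wait returns as soon as the condition holds, so a generous timeout costs nothing when the content loads quickly.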

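How to Save the Scraped Content

Now that we have all the data we wanted, we can save it as a .json or .csv file for easier reading.

To do this, we will use Python's built-in json and csv modules and write our content to new files: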

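import csv
import json
 
...
        
# The with statement closes each file once writing is done
with open('movies.json', mode='w', encoding='utf-8') as file:
    file.write(json.dumps(totalScrapedInfo))
 
# newline='' stops the csv module from writing blank rows on Windows
with open('movies.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    for movie in totalScrapedInfo:
        writer.writerow(movie.values())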
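One thing to watch with the CSV output is that the editorial-list value is itself a list, and the file has no header row. A possible variation, sketched below with csv.DictWriter, writes a header and joins the list into a single cell. The sample_info record is illustrative sample data shaped like totalScrapedInfo, with placeholder list names, and the "; " separator is an arbitrary choice.

```python
import csv
import io

# Illustrative sample shaped like totalScrapedInfo; the list names are placeholders.
sample_info = [
    {
        "title": "The Shawshank Redemption",
        "year": "1994",
        "duration": "2h 22m",
        "editorial-list": ["Placeholder list A", "Placeholder list B"],
    }
]

fieldnames = ["title", "year", "duration", "editorial-list"]
# An in-memory buffer keeps the example self-contained; for a real file,
# use open("movies.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
for movie in sample_info:
    row = dict(movie)  # copy so the scraped data is left untouched
    row["editorial-list"] = "; ".join(movie["editorial-list"])
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

With a header row in place, the file can be read back with csv.DictReader or opened in a spreadsheet without guessing which column is which.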

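Scraping Tips and Tricks

While our guide so far is already advanced enough to handle JavaScript rendering scenarios, there is still a lot more to explore in Selenium.

In this section, I will share some tips and tricks that may come in handy.

1. Time your requests

If you spam a server with hundreds of requests in a short period of time, it is very likely that at some point a captcha will show up, or your IP may even get blocked. Unfortunately, there is no workaround in Python to avoid this.

So you should put some timeout intervals between requests so that the traffic looks more natural.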

import time
import requests
 
page = requests.get('https://www.imdb.com/chart/top/') # Getting page HTML through request
time.sleep(30) # Wait 30 seconds
page = requests.get('https://www.imdb.com/') # Getting page HTML through request
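A fixed 30-second pause works, but when looping over many pages it can help to wrap the wait in a small helper and add some random variation. The polite_pause function below is a hypothetical helper, not part of any library, and the 30 and 5 second values are arbitrary assumptions. The sleep parameter exists only so the demonstration can run instantly.

```python
import random
import time


def polite_pause(base_seconds=30.0, jitter_seconds=5.0, sleep=time.sleep):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds; return the delay."""
    delay = base_seconds + random.uniform(0, jitter_seconds)
    sleep(delay)
    return delay


# Demonstration with a fake sleep (it just records the value) so nothing actually waits.
recorded = []
delay = polite_pause(base_seconds=30.0, jitter_seconds=5.0, sleep=recorded.append)
print(f"would have waited {delay:.1f} seconds")
```

In a real scraper you would call polite_pause() with its default sleep between consecutive requests.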

2. Error handling

Since websites are dynamic and can change their structure at any time, error handling can come in handy if you use the same web scraper often.

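from selenium.common.exceptions import TimeoutException

# Assumes driver, WebDriverWait, EC and By from the previous example
for attempt in range(3): # Try up to 3 times (an arbitrary limit)
    try:
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, "your selector")))
        break # The element showed up, so stop retrying
    except TimeoutException:
        # If the loading took too long, print message and try again
        print("Loading took too much time!")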

The try/except syntax is useful when you are waiting for an element, extracting it, or even just making a request.
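The same pattern can be pulled out into a reusable helper for any step that may fail temporarily. This is a minimal sketch under assumptions: the retry function, its parameters, and the flaky_fetch stand-in are all hypothetical, and a real scraper would catch narrower exception types than Exception.

```python
import time


def retry(action, attempts=3, delay_seconds=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Run `action` up to `attempts` times, re-raising the last error if all attempts fail."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as error:
            if attempt == attempts:
                raise  # Out of attempts: let the caller see the error
            print(f"attempt {attempt} failed ({error}); retrying")
            sleep(delay_seconds)


# Stand-in for a flaky request: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


html = retry(flaky_fetch, attempts=3, delay_seconds=0, sleep=lambda seconds: None)
print(html)
```

Passing the exception types as a parameter keeps the helper from hiding unrelated bugs, since anything outside that tuple still raises immediately.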

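3. Take screenshots

If you need a screenshot of the page you are scraping at any point, you can use:

driver.save_screenshot('screenshot-file-name.png')

This helps with debugging when you work with dynamically loaded content.

4. Read the documentation

Last but not least, don't forget to read Selenium's documentation. It explains how to perform most of the actions you can do in a browser.

With Selenium you can fill in forms, press buttons, answer popup messages, and do many other cool things.

If you run into a new problem, the documentation may be your best friend.

Final Thoughts

The goal of this article was to give you a high-level introduction to web scraping with Python, Selenium, and BeautifulSoup. While both technologies still have many features to explore, you now have a solid foundation for getting started.

Sometimes web scraping can be very difficult, as websites put more and more obstacles in the developer's way. Some of these obstacles are captchas, IP blocks, and dynamic content. Overcoming them with just Python and Selenium can be difficult or even impossible.

So I will also give you an alternative. Try a web scraping API that handles all of these challenges for you. It also uses rotating proxies, so you don't have to worry about adding timeouts between requests. And remember to always check whether the data you want can be legally extracted and used.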
