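[python] [VI coding] Chapter 20: Web Crawlers

Chapter 20: Web Crawlers

A web crawler is a bot program that browses websites on the internet, collects the publicly available information on them, and hands it to developers to use.

This chapter goes straight to a hands-on implementation with explanations along the way; the underlying theory and more advanced applications are left for you to look up and read on your own.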

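20-0 Project Goal

The goal is to look up words on the website of the Ministry of Education's Revised Mandarin Chinese Dictionary (教育部重編國語辭典) and save the results to a file, so that they can be browsed quickly with NVDA.

The original workflow on the web page is:

Enter the word to look up
Choose the search conditions
Submit the query
Browse the matching words and their correct Zhuyin (注音) readings
Click a word's link to read its definition

The web page itself is not complicated, but if you want to browse all of the results in a single file, some customization is needed.

First, narrow the project down to the three kinds of information it needs: the word, its Zhuyin reading, and its definition. After the crawler collects them, they are processed as follows and saved as one html file:

h1 heading: the search keyword
h2 heading: each matching word together with its Zhuyin reading
The definition goes directly below each word

With this layout, an NVDA user only needs to press 2 on the main keyboard's number row to jump to each word, and can then press the down arrow to read its definition.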

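20-1 Fetching Web Page Information

Use the urlopen() function from the urllib module to read a web page's content and related information.

urllib itself is part of the Python standard library and needs no installation. The original handout also installs the builtwith package with pip at this point, although none of the code in this chapter imports it:

pip install builtwith

With that done, use urlopen() to read the page content: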

import urllib.request
import urllib.error

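# URL of the Revised Mandarin Chinese Dictionary search page
url = 'https://dict.revised.moe.edu.tw/search.jsp?md=1'
# Raise an error if the page has not opened within 3 seconds
timeout = 3
# Connect and fetch the data. timeout is one of several optional parameters, so passing it by keyword is clearer
response = urllib.request.urlopen(url, timeout=timeout)
# Print the HTTP status code of the response
print(response.getcode())
# 200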

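First confirm that the page opened normally; a status code of 200 means success.

While crawling, the site may be down or the network congested, so it is safer to confirm that the page opened successfully before moving on to the next step.

For the other status codes, see a reference on HTTP status codes. The response also carries some additional information, as in this example: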

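# Print the URL that was actually opened. Normally it is the URL we requested, but some sites redirect automatically
print(response.geturl())
# https://dict.revised.moe.edu.tw/search.jsp?md=1
# Print the response headers; this is metadata about the page, not the page content
print(response.info())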
# Cache-Control: private
# Expires: Thu, 01 Jan 1970 00:00:00 GMT
# Strict-Transport-Security: max-age=0
# X-Frame-Options: SAMEORIGIN
# X-Content-Type-Options: nosniff
# X-XSS-Protection: 1
# Set-Cookie: JSESSIONID=AE07EFBABA9C053CAD669CE45F8E661F; Path=/; Secure; HttpOnly
# Date: Fri, 03 Dec 2021 03:13:22 GMT
# Content-Language: zh-Hant-TW
# Server-Timing: total;dur=13.6243
# vary: accept-encoding
# Content-Type: text/html;charset=UTF-8
# Transfer-Encoding: chunked
# Connection: close

What happens if the URL is wrong?

url = 'https://dact.revised.moe.edu.tw/search.jsp?md=1'
response = urllib.request.urlopen(url)
# urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>

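In the example above, dict in the URL was changed to dact, and urlopen() immediately raised a URLError. This is also why urllib.error was imported at the start: it lets you handle this situation directly with try-except.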
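As a minimal sketch of that try-except handling: the helper name fetch_status is hypothetical (it is not part of the handout), and the function simply returns None when the page cannot be opened. HTTPError is a subclass of URLError, so it has to be caught first.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=3):
    """Return the HTTP status code for url, or None if the request fails.

    fetch_status is a hypothetical helper name used only for this sketch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.getcode()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500
        print('HTTP error:', err.code)
        return None
    except urllib.error.URLError as err:
        # No usable answer at all, for example a host name that does not resolve
        print('Could not open the URL:', err.reason)
        return None
```

Calling fetch_status() with the misspelled dact address from the example should take the URLError branch and return None instead of stopping the program.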

Next, display the page content, using the response obtained from the correct URL:

for i in range(10):
	print(response.readline())
# b'<!DOCTYPE html>\r\n'
# b"<html lang='zh-Hant-TW'>\r\n"
# b'<head>\r\n'
# b'<meta charset="UTF-8" />\r\n'
# b'<!-- <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" /> -->\r\n'
# b'<meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=5" />\r\n'
# b'<!-- \xe5\x9c\xb0\xe7\x90\x86 -->\r\n'
# b'<!-- \r\n'
# b'<meta name="ICBM" content="24.9817731,121.3850242" />\r\n'
# b'<meta name="geo.position" content="24.9817731,121.3850242" />\r\n'

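A web page contains a lot of data, so only the first 10 lines are shown here for confirmation; you could also write the whole content to a text file and read it at your own pace.

The returned content is a bytes string. As covered earlier, the decode() method converts it to an ordinary string, but what comes back is still the page's source code, that is, html.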
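A small sketch of that decode() step, using one of the raw lines from the output above. The encoding is taken from the charset=UTF-8 shown in the response headers; the variable names are only illustrative.

```python
# One of the raw lines returned by readline() in the output above
raw_line = b'<!-- \xe5\x9c\xb0\xe7\x90\x86 -->\r\n'

# The response headers declare charset=UTF-8, so decode with that encoding
text_line = raw_line.decode('utf-8')

# rstrip() drops the trailing \r\n so the line prints cleanly
print(text_line.rstrip())
# <!-- 地理 -->
```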

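In other words, what you normally see in a browser is the final result after the browser has parsed the html and its related resources. The raw data we fetched therefore has to be processed before it becomes useful.

Of course, you can also save the fetched content as an html file and open that file in a browser to view and interact with it directly.

In addition, urlopen supports protocols including http, https, and ftp.

20-2 Getting the Query Results

The site used in the previous section is the Revised Mandarin Chinese Dictionary. The page has an edit field where you type the word you want to look up. We want the crawler to obtain the results for a word directly.

Try typing a search term into the edit field, for example 「感情」, submit the query, and watch how the URL changes.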

https://dict.revised.moe.edu.tw/search.jsp?md=1&word=感情&qMd=0&qCol=1
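# The & (and) symbol joins one parameter to the next in a URL. Anyone who has studied operators will recognize it, but this is URL syntax and has nothing to do with python.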

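The URL above is the results page for 「感情」. The word parameter is the word we searched for, so qMd and qCol are presumably the modes that can be chosen when searching.

Go back to the initial search page and, under the search conditions, switch from 「關鍵字檢索」 (keyword search) to 「義類檢索」 (semantic-category search); these are radio buttons, so only one of the two can be selected. Submit the query and the qMd=0 in the URL becomes qMd=1.

Go back to the search page again, uncheck the 「詞目」 (headword) checkbox, and check 「音讀」 (pronunciation) instead. Submit the query and the qCol=1 in the URL becomes qCol=2.

Return to the search page once more and check both 「詞目」 and 「音讀」, then submit the query. The end of the URL now carries both qCol=1 and qCol=2.

These tests show how the site combines the URL with the search conditions to reach the results page directly.

If you want to write a dictionary lookup program, you can let the user enter the text to search for, assemble the URL from that information, and send it straight to the dictionary site. Our crawler can of course fetch the results the same way.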
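The URL assembly described above could be sketched with urllib.parse.urlencode from the standard library. The function name build_search_url and its parameter names q_md and q_cols are hypothetical; the parameter meanings are only what was inferred from watching the address bar, not a documented API of the site.

```python
from urllib.parse import urlencode

BASE_URL = 'https://dict.revised.moe.edu.tw/search.jsp'


def build_search_url(word, q_md=0, q_cols=(1,)):
    """Assemble a results-page URL from a keyword and the observed parameters.

    build_search_url, q_md and q_cols are hypothetical names for this sketch.
    """
    params = [('md', 1), ('word', word), ('qMd', q_md)]
    # Repeat qCol once per ticked checkbox, as seen in the browser
    params.extend(('qCol', col) for col in q_cols)
    # urlencode percent-encodes the keyword as UTF-8 by default
    return BASE_URL + '?' + urlencode(params)


print(build_search_url('感情'))
# https://dict.revised.moe.edu.tw/search.jsp?md=1&word=%E6%84%9F%E6%83%85&qMd=0&qCol=1
```

Passing q_cols=(1, 2) yields a URL ending in qCol=1&qCol=2, matching what the browser produced when both checkboxes were ticked.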

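20-3 Tools for Analyzing the Data

Fetching the data is not hard, and we have already done it successfully. The difficult part is extracting the information we want from that data.

Some familiarity with html and with how web pages are structured helps a lot when picking out the information you want, especially when the page structure is complicated.

Another option is the scrapy crawling framework with XPath selectors to filter the data, but working with it feels more like operating a piece of software; interested readers can explore it on their own.

Here we introduce a different pair of packages, requests and beautiful soup, which allow more flexible analysis of the data. What we did so far was just testing the waters; now we get serious.

You could still fetch the page with urllib as before, but requests hands back the content as an ordinary string (response.text) without a manual decode() step.

First install the two packages: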

pip install requests
pip install bs4

Use requests to fetch the data, then pretty-print it and output a few lines to take a look:

import requests
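from bs4 import BeautifulSoup as bs  # the class name is long, so alias it as bs

# URL of the results page for the query 「感情」
url = 'https://dict.revised.moe.edu.tw/search.jsp?md=1&word=感情&qMd=0&qCol=1'
# Send the request with the HTTP GET method
response = requests.get(url)
print(response)
# <Response [200]>
# A status code of 200 is reassuring
# Parse the page content
soup = bs(response.text, 'html.parser')
# soup now holds the page's html, but we want it formatted, that is, automatically indented, so it is easier to read
source = soup.prettify()
# Confirm that the type is str
print(type(source))
# <class 'str'>
# Show the first 100 characters. A glance is enough; the full content is overwhelming
print(source[:100])
# <!DOCTYPE html>
# <html lang="zh-Hant-TW">
#  <head>
#  <meta charset="utf-8"/>
#  <!-- <meta http-equiv="
# It is indeed indented and much easier to read, but still some way from the information we want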

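20-4 Searching for Nodes

The next step is to process the fetched data and pick out the parts we want.

20-4-1 Finding by html Tag

The query results show that the 「感情」 inside the h3 tag is the title of the search results, so use find() to locate that tag: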

title = soup.find('h3')
print(title)
# <h3>感情</h3>
# But we only want the text with the html markup removed, so use the getText() method
print(title.getText())
# 感情

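If you want every match returned at once, use find_all instead. It returns the matches as a list, and it can search for several tags at the same time; you then iterate over the result with a for loop.

20-4-2 Filtering Data with select

With the title in hand, we also want the query results, that is, the link for each word. The link element is a, but the page contains many links. Inspection shows that the search results sit inside a table, so we narrow the search to links inside the table.

# Find only the links inside the table, then print them one by one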
result = soup.find('table')
result = result.select('a')
for a in result:
	print(a)
# <a class="act" href="/search.jsp?md=1&word=%E6%84%9F%E6%83%85&qMd=0&qCol=1">正文(5)</a>
# <a class="noPrint" href="/search.jsp?md=3&word=%E6%84%9F%E6%83%85&qMd=0&qCol=1&noti=1">附錄(2)</a>
# <a href="dictView.jsp?ID=70156&q=1&word=%E6%84%9F%E6%83%85"><cr>感情</cr></a>
# <a href="dictView.jsp?ID=70157&q=1&word=%E6%84%9F%E6%83%85"><cr>感情</cr>戲</a>
# <a href="dictView.jsp?ID=70158&q=1&word=%E6%84%9F%E6%83%85"><cr>感情</cr>作用</a>
# <a href="dictView.jsp?ID=70159&q=1&word=%E6%84%9F%E6%83%85"><cr>感情</cr>用事</a>
# <a href="dictView.jsp?ID=131799&q=1&word=%E6%84%9F%E6%83%85">傷<cr>感情</cr></a>

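The html structure of a link is actually simple, so don't be intimidated: a is the html element name for a link, the href attribute holds the URL, and the text between <a> and </a> is the link's display text.

Looking closely, the first two links are not ones we want. On further inspection, every link to a word result has an href that begins with dict (dictView.jsp), so use the get() method to read the href attribute and keep only the links whose URL begins with dict:

for a in result:
	if a.get('href', '').startswith('dict'):
		print(a.getText())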
# 感情  
# 感情戲   
# 感情作用    
# 感情用事    
# 傷感情

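That is what we wanted. So how do we get the Zhuyin reading of each word?

The content of soup shows that every column in the table is wrapped in a td, and the td of the third column, the Zhuyin column, carries class='ph'. So any td whose class is ph holds a Zhuyin reading: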

result = soup.find_all('td', class_='ph')
for c in result:
	print(c.getText())
# ㄍㄢˇ ㄑㄧㄥˊ       
# ㄍㄢˇ ㄑㄧㄥˊ ㄒㄧˋ          
# ㄍㄢˇ ㄑㄧㄥˊ ㄗㄨㄛˋ ㄩㄥˋ              
# ㄍㄢˇ ㄑㄧㄥˊ ㄩㄥˋ ㄕˋ            
# ㄕㄤ ㄍㄢˇ ㄑㄧㄥˊ

Because class is a reserved word in python, the parameter name used above is class_ (with a trailing underscore).
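The two lookups above can also be combined row by row, so that each word stays paired with the reading from its own table row. This is only a sketch: SAMPLE_HTML is a hand-made, heavily simplified stand-in for the real results table (the real page has more columns and attributes), and extract_pairs is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-made, simplified stand-in for the results table (an assumption, not the real page)
SAMPLE_HTML = """
<table>
  <tr><td><a href="/search.jsp?md=3">附錄(2)</a></td><td></td></tr>
  <tr><td><a href="dictView.jsp?ID=1"><cr>感情</cr></a></td><td class="ph">ㄍㄢˇ ㄑㄧㄥˊ</td></tr>
  <tr><td><a href="dictView.jsp?ID=2">傷<cr>感情</cr></a></td><td class="ph">ㄕㄤ ㄍㄢˇ ㄑㄧㄥˊ</td></tr>
</table>
"""


def extract_pairs(html):
    """Return (word, zhuyin) tuples, one per table row that has both pieces."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    for row in soup.find_all('tr'):
        link = row.find('a', href=True)
        reading = row.find('td', class_='ph')
        # Skip rows that lack a result link or a Zhuyin cell
        if link is None or reading is None:
            continue
        if not link['href'].startswith('dictView'):
            continue
        pairs.append((link.get_text(strip=True), reading.get_text(strip=True)))
    return pairs


print(extract_pairs(SAMPLE_HTML))
# [('感情', 'ㄍㄢˇ ㄑㄧㄥˊ'), ('傷感情', 'ㄕㄤ ㄍㄢˇ ㄑㄧㄥˊ')]
```

Pairing inside each row avoids relying on two separately collected lists happening to line up; whether the real table always has one link and one ph cell per row would need to be checked against the live page.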

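20-5 Crawling Data in a Loop

We now have two of the three kinds of information, the word and its Zhuyin reading; only the definition is missing.

On the site, a definition is visible only after following the word's link, so we need the URL of each word. The a links collected earlier have href values like this: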

dictView.jsp?ID=131799&q=1&word=%E6%84%9F%E6%83%85

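This is clearly not a complete URL; it is a relative one. Because the link points to a page on the same site, the beginning of the address can be omitted.

But we are now linking from outside the site, so we must prepend the dictionary's base URL, that is, everything from https up to the slash after the host name, followed by the href of the a link, like this: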

https://dict.revised.moe.edu.tw/dictView.jsp?ID=131799&q=1&word=%E6%84%9F%E6%83%85

To test it, open the URL in a browser. It shows the result for 「傷感情」, which confirms the guess.
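The standard library can do this joining for you. The sketch below uses urllib.parse.urljoin, which resolves a relative href against the URL of the page it was found on; it is an alternative to assembling the string by hand, not the approach the handout itself uses later. The variable names are illustrative.

```python
from urllib.parse import urljoin

# The results page the links were found on
page_url = 'https://dict.revised.moe.edu.tw/search.jsp?md=1&word=%E6%84%9F%E6%83%85&qMd=0&qCol=1'

# A relative href without a leading slash is resolved against the page's folder
detail_url = urljoin(page_url, 'dictView.jsp?ID=131799&q=1&word=%E6%84%9F%E6%83%85')
print(detail_url)
# https://dict.revised.moe.edu.tw/dictView.jsp?ID=131799&q=1&word=%E6%84%9F%E6%83%85

# An href that starts with a slash is resolved against the site root
appendix_url = urljoin(page_url, '/search.jsp?md=3')
print(appendix_url)
# https://dict.revised.moe.edu.tw/search.jsp?md=3
```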

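Of the information on that page, usually only the 「釋義」 (definition) part is needed. It sits in the fourth row of the table: the first column holds the heading 「釋義」 and the second column holds the content we want. If the word has antonyms, the definition may be in the fifth row instead, but whichever row it is in, the program's filter condition is unaffected.

Using the same approach as before, fetch the page for 「傷感情」 and print the table's td elements to inspect them. The td that holds the definition has a headers attribute with the value col4, so we can do this: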

import requests
from bs4 import BeautifulSoup as bs  # the class name is long, so alias it as bs

url = 'https://dict.revised.moe.edu.tw/dictView.jsp?ID=131799&q=1&word=%E6%84%9F%E6%83%85'
response = requests.get(url)
soup = bs(response.text, 'html.parser')
result = soup.find_all('td')
for t in result:
	# A td without a headers attribute makes get() return None, so fall back to an empty list
	if 'col4' in t.get('headers', []):
		print(t.getText())
# 使原有的交情受到傷害。如：「你說這話就傷感情了！」

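Notice how powerful getText() is: the td actually contains a pile of html tags, but they are all filtered out automatically, which saves us a lot of work.

Because we need the definition of every word found, a loop is required to fetch the definition of each search result.

To sum up, we now have every kind of information we wanted. The last step is to integrate it into the following program:

20-6 The Dictionary Lookup Program

Observation and testing have produced many code fragments; now they need to be organized into a complete program.

20-6-1 Main Flow

The project's purpose is clear. My own habit is to write the main flow first; once the flow is settled, it is just a matter of completing the features one by one in order.

# search.py:
# Import the Dict class from Dict.py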
from Dict import Dict

dict = Dict()
keyword = dict.inputSearch()
source_data = dict.getData(keyword)
title = dict.getTitle(source_data)
links = dict.getLinks(source_data)
words = dict.getWords(links)
address = dict.getAddress(links)
phonics = dict.getPhonics(source_data)
explain = dict.getExplain(address)
result = dict.makeDic(words, phonics, explain)
dict.makeResultHtmlFile(title, result)

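Keep the flow as simple as possible and split the work into separate tasks. Complex code belongs inside the class methods, so that the flow itself only has to focus on using them.

If the flow is concise and the class, method, and variable names are clear, the engineer who develops the class can write the required components from the flow alone, and anyone using it can at least follow the logic easily.

The main flow above has no comments, but the purpose of each method should be clear. The code is split into two files: the Dict class itself, and the main flow, which imports the Dict class. Note that naming the instance dict shadows python's built-in dict type inside search.py; this is harmless here because the script never uses the built-in.

20-6-2 Development Strategy

As learned earlier, each feature or method should be tested as soon as it is developed. So first comment out the code that is not needed yet, and confirm that the import is written correctly: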

from Dict import Dict

dict = Dict()
keyword = dict.inputSearch()
print(keyword)
'''
source_data = dict.getData(keyword)
title = dict.getTitle(source_data)
links = dict.getLinks(source_data)
words = dict.getWords(links)
address = dict.getAddress(links)
phonics = dict.getPhonics(source_data)
explain = dict.getExplain(address)
result = dict.makeDic(words, phonics, explain)
dict.makeResultHtmlFile(title, result)
'''

Following the flow, create a Dict class, skip the initializer for now, and test the method that reads the search text:

# Dict.py:
class Dict():
	def __init__(self):
		pass
	def inputSearch(self):
		return 'abc'

Run search.py and confirm that abc is printed:

D:\>python search.py
abc

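You can still run the program from the environment integrated with notepad++; the command line is used here only so that you can see clearly what was done.

With this layout, remember that Dict.py and search.py must be placed in the same folder.

After developing each method, move the comment markers in the main program and test the methods one by one to confirm that each works.

20-6-3 Entering the Search Word

For initialization, start with a main_url variable that stores the URL of the search query. Its default value is an empty string, because the URL can only be built once the input word is known. Add other variables later as development requires.

Start with the simplest method, the one that reads the search word:

# Inside the Dict class in Dict.py: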
	def __init__(self):
		self.main_url = ''
	def inputSearch(self):
		keyword = input('請輸入欲查詢文字：')
		return keyword

Output:

請輸入欲查詢文字：感情           
感情

20-6-4 Getting the Search Results

Dict.py uses requests and beautiful soup, so remember to import the modules:

import requests
from bs4 import BeautifulSoup as bs

# Inside the Dict class in Dict.py:
	def getData(self, keyword):
		self.main_url = 'https://dict.revised.moe.edu.tw/search.jsp?md=1&word=' + keyword + '&qMd=0&qCol=1'
		response = requests.get(self.main_url)
		soup = bs(response.text, 'html.parser')
		return soup

Modify the main flow to make testing easier:

from Dict import Dict

dict = Dict()
keyword = dict.inputSearch()
source_data = dict.getData(keyword)
print(str(source_data)[:100])
'''
title = dict.getTitle(source_data)
links = dict.getLinks(source_data)
words = dict.getWords(links)
address = dict.getAddress(links)
phonics = dict.getPhonics(source_data)
explain = dict.getExplain(address)
result = dict.makeDic(words, phonics, explain)
dict.makeResultHtmlFile(title, result)
'''

The returned soup is a beautiful soup object, so it has to be converted to a string before it can be sliced and displayed. Again, because there is so much content, only the opening fragment is shown.

Output:

請輸入欲查詢文字：感情           
<!DOCTYPE html>

<html lang="zh-Hant-TW">
<head>
<meta charset="utf-8"/>
<!-- <meta http-equiv="Cont

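The main program accumulates test code such as print calls, so once a test passes, delete the earlier test snippet before writing and testing the next method.

20-6-5 Getting the Query Title

The next few methods all analyze the fetched data and pick out the parts we want. Keep in mind the principle that a function or method should do just one thing.

# Inside the Dict class in Dict.py: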
	def getTitle(self, data):
		title = data.find('h3')
		return title.getText()

Modify the main flow to make testing easier:

from Dict import Dict

dict = Dict()
keyword = dict.inputSearch()
source_data = dict.getData(keyword)
title = dict.getTitle(source_data)
print(title)
'''
links = dict.getLinks(source_data)
words = dict.getWords(links)
address = dict.getAddress(links)
phonics = dict.getPhonics(source_data)
explain = dict.getExplain(address)
result = dict.makeDic(words, phonics, explain)
dict.makeResultHtmlFile(title, result)
'''

Output:

請輸入欲查詢文字：感情           
感情

This looks identical to the result of testing inputSearch(), but the process and the code behind it are different.

20-6-6 Getting the Links

# Inside the Dict class in Dict.py:
	def getLinks(self, data):
		links = []
		result = data.find('table')
		result = result.select('a')
		for a in result:
			if a.get('href', '').startswith('dict'):
				links.append(a)
		return links

Modify the main flow to make testing easier:

from Dict import Dict

dict = Dict()
keyword = dict.inputSearch()
source_data = dict.getData(keyword)
title = dict.getTitle(source_data)
links = dict.getLinks(source_data)
for link in links:
	print(link)
'''
words = dict.getWords(links)
address = dict.getAddress(links)
phonics = dict.getPhonics(source_data)
explain = dict.getExplain(address)
result = dict.makeDic(words, phonics, explain)
dict.makeResultHtmlFile(title, result)
'''

Output:

請輸入欲查詢文字：感情           
<a href="dictView.jsp?ID=70156&q=1&word=%E6%84%9F%E6%83%85"><cr>感情</cr></a>
<a href="dictView.jsp?ID=70157&q=1&word=%E6%84%9F%E6%83%85"><cr>感情</cr>戲</a>
<a href="dictView.jsp?ID=70158&q=1&word=%E6%84%9F%E6%83%85"><cr>感情</cr>作用</a>
<a href="dictView.jsp?ID=70159&q=1&word=%E6%84%9F%E6%83%85"><cr>感情</cr>用事</a>
<a href="dictView.jsp?ID=131799&q=1&word=%E6%84%9F%E6%83%85">傷<cr>感情</cr></a>

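The whole link is kept this time because it carries two pieces of information we need: the word and the URL. The next two methods can therefore pull what they need straight from the links instead of searching the fetched data again.

20-6-7 Getting the Words

# Inside the Dict class in Dict.py: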
	def getWords(self, links):
		words = []
		for link in links:
			words.append(link.getText())
		return words

Modify the main flow to make testing easier:

from Dict import Dict

dict = Dict()
keyword = dict.inputSearch()
source_data = dict.getData(keyword)
title = dict.getTitle(source_data)
links = dict.getLinks(source_data)
words = dict.getWords(links)
for word in words:
	print(word)
'''
address = dict.getAddress(links)
phonics = dict.getPhonics(source_data)
explain = dict.getExplain(address)
result = dict.makeDic(words, phonics, explain)
dict.makeResultHtmlFile(title, result)
'''

Output:

請輸入欲查詢文字：感情           
感情  
感情戲   
感情作用    
感情用事    
傷感情

20-6-8 Getting the URLs

# Inside the Dict class in Dict.py:
	def getAddress(self, links):
		address = []
		for link in links:
			address.append(self.main_url.split('search')[0] + link.get('href'))
		return address

Modify the main flow to make testing easier:

from Dict import Dict

dict = Dict()
keyword = dict.inputSearch()
source_data = dict.getData(keyword)
title = dict.getTitle(source_data)
links = dict.getLinks(source_data)
words = dict.getWords(links)
address = dict.getAddress(links)
for a in address:
	print(a)
'''
phonics = dict.getPhonics(source_data)
explain = dict.getExplain(address)
result = dict.makeDic(words, phonics, explain)
dict.makeResultHtmlFile(title, result)
'''

Output:

請輸入欲查詢文字：感情           
https://dict.revised.moe.edu.tw/dictView.jsp?ID=70156&q=1&word=%E6%84%9F%E6%83%85
https://dict.revised.moe.edu.tw/dictView.jsp?ID=70157&q=1&word=%E6%84%9F%E6%83%85
https://dict.revised.moe.edu.tw/dictView.jsp?ID=70158&q=1&word=%E6%84%9F%E6%83%85
https://dict.revised.moe.edu.tw/dictView.jsp?ID=70159&q=1&word=%E6%84%9F%E6%83%85
https://dict.revised.moe.edu.tw/dictView.jsp?ID=131799&q=1&word=%E6%84%9F%E6%83%85

20-6-9 Getting the Zhuyin Readings

# Inside the Dict class in Dict.py:
	def getPhonics(self, data):
		phonics = []
		result = data.find_all('td', class_='ph')
		for p in result:
			phonics.append(p.getText())
		return phonics

Modify the main flow to make testing easier:

from Dict import Dict

dict = Dict()
keyword = dict.inputSearch()
source_data = dict.getData(keyword)
title = dict.getTitle(source_data)
links = dict.getLinks(source_data)
words = dict.getWords(links)
address = dict.getAddress(links)
phonics = dict.getPhonics(source_data)
for p in phonics:
	print(p)
'''
explain = dict.getExplain(address)
result = dict.makeDic(words, phonics, explain)
dict.makeResultHtmlFile(title, result)
'''

Output:

請輸入欲查詢文字：感情           
 ㄍㄢˇ ㄑㄧㄥˊ       
 ㄍㄢˇ ㄑㄧㄥˊ ㄒㄧˋ          
 ㄍㄢˇ ㄑㄧㄥˊ ㄗㄨㄛˋ ㄩㄥˋ              
 ㄍㄢˇ ㄑㄧㄥˊ ㄩㄥˋ ㄕˋ            
 ㄕㄤ ㄍㄢˇ ㄑㄧㄥˊ

20-6-10 Getting the Definitions

# Inside the Dict class in Dict.py:
	def getExplain(self, address):
		explain = []
		for url in address:
			response = requests.get(url)
			soup = bs(response.text, 'html.parser')
			result = soup.find_all('td')
			for t in result:
				if 'col4' in t.get('headers', []):
					explain.append(t.getText())
		return explain

Modify the main flow to make testing easier:

from Dict import Dict

dict = Dict()
keyword = dict.inputSearch()
source_data = dict.getData(keyword)
title = dict.getTitle(source_data)
links = dict.getLinks(source_data)
words = dict.getWords(links)
address = dict.getAddress(links)
phonics = dict.getPhonics(source_data)
explain = dict.getExplain(address)
for e in explain:
	print(e)
'''
result = dict.makeDic(words, phonics, explain)
dict.makeResultHtmlFile(title, result)
'''

Output:

請輸入欲查詢文字：感情           
受外界刺激所產生的情緒。如：「他太感情用事了。」人與人之間的交情。如：「他們二人一向來往密切，感情很好。」觸動情感。《文選．劉伶．酒德頌》：「不覺寒暑之切肌，利欲之感情。」被別人的情意所感動，而表示感謝之情。如：「媽媽幫了他大忙之後，他感情不過，送了我們一籃水果。」《兒女英雄傳》第二四回：「伯父、伯母，今日此舉，不但我父母感情不盡，便是我何玉鳳也受惠無窮。」
表現內心情感的戲劇。如親情、友情、愛情等。
不以事理的正誤曲直作判斷，僅憑心理的好惡而有所行動。如：「你可別因一時感情作用而鑄下大錯。」
憑個人好惡和一時的情感衝動處理事情。如：「你這樣感情用事，於事無補。」
使原有的交情受到傷害。如：「你說這話就傷感情了！」

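20-6-11 Formatting the Information

Format the information as a list of dictionaries, which is convenient both for display and for writing to a file:

# Inside the Dict class in Dict.py: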
	def makeDic(self, words, phonics, explain):
		result = []
		for w, p, e in zip(words, phonics, explain):
			dic = {}
			dic['word'] = w
			dic['phonic'] = p
			dic['explain'] = e
			result.append(dic)
		return result

Modify the main flow to make testing easier:

from Dict import Dict

dict = Dict()
keyword = dict.inputSearch()
source_data = dict.getData(keyword)
title = dict.getTitle(source_data)
links = dict.getLinks(source_data)
words = dict.getWords(links)
address = dict.getAddress(links)
phonics = dict.getPhonics(source_data)
explain = dict.getExplain(address)
result = dict.makeDic(words, phonics, explain)
print(result)
#dict.makeResultHtmlFile(title, result)

Output:

請輸入欲查詢文字：感情           
[{'word': '感情', 'phonic': ' ㄍㄢˇ ㄑㄧㄥˊ', 'explain': '受外界刺激所產生的情緒。如：「他太感情用事了。」人與人之間的交情。如：「他們二人一向來往密切，感情很好。」觸動情感。《文選．劉伶．酒德頌》：「不覺寒暑之切肌，利欲之感情。」被別人的情意所感動，而表示感謝之情。如：「媽媽幫了他大忙之後，他感情不過，送了我們一籃水果。」《兒女英雄傳》第二四回：「伯父、伯母，今日此舉，不但我父母感情不盡，便是我何玉鳳也受惠無窮。」'}, {'word': '感情戲', 'phonic': ' ㄍㄢˇ ㄑㄧㄥˊ ㄒㄧˋ', 'explain': '表現內心情感的戲劇。如親情、友情、愛情等。'}, {'word': '感情作用', 'phonic': ' ㄍㄢˇ ㄑㄧㄥˊ ㄗㄨㄛˋ ㄩㄥˋ', 'explain': '不以事理的正誤曲直作判斷，僅憑心理的好惡而有所行動。如：「你可別因一時感情作用而鑄下大錯。」'}, {'word': '感情用事', 'phonic': ' ㄍㄢˇ ㄑㄧㄥˊ ㄩㄥˋ ㄕˋ', 'explain': '憑個人好惡和一時的情感衝動處理事情。如：「你這樣感情用事，於事無補。」'}, {'word': '傷感情', 'phonic': ' ㄕㄤ ㄍㄢˇ ㄑㄧㄥˊ', 'explain': '使原有的交情受到傷害。如：「你說這話就傷感情了！」'}]
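One thing to keep in mind is that zip() stops silently at the shortest list, so if one lookup returned fewer items than the others, words and definitions could end up mismatched or dropped without any error. A hedged sketch of a stricter variant follows; make_entries is a hypothetical standalone function, not the handout's makeDic method, and raising ValueError is just one possible policy.

```python
def make_entries(words, phonics, explains):
    """Pair up the three lists, refusing to continue if their lengths differ.

    make_entries is a hypothetical name used only for this sketch.
    """
    if not (len(words) == len(phonics) == len(explains)):
        raise ValueError('words, phonics and explains must have the same length')
    # Same dictionary layout as makeDic: one dict per word
    return [
        {'word': w, 'phonic': p, 'explain': e}
        for w, p, e in zip(words, phonics, explains)
    ]


entries = make_entries(['傷感情'], [' ㄕㄤ ㄍㄢˇ ㄑㄧㄥˊ'], ['使原有的交情受到傷害。'])
print(entries[0]['word'])
# 傷感情
```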

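20-6-12 Writing the File

The last method is simple: write the information to a file, remembering to add the necessary html tags and newline characters, and add a filename variable to the initializer to store the output file name:

# Inside the Dict class in Dict.py: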
	def __init__(self):
		self.main_url = ''
		self.filename = 'dict.html'
	def makeResultHtmlFile(self, title, result):
		with open(self.filename, 'w', encoding='utf-8') as fin:
			fin.write('<h1>' + title + '</h1>' + '\n')
			for r in result:
				fin.write('<h2>' + r['word'] + r['phonic'] + '</h2>' + '\n')
				fin.write(r['explain'] + '\n')

Remove all the test code and comment markers from the main flow:

from Dict import Dict

dict = Dict()
keyword = dict.inputSearch()
source_data = dict.getData(keyword)
title = dict.getTitle(source_data)
links = dict.getLinks(source_data)
words = dict.getWords(links)
address = dict.getAddress(links)
phonics = dict.getPhonics(source_data)
explain = dict.getExplain(address)
result = dict.makeDic(words, phonics, explain)
dict.makeResultHtmlFile(title, result)

Open the resulting dict.html file in an editor to take a look:

<h1>感情</h1>
<h2>感情 ㄍㄢˇ ㄑㄧㄥˊ</h2>
受外界刺激所產生的情緒。如：「他太感情用事了。」人與人之間的交情。如：「他們二人一向來往密切，感情很好。」觸動情感。《文選．劉伶．酒德頌》：「不覺寒暑之切肌，利欲之感情。」被別人的情意所感動，而表示感謝之情。如：「媽媽幫了他大忙之後，他感情不過，送了我們一籃水果。」《兒女英雄傳》第二四回：「伯父、伯母，今日此舉，不但我父母感情不盡，便是我何玉鳳也受惠無窮。」
<h2>感情戲 ㄍㄢˇ ㄑㄧㄥˊ ㄒㄧˋ</h2>
表現內心情感的戲劇。如親情、友情、愛情等。
<h2>感情作用 ㄍㄢˇ ㄑㄧㄥˊ ㄗㄨㄛˋ ㄩㄥˋ</h2>
不以事理的正誤曲直作判斷，僅憑心理的好惡而有所行動。如：「你可別因一時感情作用而鑄下大錯。」
<h2>感情用事 ㄍㄢˇ ㄑㄧㄥˊ ㄩㄥˋ ㄕˋ</h2>
憑個人好惡和一時的情感衝動處理事情。如：「你這樣感情用事，於事無補。」
<h2>傷感情 ㄕㄤ ㄍㄢˇ ㄑㄧㄥˊ</h2>
使原有的交情受到傷害。如：「你說這話就傷感情了！」

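Open dict.html directly in a browser and you will see the presentation the project set out to achieve.

You can of course try entering other words to look up as well.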
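The generated file contains only the headings and text, with no html, head, or charset declaration, so how a browser decodes it is left to guesswork. The sketch below is one possible way to build a complete document as a string; render_html is a hypothetical function, the p wrapper around each definition is an added assumption, and html.escape from the standard library guards against characters such as < or & appearing in the text.

```python
from html import escape


def render_html(title, entries):
    """Build a complete html document string from the list of entry dicts.

    render_html is a hypothetical name used only for this sketch.
    """
    lines = [
        '<!DOCTYPE html>',
        '<html lang="zh-Hant-TW">',
        '<head><meta charset="utf-8"><title>' + escape(title) + '</title></head>',
        '<body>',
        '<h1>' + escape(title) + '</h1>',
    ]
    for entry in entries:
        # h2 holds the word plus its Zhuyin reading, as in makeResultHtmlFile
        lines.append('<h2>' + escape(entry['word'] + entry['phonic']) + '</h2>')
        lines.append('<p>' + escape(entry['explain']) + '</p>')
    lines += ['</body>', '</html>']
    return '\n'.join(lines) + '\n'


page = render_html('感情', [{'word': '傷感情', 'phonic': ' ㄕㄤ ㄍㄢˇ ㄑㄧㄥˊ', 'explain': '使原有的交情受到傷害。'}])
print(page.splitlines()[4])
# <h1>感情</h1>
```

The returned string can then be written with open(filename, 'w', encoding='utf-8'), the same way the handout's method does.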

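20-7 Additional Notes

beautiful soup has many more methods, for finding elements as well as for navigating to parent elements and to the previous or next siblings at the same level. There are too many to cover, so this chapter picks only the ones the task at hand needs.

To make the program more complete, consider the following directions:

Add error handling
Add comments to the code
If a page has so many words that the results are paginated, how can they all be fetched in one run
How would you re-sort the output by word length or by Zhuyin order
Can the list formatting of a word's definitions be preserved
If the program runs slower than expected, how should it be improved

Beyond these there is of course plenty more to improve, and that is left for everyone to work on together.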
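As one small illustration of the sorting question, the list of dictionaries produced by makeDic can be reordered with sorted() and a key function. This is a sketch on hand-typed sample data taken from the results above (definitions omitted for brevity); sorting by Zhuyin order would need a different key and is not attempted here.

```python
# Sample entries in the same layout makeDic produces (definitions omitted)
entries = [
    {'word': '感情', 'phonic': ' ㄍㄢˇ ㄑㄧㄥˊ'},
    {'word': '感情戲', 'phonic': ' ㄍㄢˇ ㄑㄧㄥˊ ㄒㄧˋ'},
    {'word': '感情作用', 'phonic': ' ㄍㄢˇ ㄑㄧㄥˊ ㄗㄨㄛˋ ㄩㄥˋ'},
    {'word': '感情用事', 'phonic': ' ㄍㄢˇ ㄑㄧㄥˊ ㄩㄥˋ ㄕˋ'},
    {'word': '傷感情', 'phonic': ' ㄕㄤ ㄍㄢˇ ㄑㄧㄥˊ'},
]

# sorted() is stable, so words of equal length keep their original order
by_length = sorted(entries, key=lambda entry: len(entry['word']))

print([entry['word'] for entry in by_length])
# ['感情', '感情戲', '傷感情', '感情作用', '感情用事']
```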

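To sum up the workflow of the whole project, this chapter is meant to share how the project was conceived and developed:

Plan
Learn
Observe
Small tests
Integrate
Full test
Optimize
Add features

Some of these steps repeat in a loop until the whole program is finished.

Exercises

Let's optimize a few methods

find_all() can also search by attribute, and it accepts regular expressions, but the re module has to be imported: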

# Rewrite getLinks
import re  # import the module itself so the built-in compile() is not shadowed

	def getLinks(self, data):
		links = data.find_all('a', attrs={'href': re.compile('^dict')})
		return links
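To see what the ^ anchor in that pattern buys you, here is a small standard-library check on two hand-written href strings. The second string is a made-up example of a URL that merely contains dict somewhere inside it.

```python
import re

pattern = re.compile('^dict')

# The second href is a made-up example that contains "dict" but does not start with it
hrefs = ['dictView.jsp?ID=70156&q=1', '/search.jsp?md=1&word=dict']

# ^ anchors the match to the start of the string
anchored = [h for h in hrefs if pattern.search(h)]
# A plain substring test accepts both strings
substring = [h for h in hrefs if 'dict' in h]

print(anchored)
# ['dictView.jsp?ID=70156&q=1']
print(len(substring))
# 2
```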

The getExplain method has a nested loop; try to eliminate the inner loop:

# Rewrite getExplain
	def getExplain(self, address):
		explain = []
		for url in address:
			response = requests.get(url)
			soup = bs(response.text, 'html.parser')
			td = soup.find('td', attrs={'headers':'col4'})
			explain.append(td.getText())
		return explain

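Videos

Chapter 20: Web Crawlers, part one

Chapter 20: Web Crawlers, part two

Closing Remarks

This handout is drawing to a close, and making it this far is no small achievement. Although my own knowledge is limited, I did my best to complete all of the content, and I learned a great deal while writing it.

I find programming a lot of fun, but the field is vast. This handout can only get you started and make self-study possible. If you want to keep going down this road, you will need to study hard, absorb more knowledge, and build up your technical skills.

Thanks to my colleague 姵君 for her careful proofreading; this was my first time writing a handout, and much of it needed polishing.

Thanks to 社團法人臺灣視障協會 for its strong support of this course, which let me try teaching a 20-session course for the first time.

Thanks to the teaching assistants and advisors for their suggestions, which made the course better.

Thanks to the students for their steady, diligent effort. I hope everyone enjoyed the process of writing programs.

2021/12 Taipei

Last updated: 2021-12-17 14:02:17

From: 211.23.21.202

By: 特種兵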
