Next, here are some common techniques for scraping web data with regular expressions.

　　Extracting the content between tags

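　　In the previous articles we covered urllib and requests, two modules for fetching network resources. Apart from JSON, what we fetch is HTML markup, so the usual task is to extract the content of tags. For example, to get the title of the Baidu home page: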

　　import re

　　import requests

　　url = "http://www.baidu.com/"

　　response = requests.get(url)

　　response.encoding='utf-8'

　　content = response.text

　　# Use findall with a regular expression; the group captures the text between <title> and </title>

　　title = re.findall(r'<title>(.*?)</title>', content)

　　print(title[0])
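
　　The same lookup can be tried offline against a fixed string. The following is a minimal sketch rather than part of the original example: the helper name `extract_title` and the sample page are made up for illustration. It uses re.search, which stops at the first match, and returns None when the page has no title, whereas `title[0]` would raise an IndexError on an empty list.

```python
import re

# Compiled once; re.I tolerates <TITLE>, re.S lets the title span several lines.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.I | re.S)


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    m = TITLE_RE.search(html)
    return m.group(1).strip() if m else None


# Made-up sample page, used only for illustration.
sample = "<html><head><TITLE>\n  Example page\n</TITLE></head><body></body></html>"
print(extract_title(sample))
print(extract_title("<p>no title here</p>"))
```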

　　Extracting hyperlink tags

　　import re

　　import requests

　　url = "http://www.baidu.com/"

　　response = requests.get(url)

　　response.encoding='utf-8'

　　content = response.text

　　# Define a regular expression that matches every hyperlink on the page

　　res = r"<a.*?href=.*?</a>"

　　urls = re.findall(res, content)

　　for u in urls:

　　    print(u)
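
　　One detail worth checking with this kind of pattern is how the dot treats line breaks. The sketch below runs the same expression on a made-up fragment (not taken from the Baidu page) in which the second link is split across two lines; the variable names are illustrative only.

```python
import re

# Made-up fragment: the second link is split across two lines.
html = '<a href="/news">News</a> <a class="nav"\n   href="/map">Map</a>'

pattern = r'<a.*?href=.*?</a>'

# Without re.S the dot stops at newlines, so the second link is missed.
single_line = re.findall(pattern, html)
# With re.S the dot also matches newlines, so both links are found.
multi_line = re.findall(pattern, html, re.S)

print(len(single_line), single_line)
print(len(multi_line), multi_line)
```

　　Note that `<a` also matches the start of tags such as `<abbr>`; writing `<a\b` is one way to restrict the match to anchor tags.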

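　　Of course, if we want the text inside the hyperlinks rather than the whole tag, we can still use a regular expression, this time with a capturing group, written as ().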

　　import re

　　import requests

　　url = "http://www.baidu.com/"

　　response = requests.get(url)

　　response.encoding='utf-8'

　　content = response.text

　　# Get the content between <a> and </a>

　　res = r'<a .*?>(.*?)</a>'

　　texts = re.findall(res, content, re.S|re.M)

　　for t in texts:

　　    print(t)

　　Run the script and observe the output.
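
　　Because the live page changes over time, the effect of groups is easier to see on a fixed string. The following sketch uses a made-up fragment with example.com addresses (an assumption, not output from Baidu). It shows that findall returns only the captured text when the pattern has one group, and a list of tuples when it has two.

```python
import re

# Made-up fragment, used only for illustration.
html = ('<a href="http://example.com/news">News</a>'
        '<a href="http://example.com/map">Map</a>')

# One group: findall returns just the captured link text.
texts = re.findall(r'<a\b[^>]*>(.*?)</a>', html, re.S)

# Two groups: findall returns (href, text) tuples.
pairs = re.findall(r'<a\b[^>]*href="([^"]*)"[^>]*>(.*?)</a>', html, re.S)

print(texts)
for href, text in pairs:
    print(href, text)
```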

　　Extracting attribute values from tags

　　The basic form of an HTML hyperlink is "<a href=URL>link text</a>". To extract the URL from it, proceed as follows:

　　import re

　　import requests

　　url = "http://www.baidu.com/"

　　response = requests.get(url)

　　response.encoding='utf-8'

　　content = response.text

　　# Define a regular expression that matches every hyperlink on the page

　　res = r"<a.*?href=.*?</a>"

　　urls = re.findall(res, content)

　　# Join all the hyperlinks into a single string

　　all_urls = '\n'.join(urls)

　　# Match an unquoted http URL right after href=, stopping at whitespace or >

　　res = r"(?<=href=)http:[^\s>]+"

　　# Find the URLs that match the rule

　　urls = re.findall(res, all_urls, re.I|re.S|re.M)

　　for url in urls:

　　    print(url)
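
　　The lookbehind pattern above only handles unquoted values such as href=http://..., while many pages quote their attributes. As a hedged alternative, the sketch below accepts double-quoted, single-quoted and unquoted values. The helper name `extract_hrefs` and the sample fragment are made up for illustration, and the pattern matches href in any tag, not only in <a>.

```python
import re

# Three alternatives: href="...", href='...' and unquoted href=...
HREF_RE = re.compile(r'''href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))''', re.I)


def extract_hrefs(html):
    """Return every href value found in html, whatever its quoting style."""
    # Only one of the three groups is non-empty for each match.
    return [a or b or c for a, b, c in HREF_RE.findall(html)]


# Made-up fragment mixing the three quoting styles.
sample = ('<a href=http://a.example/x name=t>A</a>'
          "<a href='http://b.example/y'>B</a>"
          '<a class="n" href="http://c.example/z">C</a>')
print(extract_hrefs(sample))
```

　　For anything beyond simple pages, an HTML parser is usually more reliable than regular expressions, since attribute order, spacing and nesting vary widely.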

　　Extracting the URL from image tags

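　　HTML inserts images with the img tag, whose basic form is "<img src=image URL />". To get the image URL we read the src attribute; the following example not only extracts the image link but also saves the image locally.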

　　import os

　　import re

　　import requests

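　　# An HTML img tag taken from the web; replace ... with the real image URL

　　content = '<img src="..." />'

　　# Use a regular expression to capture what follows src

　　m = re.match(r'<img.*?src="(.*?)"', content)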

　　print(m.group(1))

　　image_path = m.group(1)

　　# To download the image behind the extracted link, combine requests with a file write

　　response = requests.get(image_path)

　　# Get the raw bytes of the response body

　　result = response.content

　　# Derive the image file name

　　filename = image_path[image_path.rfind('%')+1:]

　　path = os.path.join(r'images', filename)

　　# Save the image locally

　　with open(path, 'wb') as wstream:

　　    wstream.write(result)

　　print('File download finished!')
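
　　Deriving the file name by slicing the URL depends on the exact shape of that URL. As a hedged alternative, the sketch below takes the name from the path part of the URL with urllib.parse and os.path.basename. The names `IMG_SRC_RE` and `image_filename`, the fallback name `image.bin` and the sample tag are all assumptions made for illustration; nothing is downloaded.

```python
import os
import re
from urllib.parse import unquote, urlsplit

# Captures the src value of an <img> tag, quoted or unquoted.
IMG_SRC_RE = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']?([^"\'\s>]+)', re.I)


def image_filename(image_url, default="image.bin"):
    """Derive a local file name from the path part of image_url."""
    # Drop the query string, decode %xx escapes, keep the last path segment.
    path = unquote(urlsplit(image_url).path)
    return os.path.basename(path) or default


# Made-up tag, used only for illustration.
tag = '<img alt="logo" src="http://img.example/static/pic%20one.png?v=3" />'
src = IMG_SRC_RE.search(tag).group(1)
print(src)
print(image_filename(src))
```

　　When saving, calling os.makedirs('images', exist_ok=True) before open() avoids a FileNotFoundError if the images directory does not exist yet.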

