Modern Poetry: Scraping the Foreign-Language Poems
Preface⌗
Data Format⌗
For how the data format is constructed, see the earlier post Modern poetry 现代诗数据库爬取过程 (the Modern Poetry Chinese-poetry database scraping write-up).
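For quick reference, here is a sketch of the two record shapes the scraper below produces. The field names come straight from the code; the example values are invented:

# Author record, one list per country (saved as <country>-author.json):
author_record = {
    "name": "Yeats",              # example value, parsed from the index page
    "src": "yeats",               # page path without the .htm suffix
    "id": "uuid3 of the name",
    "description": ""
}

# Poem record, one list per poet (saved as <name>.json):
poem_record = {
    "author": "Yeats",
    "title": "...",               # first non-empty line between <hr> tags
    "paragraphs": ["...", "..."],
    "id": "the author's uuid"
}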
Getting the Data⌗
The foreign-poem data in Modern poetry again comes from the foreign verse collection of the 中国诗歌库 (Chinese Poetry Library) site. There are other poetry sites out there, but copyright makes them awkward to scrape, so if you also want to contribute poem data, be sure to check the copyright situation first.
Scraping this site has only one real difficulty: encoding.
Because the foreign poems span many languages, choosing the right encoding matters a great deal. On top of that, the site's own data was already mojibake-ridden when it was originally collected, so the scraped text needs extra cleanup (replacing the garbled content).
For most of the languages on the site (the Western European ones), ISO-8859-15 works in the vast majority of cases for both fetching and saving the data. For Russian, plain utf-8 does the job. (Even gbk displays it correctly; a friendly federation indeed!)
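As a minimal sketch of that decision (the fetch_page helper is mine, not part of the original script), the whole trick is setting response.encoding before reading .text, then scrubbing the stray bytes:

import requests

def fetch_page(url, lang):
    # Hypothetical helper: Russian pages decode fine as UTF-8,
    # most Western European ones as ISO-8859-15.
    resp = requests.get(url)
    resp.encoding = 'utf-8' if lang == 'ru' else 'ISO-8859-15'
    # the site's source data already contains mojibake, so strip
    # the stray control bytes after decoding
    return resp.text.replace(u"\u0081", '').replace(u"\u008b", '').replace(u"\u008a", '')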
import uuid
import re
import requests
import json

requests.adapters.DEFAULT_RETRIES = 3
s = requests.session()
# requests keeps one proxy per scheme, so several "http" keys would collapse
# into one anyway; the alternates are kept here as a comment.
s.proxies = {"http": "115.159.31.195:8080"}
# alternates: 116.196.115.209:8080, 119.41.236.180:8010
s.keep_alive = False  # no-op in requests; the 'Connection: close' header below does the real work

link = 'https://www.shigeku.org/shiku/ws/ww/index.htm'
headers = {'Connection': 'close',
           "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3464.0 Safari/537.36"}

def findAll(regex, seq):
    # Collect every match of a compiled pattern, allowing overlaps: the next
    # search restarts one character past the previous start, so the <hr> that
    # closes one poem can also open the next one.
    resultlist = []
    pos = 0
    while True:
        result = regex.search(seq, pos)
        if result is None:
            break
        resultlist.append(seq[result.start():result.end()])
        pos = result.start() + 1
    return resultlist

def parse(List):
    # Drop the empty strings that splitting leaves behind.
    while '' in List:
        List.remove('')
    return List

def cleantxt(raw):
    # Keep only ASCII characters; used to extract the English country name.
    return re.sub(r'[^\x00-\x7f]', '', raw).strip()

def parseString(string):
    # Strip HTML tags character by character, then remove entity debris
    # (';', '&nbsp', '&quot') and the mojibake bytes the site itself contains.
    str_ = ''
    flag = 1
    for ele in string:
        if ele == "<":
            flag = 0
        elif ele == '>':
            flag = 1
            continue
        if flag == 1:
            str_ += ele
    str_ = (str_.replace('\r', '').replace(';', '').replace('&nbsp', '')
                .replace(u"\u0081", '').replace(u"\u008b", '')
                .replace(u"\u008a", '').replace('&quot', '').strip())
    return str_

def author():
    # Save the poets, grouped by country, into <country>-author.json files.
    print("Start!")
    html = s.get(link, headers=headers)
    if html.status_code == requests.codes.ok:
        txt = html.text
        authorCountry = re.findall('<p align=left>(.*?)</p>', txt, re.S)
        authorCountry = parse(authorCountry)
        authorList = re.findall('<div id="navcontainer">(.*?)</div>', txt, re.S)
        for i in range(0, 18):  # the index page lists 18 countries
            authorListFinal = []
            country = authorCountry[i]
            country = cleantxt(parseString(country))
            nameListPre = authorList[i + 2]  # the first two nav blocks are not country lists
            nameList = re.findall('<li id="navlistli1">(.*?)</li>', nameListPre, re.S)
            for k in nameList:
                name = parseString(k)
                src = re.findall('<a href="(.*?)"', k, re.S)
                src = src[0]
                # entries read like "... by Name (translator)"; keep only the name
                index = name.find("(")
                if index != -1:
                    name = name[name.find("by") + 3:index - 1]
                else:
                    name = name[name.find("by") + 3:]
                authorDict = {}
                idAuthor = uuid.uuid3(uuid.NAMESPACE_URL, name)
                authorDict['name'] = name
                authorDict['src'] = src.replace('.htm', '')
                authorDict['id'] = str(idAuthor)
                authorDict['description'] = ""
                authorListFinal.append(authorDict)
            print("Finish ", country)
            json.dump(authorListFinal, open(country + '-author.json', 'w'), ensure_ascii=False)
    print("Finish!")

def poem():
    # Rename the country file you want to process to author.json beforehand.
    authorPoemPre = json.load(open('author.json', 'r'))
    prefix = "https://www.shigeku.org/shiku/ws/ww"
    for i in range(3, len(authorPoemPre)):  # the start index doubles as a manual resume point
        poemList = []
        dictAuthor = authorPoemPre[i]
        src = dictAuthor['src'] + '.htm'
        poemHtml = s.get(prefix + '/' + src, headers=headers)
        print("Download finish!")
        poemHtml.encoding = 'ISO-8859-1'  # use 'utf-8' for Russian pages
        txt = poemHtml.text
        pattern = re.compile("<hr>(.*?)<hr>", re.S)
        tempHrList = findAll(pattern, txt)  # each poem sits between <hr> tags
        for m in tempHrList:
            poemDict = {"author": dictAuthor['name']}
            content = parse(parseString(m).split('\n'))
            for k in range(0, len(content)):
                content[k] = content[k].strip()
                if k > 0:
                    # strip the digits some pages use as line numbers
                    for a in range(0, 10):
                        content[k] = content[k].replace(str(a), '')
            content = parse(content)
            title = content[0]
            content = content[1:]
            # also dump the plain text for the word-cloud analysis later
            with open("content.txt", 'a', encoding="iso-8859-1") as fp:
                for k in content:
                    fp.write(k + " ")
            poemDict['title'] = title
            poemDict['paragraphs'] = content
            poemDict['id'] = dictAuthor['id']
            poemList.append(poemDict)
        print("Finish ", dictAuthor['name'])
        json.dump(poemList, open(dictAuthor['name'] + '.json', 'w', encoding="ISO-8859-1"), ensure_ascii=False)
    print("Finish!")

author()
poem()
The program first saves the poets into JSON files grouped by country, then reads the poet entries back from each country's JSON, scrapes that poet's poems, and names the output file after the poet. (Note: before fetching the poems, rename the JSON file of the country you want to author.json.)
To deal with the server refusing connections, the program sets proxies (free proxy IPs from 站大爷); if the run still gets interrupted, simply restart the program.
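Putting the two steps together, a possible driver looks like the sketch below. 'America-author.json' is just an example file name (use whichever country you scraped), and note that poem() restarts from its hard-coded start index, so bump that index before retrying to avoid re-downloading:

import shutil

author()                                            # writes one <country>-author.json per country
shutil.copy('America-author.json', 'author.json')   # example: process the American poets
while True:                                         # restart automatically on dropped connections
    try:
        poem()
        break
    except requests.exceptions.RequestException:
        print("Connection dropped, retrying...")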
Data Analysis⌗
Data analysis is, as always, indispensable. Since the languages here differ so much, I only did a word-cloud analysis for the British and American poets.
from wordcloud import WordCloud

def analyze(file):
    # cloud.txt holds the plain text collected from the English and American poets
    with open(file, encoding="iso-8859-15") as fp:
        text = fp.read()
    wordcloud = WordCloud(background_color=(255, 255, 255), width=1600, height=800).generate(text)
    image_produce = wordcloud.to_image()
    image_produce.save('cloud.png', quality=95, subsampling=0)
    image_produce.show()

analyze('cloud.txt')
Compared with Chinese, word-cloud analysis for English is straightforward: no word segmentation is needed, so the text can be fed in directly. The result is as follows:
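For contrast, here is a hedged sketch of what the Chinese case would need: segmentation (e.g. with jieba) plus a CJK-capable font for WordCloud. 'chinese.txt' and 'simhei.ttf' are placeholders, not files from this project:

import jieba
from wordcloud import WordCloud

with open('chinese.txt', encoding='utf-8') as fp:   # placeholder input file
    text = ' '.join(jieba.cut(fp.read()))           # insert spaces so WordCloud can tokenize
cloud = WordCloud(font_path='simhei.ttf',           # any font that covers CJK glyphs
                  width=1600, height=800).generate(text)
cloud.to_file('cloud-zh.png')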