Preface

Data Format

For how the data format is constructed, see the earlier post on the Modern poetry database scraping process.

Getting the Data

The foreign-poem portion of Modern poetry still comes from the foreign verse collections of the Chinese Poetry Library (中国诗歌库). There are other poetry sites online, but copyright issues make them hard to scrape, so if you want to contribute poem data, be sure to check the copyright situation first.

Scraping this site has only one real difficulty: encoding.

Because the foreign poems span many languages, choosing the right encoding is critical. On top of that, the site's data already contained mojibake when it was collected, so the scraped text needs extra cleanup (replacing the garbled content).

For most of the languages on the site (the Western European ones), ISO-8859-15 usually works for both scraping and saving the data. For Russian, just use utf-8. (Even gbk renders it correctly; truly a friendly federation!)
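As a minimal sketch of that per-language encoding choice (`decode_page` and the language check are illustrative names, not part of the scraper below):

```python
# Sketch of the encoding rule described above: utf-8 for Russian pages,
# ISO-8859-15 for the Western European ones.
def decode_page(raw: bytes, language: str) -> str:
    encoding = "utf-8" if language == "russian" else "iso-8859-15"
    # errors="replace" keeps the run alive when the source itself is garbled.
    return raw.decode(encoding, errors="replace")

print(decode_page("café".encode("iso-8859-15"), "french"))  # café
```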

import uuid
import re
import requests
import json

requests.adapters.DEFAULT_RETRIES = 3
s = requests.session()
# Note: a dict keeps only the last value for a duplicated key, so listing several
# "http" entries in one dict silently discards all but the final proxy.
s.proxies = {"http": "http://119.41.236.180:8010"}
s.keep_alive = False
link = 'https://www.shigeku.org/shiku/ws/ww/index.htm'
headers = { 'Connection': 'close',"user-agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3464.0 Safari/537.36"}

def findAll(regex, seq):
    # Like re.findall, but advances pos only one character past each match's
    # start, so matches that share a delimiter overlap instead of being skipped.
    resultlist = []
    pos = 0
    while True:
        result = regex.search(seq, pos)
        if result is None:
            break
        resultlist.append(seq[result.start():result.end()])
        pos = result.start() + 1
    return resultlist
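A small sketch of why the custom findAll is needed: consecutive poem sections share a single `<hr>` delimiter, which a plain `re.findall` would consume. A lookahead pattern (an alternative, not what the scraper uses) produces the same overlapping matches:

```python
import re

# The closing <hr> is matched inside a lookahead, so it stays available
# as the opening delimiter of the next section.
pattern = re.compile(r"<hr>(?=(.*?<hr>))", re.S)
print(["<hr>" + body for body in pattern.findall("<hr>A<hr>B<hr>")])
# ['<hr>A<hr>', '<hr>B<hr>']
```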

def parse(List):
    # Drop the empty strings left over from splitting.
    while '' in List:
        List.remove('')
    return List


def cleantxt(raw):
    # Keep only ASCII characters (the country labels otherwise contain mojibake).
    return re.sub(r'[^\x00-\x7f]', '', raw).strip()

def parseString(string):
    # Strip HTML tags character by character, then remove the HTML entities
    # and stray control characters present in the source pages.
    str_ = ''
    flag = 1
    for ele in string:
        if ele == "<":
            flag = 0
        elif ele == '>':
            flag = 1
            continue
        if flag == 1:
            str_ += ele

    str_ = str_.replace('\r','').replace(";",'').replace("&nbsp",'').replace(u"\u0081",'').replace(u"\u008b",'').replace(u"\u008a",'').replace("&quot",'').strip()
    return str_
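As a quick check of what parseString produces, its tag stripping can be approximated with a regex; `strip_tags` below is an illustrative stand-in, not used by the scraper:

```python
import re

# Regex sketch of parseString's behavior: drop anything between '<' and '>',
# then remove the leftover entity fragments, as the character loop does.
def strip_tags(fragment: str) -> str:
    text = re.sub(r"<[^>]*>", "", fragment)
    return text.replace("&nbsp", "").replace("&quot", "").replace(";", "").strip()

print(strip_tags('<li id="navlistli1"><a href="frost.htm">by Robert Frost</a></li>'))
# by Robert Frost
```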

def author():
    print("Start!")
    html = s.get(link, headers=headers)
    if(html.status_code == requests.codes.ok):
        txt = html.text
        authorCountry = re.findall('<p align=left>(.*?)</p>', txt, re.S)
        authorCountry = parse(authorCountry)
        authorList = re.findall('<div id="navcontainer">(.*?)</div>', txt, re.S)
        # 18 country sections on the index page; the first two matched
        # navcontainer divs are navigation, so author lists start at i + 2.
        for i in range(0, 18):
            authorListFinal = []
            country = authorCountry[i]
            country = cleantxt(parseString(country))
            nameListPre = authorList[i+2]
            nameList = re.findall('<li id="navlistli1">(.*?)</li>', nameListPre, re.S)
            for k in nameList:
                name = parseString(k)
                src = re.findall('<a href="(.*?)"', k, re.S)
                src = src[0]
                index = name.find("(")
                if index != -1:
                    name = name[name.find("by")+3:index-1]
                else:
                    name = name[name.find("by")+3:]
                authorDict = {}
                idAuthor = uuid.uuid3(uuid.NAMESPACE_URL, name)
                authorDict['name'] = name
                authorDict['src'] = src.replace('.htm','')
                authorDict['id'] = str(idAuthor)
                authorDict['description'] = ""
                authorListFinal.append(authorDict)

            print("Finish ", country)
            json.dump(authorListFinal,open(country + '-author.json','w'), ensure_ascii=False)

    print("Finish!")

def poem():
    authorPoemPre = json.load(open('author.json', 'r'))
    prefix = "https://www.shigeku.org/shiku/ws/ww"
    # Starts at index 3, presumably resuming an interrupted run; use 0 for a full pass.
    for i in range(3,len(authorPoemPre)):
        poemList = []
        dictAuthor = authorPoemPre[i]
        src = dictAuthor['src'] + '.htm'
        poemHtml = s.get(prefix + '/' + src, headers=headers)
        print("Download finish!")
        poemHtml.encoding = 'ISO-8859-1'
        txt = poemHtml.text
        pattern = re.compile("<hr>(.*?)<hr>",re.S)
        tempHrList = findAll(pattern, txt)
        for m in tempHrList:
            poem = {"author":dictAuthor['name']}

            content = parse(parseString(m).split('\n'))
            for k in range(0,len(content)):
                content[k] = content[k].strip()
                if k > 0:
                    for a in range(0,10):
                        content[k] = content[k].replace(str(a),'')
            content = parse(content)
            title = content[0]
            content = content[1:]
            with open("content.txt", 'a',encoding="iso-8859-1") as fp:
                for k in content:
                    fp.write(k + " ")

            poem['title'] = title
            poem['paragraphs'] = content
            poem['id'] = dictAuthor['id']
            poemList.append(poem)

        print("Finish ",dictAuthor['name'])
        json.dump(poemList,open(dictAuthor['name'] + '.json','w',encoding="ISO-8859-1"), ensure_ascii=False)

    print("Finish!")

author()
poem()

The program first saves the poets into JSON files grouped by country, then reads each poet's info from the country JSON, scrapes that poet's poems, and names the output file after the poet. (Note: when you want to fetch the poems, rename the corresponding country's JSON file to author.json.)
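For reference, each record in a country JSON file (the one you rename to author.json) carries the four fields written by author(); the poet name and src value below are made-up examples:

```python
import json
import uuid

# Illustrative record with the fields author() writes; values are examples only.
record = {
    "name": "Robert Frost",
    "src": "frost",  # relative page name with the .htm suffix stripped
    "id": str(uuid.uuid3(uuid.NAMESPACE_URL, "Robert Frost")),
    "description": "",
}
print(json.dumps([record], ensure_ascii=False))
```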

To cope with the server refusing connections, the program sets proxies (from 站大爷, a free proxy IP site); if it still gets interrupted, simply restart the program.
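Instead of restarting by hand, the session itself can retry with backoff. A sketch using requests' adapter mechanism (the retry counts and status list are arbitrary choices here, and no proxy is configured):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    # Retry failed requests up to 3 times with exponential backoff,
    # including on common transient server errors.
    retry = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    session = requests.Session()
    session.mount("http://", HTTPAdapter(max_retries=retry))
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session
```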

Data Analysis

Data analysis is of course indispensable. Since the languages differ so much, I only ran a word-cloud analysis on the British and American poets.

from wordcloud import WordCloud

def analyze(file):
    with open(file,encoding="iso-8859-15") as fp:
        text = fp.read()
        wordcloud = WordCloud(background_color=(255,255,255), width=1600, height=800).generate(text)
        image_produce = wordcloud.to_image()
        image_produce.save('cloud.png',quality=95,subsampling=0)
        image_produce.show()

analyze('cloud.txt')

Compared with Chinese, word-cloud analysis of English is straightforward: no word segmentation is needed, you can analyze the text directly. The result is as follows: