Total word count: 486739
Number of posts: 1099
Average words per post: 442
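Wow, I never expected six years of blogging to add up to 480,000 words! Of course, the program has one flaw: it counts every English letter as a word. But I don't write much English here, so I doubt it skews the numbers much. The next step is to find a way to print it all out, since leaving it only on the web is not much of a safeguard.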
Below is the Python program I used. It's really quite simple. If you get it, you get it; if not, just ignore it, ha.
#coding: utf-8
from xml.etree import ElementTree
import re

#load and parse the Blogger export file
tree = ElementTree.parse('blog-02-28-2011.xml')
root = tree.getroot()
entries = root.findall('{http://www.w3.org/2005/Atom}entry') #have to use the fully qualified name
wordCount = 0
contentCount = 0
for entry in entries:
    category = entry.find('{http://www.w3.org/2005/Atom}category')
    categoryTerm = category.attrib['term'] if category is not None else ''
    if categoryTerm.find('#post') != -1: #the entry is a post entry
        #retrieve the content (a post with an empty body has no text)
        content = entry.find('{http://www.w3.org/2005/Atom}content').text or ''
        #remove the html tags and spaces
        content = re.sub('(<.*?>)', '', content).replace(' ', '')
        print(content)
        print(len(content))
        wordCount = wordCount + len(content)
        if len(content) != 0:
            contentCount = contentCount + 1
        print('-----------------')
print('Total word count is: {0}'.format(wordCount))
print('Total post count: {0}'.format(contentCount))
print('Words per post: {0}'.format(wordCount // contentCount))
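As for the flaw mentioned earlier, where every English letter counts as a word, here is a rough sketch of one possible fix. It is not part of the program I actually ran. The helper name count_words and the token pattern are my own assumptions: each Chinese character in the basic CJK range counts as one word, each run of English letters or digits counts as one word, and punctuation is skipped (the original len(content) counts punctuation too, so totals would differ a little).

```python
# -*- coding: utf-8 -*-
import re

# Assumption: a "word" is either one CJK ideograph (basic block U+4E00..U+9FFF)
# or a run of ASCII letters/digits, optionally followed by an apostrophe suffix
# such as "'t". Anything else (punctuation, whitespace) is not counted.
TOKEN = re.compile(u"[\u4e00-\u9fff]|[A-Za-z0-9]+(?:'[A-Za-z]+)?")

def count_words(text):
    """Hypothetical helper: count Chinese characters and English words alike as one word each."""
    return len(TOKEN.findall(text))
```

To try it, one could replace len(content) with count_words(content) in the loop above; the rest of the script would stay the same.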