I have been putting together some resources that I will eventually clean up and share alongside my Lua Tutorial Series, both to show what Lua can be used for and to work with language analysis. Frequency analysis and similar techniques can be very useful for learning, but they have limitations. This article covers some results of an analysis of one specific edition of the Daodejing, along with the methodology employed and the limitations of my approach.
Let’s start with the data. The copy I used has 5,303 characters, excluding punctuation and chapter headings, and those are made up of 806 distinct characters. Of these 806 characters, 332 appear only once. These numbers don’t mean much on their own, but we can dig deeper. Of the 806, the top 10 are as follows:
Char Count % Total
之 253 4.77%
不 244 4.60%
以 166 3.13%
其 141 2.66%
而 120 2.26%
為 117 2.21%
無 101 1.91%
者 92 1.74%
天 92 1.74%
人 88 1.66%
These 10 characters alone account for 26.66% of the text: more than 1 in every 4 characters on the page is one of them. If we add the next 5 characters, we reach 33.72% of the text.
Going further down the list, only 40 characters (I’m including the full list at the bottom of the article) make up just over half (50.16%) of the entire text. Returns diminish after that, but this shows the power of the analysis. Knowing only 375 characters (46.5% of the 806) is enough to recognize 90.02% of the text. With frequency analysis, we can work from real materials to determine which characters we need in order to understand the bulk of the text.
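The coverage figures above can be reproduced from the counts alone. The following is a minimal sketch in Python (not the Lua program used for this project); the function names `coverage` and `chars_needed` are made up for illustration, and the counts are copied from the first 40 rows of the table at the end of the article.

```python
# Counts of the 40 most frequent characters, copied in order from the
# table at the end of the article.
TOP_COUNTS = [
    253, 244, 166, 141, 120, 117, 101, 92, 92, 88,
    82, 81, 75, 70, 66, 58, 56, 52, 46, 46,
    45, 37, 36, 35, 34, 33, 32, 32, 32, 32,
    30, 30, 30, 28, 27, 26, 26, 23, 23, 23,
]
TOTAL_CHARS = 5303  # size of the text as stated in the article


def coverage(counts, n, total):
    """Percentage of the text covered by the n most frequent characters."""
    return 100 * sum(counts[:n]) / total


def chars_needed(counts, target_pct, total):
    """Smallest n whose cumulative coverage reaches target_pct, or None."""
    running = 0
    for n, count in enumerate(counts, start=1):
        running += count
        if 100 * running / total >= target_pct:
            return n
    return None  # the supplied counts never reach the target


print(f"{coverage(TOP_COUNTS, 10, TOTAL_CHARS):.2f}%")  # 26.66%
print(f"{coverage(TOP_COUNTS, 15, TOTAL_CHARS):.2f}%")  # 33.72%
print(chars_needed(TOP_COUNTS, 50, TOTAL_CHARS))        # 40
```

Because only the top 40 counts are included here, asking for a 90% target returns `None`; the full table would be needed for that.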
The methodology employed here is very simple, which is why it works well but also why it has limitations. The program reads the text one character at a time and increments that character’s entry in a table of counts. This approach works well for literary Chinese, where words are largely single characters, but it will not work well for modern Mandarin unless we only care about individual characters rather than words. We can account for this by using word lists to match multi-character words, and hope that the rare cases where two characters that function as independent words also happen to form a word together are smoothed out by sample size.
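The counting step described here can be sketched in a few lines. This is a Python illustration under my own assumptions, not the Lua code from the project: the name `count_characters` is hypothetical, punctuation is detected through Unicode categories, and the sample is just the well-known opening line of chapter 1.

```python
from collections import Counter
import unicodedata


def count_characters(text):
    """Count every character that is not whitespace or punctuation."""
    counts = Counter()
    for ch in text:
        # Unicode categories starting with "P" are punctuation and "Z"
        # are separators; str.isspace() also catches newlines and tabs.
        if ch.isspace() or unicodedata.category(ch)[0] in "PZ":
            continue
        counts[ch] += 1
    return counts


# Opening line of chapter 1, used as a tiny sample.
sample = "道可道，非常道。名可名，非常名。"
counts = count_characters(sample)
total = sum(counts.values())

# Print the most frequent characters with their share of the sample.
for ch, n in counts.most_common(3):
    print(ch, n, f"{100 * n / total:.2f}%")
```

The same loop scales to the whole text; only the input changes. Handling multi-character words would need a word list and a matching step on top of this, which the sketch does not attempt.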
Another thing my method does not take into account is the different uses of individual characters. Certain characters have multiple readings and multiple meanings, which this method makes no attempt to address. This is a much, much bigger can of worms, and it is extremely hard to address in a sane way at present. There are ways to approach it, but those methods are well beyond the scope of this article, and in fairness, if I knew how to do it in an easy way, I would be rich.
Another thing to be aware of is how to shape your data. Most data will need to be standardized and cleaned in order for these types of processes to work, and certain data will need to be scrubbed in a controlled way, which can take time and effort. For instance, I omitted all chapter headings, as they would add a large block of characters that are not directly connected with the text. Knowing what you’re looking for affects how you tweak your data and your process. What works for one purpose may taint the data for another purpose.
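As a rough illustration of this kind of cleaning, here is a Python sketch (again, not the Lua code from the project) that drops chapter headings before counting. The heading pattern is purely an assumption: it supposes each heading sits on its own line in the form 第一章, which may not match how any given source file marks its chapters.

```python
import re

# ASSUMPTION: headings sit on their own line and look like "第一章".
# The real source file may mark chapters differently.
HEADING = re.compile(r"^第[一二三四五六七八九十百]+章$")


def strip_headings(raw):
    """Drop heading lines and blank lines, keeping body lines in order."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or HEADING.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)


raw = "第一章\n道可道，非常道。\n\n第二章\n天下皆知美之為美，斯惡已。\n"
print(strip_headings(raw))
```

Keeping the rule in one named pattern makes it easy to change or switch off when the headings themselves are the thing being studied.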
Data munging, or data wrangling, is an old Perl pastime of mine. Knowing how to work with data is in itself an important skill in this process. Code pages and character encoding have a huge impact on your efforts. Depending on the library or system you employ for your analysis, character encoding can decide whether your data is gibberish or something you can work with. I used Lua 5.1 for my project, which required an external library for UTF-8 (a Unicode encoding). Had I used only standard Lua 5.1, whose string functions work on bytes rather than characters, the results would have been mostly gibberish.
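The difference between bytes and characters is easy to see in a small example. This Python sketch is only an illustration of the general problem, not of the Lua library used for the project: it shows what happens when UTF-8 text is walked one byte at a time instead of one character at a time.

```python
text = "道可道"
encoded = text.encode("utf-8")

print(len(text))     # 3 characters (code points)
print(len(encoded))  # 9 bytes, three per character in this sample

# Taking a single byte cuts a character into a piece that is not
# valid text on its own, so decoding it yields a replacement mark.
first_byte = encoded[:1]
print(first_byte.decode("utf-8", errors="replace"))

# Decoding the whole byte string first gives back real characters.
print(list(encoded.decode("utf-8")))  # ['道', '可', '道']
```

A counter that works on bytes would tally these fragments rather than characters, which is why a UTF-8 aware library matters for this kind of analysis.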
When I get my code cleaned up, I will drop it here for further reference. The code I used is intentionally tailored to fit my plan for the Lua tutorial series, so I have made some stylistic decisions to match what I plan to cover. This project is helping me test my road map for the rest of the Lua course. Leave a comment to let me know what you think about this article, whether you want to see more, or if anything needs clarification.
Full Character Dump:
Char Count % Total
之 253 4.77%
不 244 4.60%
以 166 3.13%
其 141 2.66%
而 120 2.26%
為 117 2.21%
無 101 1.91%
者 92 1.74%
天 92 1.74%
人 88 1.66%
有 82 1.55%
下 81 1.53%
道 75 1.41%
是 70 1.32%
故 66 1.25%
大 58 1.09%
知 56 1.06%
善 52 0.98%
若 46 0.87%
德 46 0.87%
于 45 0.85%
生 37 0.70%
物 36 0.68%
可 35 0.66%
能 34 0.64%
自 33 0.62%
得 32 0.60%
民 32 0.60%
聖 32 0.60%
則 32 0.60%
夫 30 0.57%
謂 30 0.57%
常 30 0.57%
國 28 0.53%
兮 27 0.51%
欲 26 0.49%
所 26 0.49%
名 23 0.43%
身 23 0.43%
曰 23 0.43%
吾 22 0.42%
將 22 0.42%
強 22 0.42%
用 21 0.40%
事 21 0.40%
言 21 0.40%
貴 21 0.40%
莫 20 0.38%
萬 20 0.38%
足 20 0.38%
行 19 0.36%
地 19 0.36%
上 19 0.36%
我 19 0.36%
成 18 0.34%
或 18 0.34%
死 18 0.34%
也 17 0.32%
勝 16 0.30%
失 16 0.30%
長 16 0.30%
此 16 0.30%
復 15 0.28%
一 15 0.28%
唯 15 0.28%
信 15 0.28%
處 15 0.28%
多 15 0.28%
明 14 0.26%
必 14 0.26%
見 14 0.26%
何 14 0.26%
與 14 0.26%
相 14 0.26%
亦 13 0.25%
取 13 0.25%
治 13 0.25%
難 13 0.25%
乃 12 0.23%
器 12 0.23%
然 12 0.23%
玄 12 0.23%
兵 12 0.23%
居 11 0.21%
久 11 0.21%
使 11 0.21%
矣 11 0.21%
易 11 0.21%
孰 11 0.21%
王 11 0.21%
守 11 0.21%
靜 11 0.21%
三 11 0.21%
谷 11 0.21%
柔 11 0.21%
歸 11 0.21%
弱 10 0.19%
非 10 0.19%
乎 10 0.19%
敢 10 0.19%
去 10 0.19%
小 10 0.19%
爭 10 0.19%
同 10 0.19%
利 10 0.19%
焉 10 0.19%
終 10 0.19%
心 10 0.19%
觀 9 0.17%
殺 9 0.17%
如 9 0.17%
子 9 0.17%
美 9 0.17%
甚 9 0.17%
執 9 0.17%
餘 9 0.17%
眾 9 0.17%
樂 8 0.15%
正 8 0.15%
古 8 0.15%
和 8 0.15%
盈 8 0.15%
朴 8 0.15%
后 8 0.15%
皆 8 0.15%
神 8 0.15%
病 8 0.15%
令 8 0.15%
固 8 0.15%
仁 8 0.15%
智 8 0.15%
損 8 0.15%
畏 8 0.15%
至 8 0.15%
先 8 0.15%
輕 8 0.15%
惡 7 0.13%
中 7 0.13%
母 7 0.13%
出 7 0.13%
已 7 0.13%
功 7 0.13%
獨 7 0.13%
未 7 0.13%
始 7 0.13%
聞 7 0.13%
慈 7 0.13%
哉 7 0.13%
作 7 0.13%
百 7 0.13%
果 6 0.11%
雖 6 0.11%
辱 6 0.11%
益 6 0.11%
抱 6 0.11%
日 6 0.11%
極 6 0.11%
棄 6 0.11%
且 6 0.11%
君 6 0.11%
恐 6 0.11%
希 6 0.11%
根 6 0.11%
重 6 0.11%
猶 6 0.11%
主 6 0.11%
恃 5 0.09%
左 5 0.09%
牝 5 0.09%
遠 5 0.09%
寡 5 0.09%
兩 5 0.09%
致 5 0.09%
貨 5 0.09%
食 5 0.09%
侯 5 0.09%
象 5 0.09%
教 5 0.09%
修 5 0.09%
虛 5 0.09%
厭 5 0.09%
家 5 0.09%
義 5 0.09%
愛 5 0.09%
安 5 0.09%
親 5 0.09%
容 5 0.09%
禮 5 0.09%
法 5 0.09%
既 5 0.09%
士 5 0.09%
驚 5 0.09%
動 5 0.09%
害 5 0.09%
厚 5 0.09%
殆 5 0.09%
止 5 0.09%
式 5 0.09%
門 5 0.09%
傷 5 0.09%
敗 4 0.08%
禍 4 0.08%
服 4 0.08%
老 4 0.08%
軍 4 0.08%
盜 4 0.08%
學 4 0.08%
及 4 0.08%
少 4 0.08%
悶 4 0.08%
昏 4 0.08%
祥 4 0.08%
味 4 0.08%
微 4 0.08%
師 4 0.08%
救 4 0.08%
姓 4 0.08%
妙 4 0.08%
右 4 0.08%
入 4 0.08%
反 4 0.08%
怨 4 0.08%
徒 4 0.08%
似 4 0.08%
堅 4 0.08%
畜 4 0.08%
公 4 0.08%
十 4 0.08%
早 4 0.08%
司 4 0.08%
絕 4 0.08%
保 4 0.08%
恍 4 0.08%
繩 4 0.08%
敵 4 0.08%
察 4 0.08%
過 4 0.08%
亂 4 0.08%
清 4 0.08%
惚 4 0.08%
尚 4 0.08%
形 4 0.08%
奇 4 0.08%
存 4 0.08%
寵 4 0.08%
勇 4 0.08%
亡 4 0.08%
勿 4 0.08%
光 4 0.08%
患 4 0.08%
白 3 0.06%
全 3 0.06%
譽 3 0.06%
從 3 0.06%
建 3 0.06%
命 3 0.06%
鄉 3 0.06%
福 3 0.06%
深 3 0.06%
直 3 0.06%
次 3 0.06%
儉 3 0.06%
氣 3 0.06%
歙 3 0.06%
躁 3 0.06%
開 3 0.06%
退 3 0.06%
滋 3 0.06%
在 3 0.06%
玉 3 0.06%
四 3 0.06%
政 3 0.06%
尊 3 0.06%
兌 3 0.06%
太 3 0.06%
沖 3 0.06%
愈 3 0.06%
賤 3 0.06%
舍 3 0.06%
銳 3 0.06%
喪 3 0.06%
細 3 0.06%
徐 3 0.06%
賊 3 0.06%
音 3 0.06%
志 3 0.06%
合 3 0.06%
今 3 0.06%
肖 3 0.06%
俗 3 0.06%
解 3 0.06%
木 3 0.06%
凶 3 0.06%
積 3 0.06%
夷 3 0.06%
客 3 0.06%
混 3 0.06%
辯 3 0.06%
伐 3 0.06%
聲 3 0.06%
勤 3 0.06%
視 3 0.06%
前 3 0.06%
彌 3 0.06%
二 3 0.06%
目 3 0.06%
矜 3 0.06%
賢 3 0.06%
彼 3 0.06%
往 3 0.06%
精 3 0.06%
嬰 3 0.06%
廢 3 0.06%
巧 3 0.06%
馬 3 0.06%
後 3 0.06%
彰 3 0.06%
海 3 0.06%
化 3 0.06%
惟 3 0.06%
好 3 0.06%
閉 3 0.06%
遺 3 0.06%
富 3 0.06%
代 3 0.06%
戰 3 0.06%
真 3 0.06%
廣 3 0.06%
持 3 0.06%
淵 3 0.06%
五 3 0.06%
高 3 0.06%
狀 3 0.06%
愚 3 0.06%
私 3 0.06%
寶 3 0.06%
於 3 0.06%
缺 3 0.06%
當 3 0.06%
隨 3 0.06%
立 3 0.06%
水 3 0.06%
識 3 0.06%
兒 3 0.06%
進 3 0.06%
離 3 0.06%
割 2 0.04%
間 2 0.04%
云 2 0.04%
起 2 0.04%
孝 2 0.04%
笑 2 0.04%
甲 2 0.04%
邪 2 0.04%
宰 2 0.04%
嗇 2 0.04%
斯 2 0.04%
屈 2 0.04%
甘 2 0.04%
耳 2 0.04%
芻 2 0.04%
制 2 0.04%
戶 2 0.04%
稱 2 0.04%
辭 2 0.04%
孤 2 0.04%
朝 2 0.04%
克 2 0.04%
疏 2 0.04%
琭 2 0.04%
淡 2 0.04%
口 2 0.04%
被 2 0.04%
沌 2 0.04%
兕 2 0.04%
濁 2 0.04%
又 2 0.04%
馳 2 0.04%
牖 2 0.04%
異 2 0.04%
沒 2 0.04%
迷 2 0.04%
載 2 0.04%
來 2 0.04%
新 2 0.04%
張 2 0.04%
攘 2 0.04%
臂 2 0.04%
搏 2 0.04%
靈 2 0.04%
數 2 0.04%
寧 2 0.04%
薄 2 0.04%
宗 2 0.04%
脆 2 0.04%
滅 2 0.04%
臣 2 0.04%
遂 2 0.04%
方 2 0.04%
剛 2 0.04%
昭 2 0.04%
谿 2 0.04%
結 2 0.04%
鄰 2 0.04%
紛 2 0.04%
實 2 0.04%
芸 2 0.04%
塞 2 0.04%
狗 2 0.04%
珞 2 0.04%
江 2 0.04%
逝 2 0.04%
驕 2 0.04%
匠 2 0.04%
甫 2 0.04%
塵 2 0.04%
昧 2 0.04%
騁 2 0.04%
博 2 0.04%
恢 2 0.04%
散 2 0.04%
本 2 0.04%
各 2 0.04%
曲 2 0.04%
幾 2 0.04%
契 2 0.04%
腹 2 0.04%
哀 2 0.04%
几 2 0.04%
資 2 0.04%
交 2 0.04%
室 2 0.04%
淳 2 0.04%
鎮 2 0.04%
飢 2 0.04%
謀 2 0.04%
聽 2 0.04%
脫 2 0.04%
威 2 0.04%
窮 2 0.04%
挫 2 0.04%
求 2 0.04%
補 2 0.04%
綿 2 0.04%
己 2 0.04%
牡 2 0.04%
虎 2 0.04%
骨 2 0.04%
弗 2 0.04%
養 2 0.04%
受 2 0.04%
除 2 0.04%
忠 2 0.04%
奉 2 0.04%
應 2 0.04%
首 2 0.04%
扔 2 0.04%
文 2 0.04%
斲 2 0.04%
力 2 0.04%
熙 2 0.04%
川 2 0.04%
稽 2 0.04%
奈 2 0.04%
乘 2 0.04%
鬼 2 0.04%
壯 2 0.04%
華 2 0.04%
報 2 0.04%
孩 2 0.04%
雌 2 0.04%
榮 2 0.04%
咎 2 0.04%
兆 2 0.04%
六 1 0.02%
迎 1 0.02%
佐 1 0.02%
孫 1 0.02%
官 1 0.02%
郊 1 0.02%
田 1 0.02%
丈 1 0.02%
素 1 0.02%
夸 1 0.02%
覽 1 0.02%
金 1 0.02%
短 1 0.02%
噓 1 0.02%
徙 1 0.02%
涉 1 0.02%
跡 1 0.02%
襲 1 0.02%
佳 1 0.02%
冬 1 0.02%
雞 1 0.02%
悲 1 0.02%
泣 1 0.02%
羸 1 0.02%
晚 1 0.02%
儼 1 0.02%
偏 1 0.02%
詰 1 0.02%
狎 1 0.02%
輟 1 0.02%
享 1 0.02%
投 1 0.02%
曠 1 0.02%
台 1 0.02%
慧 1 0.02%
聾 1 0.02%
伯 1 0.02%
降 1 0.02%
湛 1 0.02%
跨 1 0.02%
奢 1 0.02%
奧 1 0.02%
勢 1 0.02%
興 1 0.02%
營 1 0.02%
拙 1 0.02%
祭 1 0.02%
穀 1 0.02%
號 1 0.02%
埴 1 0.02%
寥 1 0.02%
市 1 0.02%
頑 1 0.02%
孔 1 0.02%
恬 1 0.02%
土 1 0.02%
妨 1 0.02%
蔽 1 0.02%
輜 1 0.02%
窪 1 0.02%
貧 1 0.02%
駟 1 0.02%
黑 1 0.02%
毒 1 0.02%
理 1 0.02%
御 1 0.02%
阿 1 0.02%
托 1 0.02%
發 1 0.02%
什 1 0.02%
畋 1 0.02%
卻 1 0.02%
牢 1 0.02%
策 1 0.02%
悠 1 0.02%
約 1 0.02%
衣 1 0.02%
刃 1 0.02%
飉 1 0.02%
窈 1 0.02%
隳 1 0.02%
徼 1 0.02%
盲 1 0.02%
閱 1 0.02%
荒 1 0.02%
免 1 0.02%
枯 1 0.02%
帶 1 0.02%
枉 1 0.02%
滿 1 0.02%
惑 1 0.02%
攫 1 0.02%
泊 1 0.02%
劌 1 0.02%
外 1 0.02%
時 1 0.02%
超 1 0.02%
狂 1 0.02%
配 1 0.02%
餌 1 0.02%
路 1 0.02%
楗 1 0.02%
魄 1 0.02%
蕪 1 0.02%
並 1 0.02%
棘 1 0.02%
平 1 0.02%
燕 1 0.02%
謫 1 0.02%
籌 1 0.02%
贅 1 0.02%
風 1 0.02%
舟 1 0.02%
登 1 0.02%
轍 1 0.02%
瑕 1 0.02%
皦 1 0.02%
壽 1 0.02%
紀 1 0.02%
侮 1 0.02%
魚 1 0.02%
荊 1 0.02%
角 1 0.02%
隱 1 0.02%
鄙 1 0.02%
歟 1 0.02%
柢 1 0.02%
蘥 1 0.02%
況 1 0.02%
慎 1 0.02%
年 1 0.02%
拔 1 0.02%
獵 1 0.02%
央 1 0.02%
豫 1 0.02%
輻 1 0.02%
釋 1 0.02%
春 1 0.02%
基 1 0.02%
普 1 0.02%
晦 1 0.02%
澹 1 0.02%
昔 1 0.02%
泰 1 0.02%
抑 1 0.02%
徹 1 0.02%
偽 1 0.02%
字 1 0.02%
寂 1 0.02%
舉 1 0.02%
活 1 0.02%
露 1 0.02%
赤 1 0.02%
武 1 0.02%
通 1 0.02%
還 1 0.02%
怒 1 0.02%
纇 1 0.02%
裂 1 0.02%
寸 1 0.02%
含 1 0.02%
尺 1 0.02%
衛 1 0.02%
獸 1 0.02%
冰 1 0.02%
關 1 0.02%
嗄 1 0.02%
螫 1 0.02%
渝 1 0.02%
順 1 0.02%
糞 1 0.02%
鳥 1 0.02%
誠 1 0.02%
抗 1 0.02%
周 1 0.02%
劍 1 0.02%
臺 1 0.02%
共 1 0.02%
企 1 0.02%
推 1 0.02%
繕 1 0.02%
采 1 0.02%
石 1 0.02%
敝 1 0.02%
褐 1 0.02%
飲 1 0.02%
攝 1 0.02%
祀 1 0.02%
飄 1 0.02%
橐 1 0.02%
財 1 0.02%
憂 1 0.02%
召 1 0.02%
質 1 0.02%
豈 1 0.02%
網 1 0.02%
忌 1 0.02%
吹 1 0.02%
鑿 1 0.02%
烹 1 0.02%
鮮 1 0.02%
握 1 0.02%
圖 1 0.02%
達 1 0.02%
諾 1 0.02%
泮 1 0.02%
措 1 0.02%
毫 1 0.02%
罪 1 0.02%
兼 1 0.02%
熱 1 0.02%
弊 1 0.02%
宜 1 0.02%
流 1 0.02%
吉 1 0.02%
竭 1 0.02%
璧 1 0.02%
坐 1 0.02%
輔 1 0.02%
妄 1 0.02%
費 1 0.02%
里 1 0.02%
千 1 0.02%
梁 1 0.02%
色 1 0.02%
拜 1 0.02%
渙 1 0.02%
據 1 0.02%
廉 1 0.02%
耀 1 0.02%
闔 1 0.02%
訥 1 0.02%
帝 1 0.02%
譬 1 0.02%
層 1 0.02%
累 1 0.02%
蹶 1 0.02%
伏 1 0.02%
妖 1 0.02%
倚 1 0.02%
示 1 0.02%
懼 1 0.02%
稷 1 0.02%
犬 1 0.02%
偷 1 0.02%
望 1 0.02%
隅 1 0.02%
均 1 0.02%
改 1 0.02%
施 1 0.02%
豐 1 0.02%
埏 1 0.02%
諱 1 0.02%
貸 1 0.02%
陰 1 0.02%
輿 1 0.02%
賓 1 0.02%
九 1 0.02%
負 1 0.02%
陽 1 0.02%
驟 1 0.02%
域 1 0.02%
陳 1 0.02%
蓋 1 0.02%
篤 1 0.02%
比 1 0.02%
誰 1 0.02%
雨 1 0.02%
寄 1 0.02%
伎 1 0.02%
傾 1 0.02%
肆 1 0.02%
滌 1 0.02%
冥 1 0.02%
疵 1 0.02%
歇 1 0.02%
雄 1 0.02%
倍 1 0.02%
濟 1 0.02%
育 1 0.02%
習 1 0.02%
置 1 0.02%
猛 1 0.02%
倉 1 0.02%
經 1 0.02%
蟲 1 0.02%
車 1 0.02%
父 1 0.02%
敦 1 0.02%
爪 1 0.02%
草 1 0.02%
稅 1 0.02%
轂 1 0.02%
折 1 0.02%
覆 1 0.02%
貞 1 0.02%
專 1 0.02%
弓 1 0.02%
攻 1 0.02%
藏 1 0.02%
手 1 0.02%
屬 1 0.02%
懷 1 0.02%
加 1 0.02%
要 1 0.02%
殃 1 0.02%
徑 1 0.02%
注 1 0.02%
熟 1 0.02%
堂 1 0.02%
揣 1 0.02%
蒞 1 0.02%
尤 1 0.02%
槁 1 0.02%
爽 1 0.02%
寒 1 0.02%
末 1 0.02%
忒 1 0.02%
朘 1 0.02%
拱 1 0.02%
垢 1 0.02%
責 1 0.02%
筋 1 0.02%
介 1 0.02%
氾 1 0.02%
社 1 0.02%
走 1 0.02%
戎 1 0.02%
遇 1 0.02%
窺 1 0.02%
渾 1 0.02%
耶 1 0.02%