Friedrich Kemler

Hierarchical Clustering of Digital Texts

Date: Mon., 12 January 2015, 11:00
Venue: Institut für Kultur- und Geistesgeschichte Asiens, Seminar Room 1
Apostelgasse 23, 1030 Vienna
Organization: Bernhard Scheid


In this presentation, I introduce NLP (Natural Language Processing) and demonstrate its usefulness in a similarity analysis of texts, in this case the Taishō and Daozang canons. I discuss the conceptual and technical basis of NLP as well as some applications that are already well established in mainstream IT. To give an idea of what can be done with such techniques in the field of Asian languages, I present two simple ad hoc examples. The first is a trivial form of relation extraction from the Taishō canon using regular expressions. The second applies automated syntax analysis to add convenience punctuation to a Japanese text. In the main part, I present a method of TF/IDF (Term Frequency / Inverse Document Frequency) vectorization of the Taishō and Daozang, followed by the use of distance metrics and hierarchical clustering for similarity analysis. The mathematical concepts behind these methods, which were first applied to the analysis of genetic sequences, are also discussed briefly.
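To make the pipeline named in the abstract more concrete, here is a minimal sketch of TF/IDF vectorization, a distance metric, and hierarchical clustering, written in plain Python. It is not the speaker's implementation. The tokenization into character bigrams, the cosine distance, the average-linkage rule, and the four toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def char_bigrams(text):
    """Split a text into overlapping character bigrams (assumed tokenization)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def tfidf_vectors(docs):
    """Return one sparse TF/IDF vector (dict: term -> weight) per document."""
    counts = [Counter(char_bigrams(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            term: (freq / total) * math.log(n_docs / doc_freq[term])
            for term, freq in c.items()
        })
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors (assumed distance metric)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def average_linkage(vectors):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    n = len(vectors)
    dist = {(i, j): cosine_distance(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)}

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[min(i, j), max(i, j)] for i in a for j in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        pairs = [(linkage(a, b), a, b)
                 for idx, a in enumerate(clusters) for b in clusters[idx + 1:]]
        d, a, b = min(pairs)  # closest pair of clusters is merged next
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        merges.append((a, b, d))
    return merges

# Toy documents (invented for this sketch, not taken from either canon).
labels = ["A1", "A2", "B1", "B2"]
docs = [
    "mind lamp mind lamp mind",
    "lamp mind lamp mind lamp",
    "way name way name way",
    "name way name way name",
]

merges = average_linkage(tfidf_vectors(docs))
for a, b, d in merges:
    print([labels[i] for i in a], "+", [labels[i] for i in b], f"at distance {d:.3f}")
```

In this toy run the two "A" documents and the two "B" documents are merged within their own groups first, and the two groups are joined only in the last step. The merge history is the information a dendrogram displays. For corpora of realistic size, established libraries would be the more practical choice, for example scikit-learn's `TfidfVectorizer` and SciPy's `scipy.cluster.hierarchy.linkage`.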

Works of the early Chan tradition (taken from R. Sharf, Coming to Terms with Chinese Buddhism, 2002), combined with random texts (R) from the Buddhist and Daoist canons, in hierarchical clusters obtained by TF/IDF vectorization.

The illustration above shows how this method groups early Chan-related texts from the Taishō canon. To make the task harder for the program, these texts are mixed into a pool of random samples, 10 from the Taishō and 10 from the Daozang, each marked with an “R”. The result shows a tight grouping of Seng Chao's commentary on the Vimalakīrti Sūtra, the Treasure Store Treatise, and the Chaolun, with the Mushinron and the Shinjinmei also falling in the center group. The random samples divide cleanly into a Buddhist and a Daoist group, with one interesting exception: the one random sample that sneaks into the center group happens to be the “Record of the Transmission of the Lamp”, which seems to make perfect sense.


Dr. Friedrich Kemler studied Biology and Biochemistry at the University of Vienna. After a PhD thesis in Biomathematics and some postdoctoral work, he worked at Siemens for about 23 years, mostly on modeling and simulating complex systems. During a one-year sabbatical and on later occasions, he took courses in Japanese Studies at the University of Vienna, including bungo and kanbun.

