Language, Script, and Encoding Identification with String Kernel Classifiers

Canasai Kruengkrai, Virach Sornlertlamvanich, and Hitoshi Isahara

Abstract

This paper addresses language, script, and encoding (LSE) identification for written text using string kernel classifiers. We describe three increasingly efficient methods of string kernel computation: explicit mapping, brute-force matching, and suffix tree matching. To perform LSE identification, the string kernel is combined with two kernel classifiers: a kernelized centroid-based method and a support vector machine. We present experimental results on subsets of the UDHR (Universal Declaration of Human Rights) collection, comprising 10 LSE schemes used in India and 24 LSE schemes used in Africa.
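
To make the ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say which string kernel is used, so the sketch assumes a p-spectrum kernel over raw bytes (counting shared contiguous substrings of length p). It shows explicit mapping and brute-force matching, which should agree on every input, and a kernelized centroid classifier that assigns a text to the class whose feature-space centroid is nearest. Suffix tree matching and the support vector machine are omitted. All function names, the default p = 2, and the class labels are hypothetical.

```python
from collections import Counter
import math


def spectrum_features(text: bytes, p: int = 2) -> Counter:
    """Explicit mapping: count every contiguous length-p byte substring."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def kernel_explicit(s: bytes, t: bytes, p: int = 2) -> int:
    """Assumed p-spectrum kernel, computed as a dot product of count vectors."""
    fs, ft = spectrum_features(s, p), spectrum_features(t, p)
    if len(fs) > len(ft):
        fs, ft = ft, fs  # iterate over the smaller feature map
    return sum(count * ft[gram] for gram, count in fs.items())


def kernel_brute_force(s: bytes, t: bytes, p: int = 2) -> int:
    """Same kernel value, obtained by comparing all pairs of positions."""
    return sum(
        1
        for i in range(len(s) - p + 1)
        for j in range(len(t) - p + 1)
        if s[i:i + p] == t[j:j + p]
    )


def normalized_kernel(s: bytes, t: bytes, p: int = 2) -> float:
    """Cosine-normalized kernel, so that text length does not dominate."""
    denom = math.sqrt(kernel_explicit(s, s, p) * kernel_explicit(t, t, p))
    return kernel_explicit(s, t, p) / denom if denom else 0.0


def centroid_distance_sq(z: bytes, members: list, p: int = 2) -> float:
    """Squared feature-space distance from z to the centroid of members,
    expressed with kernel evaluations only."""
    n = len(members)
    cross = sum(normalized_kernel(z, x, p) for x in members) / n
    within = sum(
        normalized_kernel(x, y, p) for x in members for y in members
    ) / (n * n)
    return normalized_kernel(z, z, p) - 2.0 * cross + within


def classify(z: bytes, training: dict, p: int = 2) -> str:
    """Return the label whose class centroid is closest to z."""
    return min(training, key=lambda label: centroid_distance_sq(z, training[label], p))


if __name__ == "__main__":
    # Toy data with hypothetical labels; real LSE labels would name a
    # language, script, and encoding together.
    training = {"a": [b"aaaa", b"aaab"], "b": [b"bbbb", b"bbba"]}
    print(kernel_explicit(b"abab", b"ab"), kernel_brute_force(b"abab", b"ab"))
    print(classify(b"aaa", training))
```

Working on bytes rather than decoded characters is a deliberate choice in this sketch: the encoding is one of the things being identified, so the input cannot be decoded beforehand.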

Download: pdf, ps, demo and source code

