Determining the structural properties of a natural language is essential for applications such as speech recognition and OCR. These properties can be analyzed in two ways: morphologically and statistically. Statistical analysis requires a corpus that is a representative sample of the language. The word n-gram frequencies of that corpus can be computed with suitable algorithms, and the frequencies of n-grams missing from it can be estimated with smoothing techniques. In this study, to compare smoothing techniques and apply them to Turkish, a corpus named TurCo was created....
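As a rough illustration of the idea of counting word n-grams and estimating unseen ones, here is a minimal sketch using add-one (Laplace) smoothing on bigrams. The abstract does not say which smoothing techniques were compared, so add-one is only an assumed stand-in; the function names and the toy token list are likewise hypothetical and are not taken from TurCo.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count word n-grams (stored as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def laplace_bigram_prob(w1, w2, unigrams, bigrams, vocab_size):
    """Add-one estimate of P(w2 | w1); unseen bigrams get a non-zero share."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[(w1,)] + vocab_size)


# Toy token list (hypothetical, for illustration only; not TurCo data).
tokens = "bir dil bir insan iki dil iki insan".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
vocab_size = len(unigrams)

# "bir dil" occurs in the toy text, "bir iki" does not.
seen = laplace_bigram_prob("bir", "dil", unigrams, bigrams, vocab_size)
unseen = laplace_bigram_prob("bir", "iki", unigrams, bigrams, vocab_size)
```

In this sketch the unseen bigram receives a small but non-zero probability, and the estimates for all words following "bir" still sum to one. More refined schemes redistribute probability mass differently, but the counting step stays the same.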
Models of natural languages and of language characteristics are widely used in computer science applications such as data security, language identification, spell checking, data compression, authorship attribution, and speech recognition. In this study, a large-scale corpus is created and used to discover the language characteristics of Turkish. Word-based and letter-based analyses are performed on this corpus to provide a basis for several NLP studies. In the author identification part, we used two different methods based on word n-grams to identify the author of an anonymous text. For 16 authors,...