Article

Creation of an annotated corpus of Old and Middle Hungarian court records and private correspondence

Journal

LANGUAGE RESOURCES AND EVALUATION
Volume 52, Issue 1, Pages 1-28

Publisher

SPRINGER
DOI: 10.1007/s10579-017-9393-8

Keywords

Historical corpus; Corpus annotation; Morphological analysis; PoS tagging; Middle Hungarian; Old Hungarian; Corpus query tool

Funding

  1. Hungarian Scientific Research Fund (OTKA) [OTKA 81189]
  2. OTKA
  3. Hungarian Scientific Research Fund [OTKA K 116217]

The paper introduces a novel annotated corpus of Old and Middle Hungarian (16th-18th centuries), whose texts were selected to approximate the vernacular of the given historical periods as closely as possible. The corpus consists of witness testimonies from trials and samples of private correspondence. The texts are analyzed morphologically, and each file also contains metadata that facilitates sociolinguistic research. The texts were segmented into clauses, manually normalized, and morphosyntactically annotated using an annotation system consisting of the PurePos PoS tagger and the Hungarian morphological analyzer HuMor, which was originally developed for Modern Hungarian but was adapted to analyze Old and Middle Hungarian morphological constructions. The automatically disambiguated morphological annotation was manually checked and corrected using an easy-to-use web-based disambiguation interface. The normalization process and the manual validation of the annotation required extensive teamwork and provided continuous feedback for refining the computational morphology and iteratively retraining the statistical models of the tagger. The paper discusses some of the typical problems that occurred during normalization and their tentative solutions. We also describe the automatic annotation tools, the process of semi-automatic disambiguation, and the query interface, a special function of which also makes it possible to correct the annotation. Displaying the original, the normalized, and the parsed versions of the selected texts, the beta version of the first fully normalized and annotated historical corpus of Hungarian is freely accessible at http://tmk.nytud.hu/.
