Creation of an annotated corpus of Old and Middle Hungarian court records and private correspondence

LANGUAGE RESOURCES AND EVALUATION (2018)

Journal

LANGUAGE RESOURCES AND EVALUATION

Volume 52, Issue 1, Pages 1-28

Publisher

SPRINGER

DOI: 10.1007/s10579-017-9393-8

Keywords

Historical corpus; Corpus annotation; Morphological analysis; PoS tagging; Middle Hungarian; Old Hungarian; Corpus query tool

Funding

Hungarian Scientific Research Fund (OTKA) [OTKA 81189]
Hungarian Scientific Research Fund [OTKA K 116217]

Abstract

The paper introduces a novel annotated corpus of Old and Middle Hungarian (16th to 18th centuries), the texts of which were selected to approximate the vernacular of the given historical periods as closely as possible. The corpus consists of testimonies of witnesses in trials and samples of private correspondence. The texts are analyzed morphologically, and each file also contains metadata intended to facilitate sociolinguistic research. The texts were segmented into clauses, manually normalized, and morphosyntactically annotated using an annotation system consisting of the PurePos PoS tagger and the Hungarian morphological analyzer HuMor, which was originally developed for Modern Hungarian but was adapted to analyze Old and Middle Hungarian morphological constructions. The automatically disambiguated morphological annotation was manually checked and corrected using a web-based manual disambiguation interface. The normalization process and the manual validation of the annotation required extensive teamwork and provided continuous feedback for the refinement of the computational morphology and the iterative retraining of the tagger's statistical models. The paper discusses some of the typical problems that occurred during the normalization procedure and their tentative solutions. We also describe the automatic annotation tools, the process of semi-automatic disambiguation, and the query interface, a special function of which also makes it possible to correct the annotation. The beta version of the first fully normalized and annotated historical corpus of Hungarian, which displays the original, normalized, and parsed versions of the selected texts, is freely accessible at http://tmk.nytud.hu/.
