A Resource-Light Approach to Morpho-Syntactic Tagging by Anna Feldman

By Anna Feldman

While supervised corpus-based methods are highly accurate for various NLP tasks, including morphological tagging, they are difficult to port to other languages because they require resources that are expensive to create. As a result, many languages have no realistic prospect for morpho-syntactic annotation in the foreseeable future. The method presented in this book aims to overcome this problem by significantly limiting the necessary data and instead extrapolating the relevant information from another, related language. The approach has been tested on Catalan, Portuguese, and Russian. Although these languages are only relatively resource-poor, the same method can in principle be applied to any inflected language, as long as an annotated corpus of a related language is available. The time needed to adjust the system to a new language is a fraction of the time needed for systems with extensive, manually created resources: days instead of years. This book touches upon a number of topics: typology, morphology, corpus linguistics, contrastive linguistics, linguistic annotation, computational linguistics and Natural Language Processing (NLP). Researchers and students who are interested in these scientific areas, as well as in cross-lingual studies and applications, will greatly benefit from this work. Scholars and practitioners in computer science and linguistics are the prospective readers of this book.



Best study & teaching books

Lecture Notes on Complex Analysis

This book is based on lectures presented over many years to second and third year mathematics students in the Mathematics Departments at Bedford College, London, and King's College, London, as part of the BSc. and MSci. program. Its aim is to provide a gentle yet rigorous first course on complex analysis.

Intensive exposure experiences in second language learning

This volume brings together studies dealing with second language learning in contexts that provide intensive exposure to the target language. In doing so, it highlights the role of intensive exposure as a critical distinctive characteristic in the comparison of learning processes and outcomes from different learning contexts: naturalistic and foreign language instruction, stay abroad and at home, and extensive and intensive instruction programmes.

Additional resources for A Resource-Light Approach to Morpho-Syntactic Tagging

Sample text

The lexicon has three parts: a full-form lexicon, a suffix lexicon, and a default entry; each of the three parts covers a priori tag probabilities for each lexical entry. 22% for English (trained on 2M words).

Tagging inflected languages with neural networks

Němec (2004) uses a neural network approach to Czech. 1). Various context lengths were evaluated, but the best results were obtained using a left context of length 2 and a suffix of length 4. 71% accuracy.

2 Unsupervised methods

As mentioned above, the problem with using supervised models for tagging resource-poor languages is that supervised models assume the existence of a labeled training corpus.
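To make the three-part lexicon mentioned in the sample text more concrete, here is a minimal sketch of how such a lookup might work. The sample text does not say in what order the parts are consulted, so the backoff order (full form, then longest matching suffix, then default) is an assumption, and all names, entries, and probabilities below are hypothetical toy values rather than data from the book.

```python
from typing import Dict

TagProbs = Dict[str, float]

# Hypothetical toy data: a priori tag probabilities for each lexical entry.
FULL_FORM: Dict[str, TagProbs] = {
    "the": {"DET": 1.0},
    "walks": {"VERB": 0.7, "NOUN": 0.3},
}
SUFFIX: Dict[str, TagProbs] = {
    "ing": {"VERB": 0.8, "NOUN": 0.2},
    "ly": {"ADV": 0.9, "ADJ": 0.1},
}
DEFAULT: TagProbs = {"NOUN": 0.6, "VERB": 0.2, "ADJ": 0.2}


def lookup_tag_probs(word: str, max_suffix_len: int = 4) -> TagProbs:
    """Return a priori tag probabilities for a word.

    Assumed backoff order: full-form lexicon, then the longest
    matching suffix, then the default entry.
    """
    key = word.lower()
    if key in FULL_FORM:
        return FULL_FORM[key]
    # Try suffixes from longest to shortest, never the whole word.
    for n in range(min(max_suffix_len, len(key) - 1), 0, -1):
        suffix = key[-n:]
        if suffix in SUFFIX:
            return SUFFIX[suffix]
    return DEFAULT
```

With this toy data, a known word such as "walks" is answered from the full-form lexicon, an unknown word such as "running" falls back to the "ing" suffix entry, and a word matching nothing receives the default distribution.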

The MaxEnt tagger has been tried on Slovene as well. The performance was around 86%, as with the supervised TBL model. The MB tagger performs similarly to the MaxEnt tagger for Slovene. 71%. But Džeroski et al. (2000) report that training times for the MaxEnt and the RB tagger are unacceptably long (over a day for training), while the MB and TnT taggers are much more efficient. The next section describes a tagger which was designed for morphologically rich languages in general, and for Czech in particular.

The following sections describe in more detail several approaches to morphology and part-of-speech tagging that use minimal supervision.

Yarowsky and Wicentowski (2000)

Yarowsky and Wicentowski (2000) present an original algorithm for the nearly unsupervised induction of inflectional morphological analysis. They treat morphological analysis as an alignment task in a large corpus, combining four similarity measures based on expected frequency distributions, context, morphologically weighted Levenshtein distance, and an iteratively bootstrapped model of affixation and stem-change probabilities.
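One of the four similarity measures named above is a weighted Levenshtein distance. The sketch below illustrates the general idea of a weighted edit distance only; it is not the authors' implementation. The specific weights (vowel-for-vowel substitutions cost 0.5, everything else costs 1.0) and the function names are assumptions chosen for illustration.

```python
VOWELS = set("aeiou")


def sub_cost(a: str, b: str) -> float:
    """Substitution cost. The weights are illustrative assumptions."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5  # assumed: vowel changes are cheap (sing -> sang)
    return 1.0


def weighted_levenshtein(s: str, t: str, indel: float = 1.0) -> float:
    """Edit distance with weighted substitutions, using two rolling rows."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        curr = [i * indel]
        for j, b in enumerate(t, 1):
            curr.append(min(
                prev[j] + indel,               # delete a from s
                curr[j - 1] + indel,           # insert b into s
                prev[j - 1] + sub_cost(a, b),  # substitute or match
            ))
        prev = curr
    return prev[-1]
```

Under these assumed weights, a stem-vowel alternation such as "sing" and "sang" scores closer than a pair that differs by a consonant, which is the kind of preference a morphologically motivated weighting is meant to express.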

