Specification for lemmatizer
lamerlemma
A simple lemmatizer. It can be used with predefined lemmatizers in binary format or with text files that contain a full-form lexicon with optional grammatical information. It is also possible to create binary files from the specified text files for more efficient, repeated use.
By default, the dictionary text file format consists of 2 to 4 tab-separated columns with the following meaning:
- word form - (required) the inflected word form, may be repeated,
- lemma - (required) the base form, may be repeated,
- part-of-speech tag - (optional) a single part-of-speech tag followed by optional tag-related features,
- morphological features - (optional) zero or more morphological features.
The simplest format consists of two columns containing only word forms and lemmas (tab as the primary separator):
Ala\tAl
Ala\tAla
Alego\tAl
Alę\tAla
Aly\tAla
ma\tmieć
ma\tmój
The same with part-of-speech tags and some morphological features (single space as secondary separator):
Ala\tAl\tsubst\tcase=acc gender=m1 number=sg
Ala\tAla\tsubst\tcase=nom gender=f number=sg
ma\tmieć\tverb\tnumber=sg aspect=imperf person=ter tense=fin
ma\tmój\tadj\tcase=nom gender=f number=sg degree=pos
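To make the column layout concrete, the following is a minimal Python sketch of how such a text lexicon could be read. It is not part of lamerlemma: the function name `parse_lexicon` and the returned data structure are hypothetical and chosen only for illustration. The part-of-speech column is kept as a raw string, and features are assumed to have the `key=value` shape shown in the example above.

```python
from collections import defaultdict


def parse_lexicon(lines, primary="\t", secondary=" "):
    """Map each word form to a list of (lemma, pos, features) analyses.

    Hypothetical helper, not lamerlemma's own reader.
    """
    lexicon = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        columns = line.split(primary)
        if not 2 <= len(columns) <= 4:
            raise ValueError(f"expected 2 to 4 columns: {line!r}")
        form, lemma = columns[0], columns[1]
        # Third column (optional): part-of-speech tag, kept as a raw string.
        pos = columns[2] if len(columns) > 2 else None
        # Fourth column (optional): features split on the secondary separator.
        features = {}
        if len(columns) > 3:
            for item in columns[3].split(secondary):
                if not item:
                    continue
                key, _, value = item.partition("=")
                features[key] = value
        # A word form may be repeated, so analyses accumulate in a list.
        lexicon[form].append((lemma, pos, features))
    return dict(lexicon)


sample = [
    "Ala\tAl\tsubst\tcase=acc gender=m1 number=sg",
    "Ala\tAla\tsubst\tcase=nom gender=f number=sg",
    "ma\tmieć",
]
lexicon = parse_lexicon(sample)
```

Because a word form may be repeated, each form maps to a list of analyses rather than to a single lemma.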
If the text file contains part-of-speech and/or morphological information, this has to be stated explicitly with --pos and --morpho respectively to include this data in the analysis or in the construction of a binary version. This information will be saved in the binary version. The --morpho option implies --pos. The default separators (tab for columns, space for inner-column features) can be changed with --primary-separator and --secondary-separator respectively.
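The effect of the two flags on the columns can be illustrated with a small Python sketch. It only mirrors the rules stated above (extra columns are used only when requested, and --morpho implies --pos); how lamerlemma implements this internally is not documented here, and the function name `kept_columns` is hypothetical.

```python
def kept_columns(columns, pos=False, morpho=False):
    """Return the columns that would be used for the given flag settings.

    Hypothetical illustration of the documented --pos / --morpho rules.
    """
    if morpho:
        pos = True  # --morpho implies --pos
    # Word form and lemma are always used; each enabled flag adds one column.
    limit = 2 + int(pos) + int(morpho)
    return columns[:limit]


entry = ["ma", "mieć", "verb", "number=sg aspect=imperf person=ter tense=fin"]
without_flags = kept_columns(entry)
with_pos = kept_columns(entry, pos=True)
with_morpho = kept_columns(entry, morpho=True)
```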
The default morphological dictionary of Polish for the lamerlemma lemmatizer was created using linguistic data from SGJP, the Grammatical Dictionary of Polish.
Aliases
lemma-generator, lemmatise, lemmatiser, lemmatize, lemmatizer
Languages
de, en, es, fr, it, pl
Options
--lang arg (=guess)  language
--force-language  force using the specified language even if the text was recognised otherwise
--binary-lexicon arg (=%ITSDATA%/%LANG%.bin)  path to the lexicon in the binary format
--level arg (=3)  set word processing level 0-3 (0 - do nothing, 1 - return only base forms, 2 - add grammatical class and main attributes, 3 - add detailed attributes)
--plain-text-lexicon arg  path to the lexicon in the plain text format
--save-binary-lexicon arg  as a side effect, the lexicon in the binary format is generated
--lexeme-ids  disambiguate homonymic lexemes with lexeme ids
morfologik
Morfologik is a Polish morphological analyzer and lemmatizer. It returns morphosyntactic information for each token: base forms, grammatical class and attributes.
Values returned by Morfologik are described on the page Znaczniki Morfologika (in Polish). In general, Morfologik's tagset is similar to the tagset of the National Corpus of Polish, so see also http://nkjp.pl/poliqarp/help/ense2.html for more details.
Aliases
lemma-generator, lemmatise, lemmatiser, lemmatize, lemmatizer
Languages
pl
Examples
morfologik ! simple-writer --tags lexeme
Returns all base forms and grammatical classes for each word.
Wszędzie dobrze, ale w domu najlepiej.
wszędzie+adv
dobro+subst|dobry+adv|dobrze+adv
ala+qub|ale+conj
w+prep|wiek+brev
dom+subst
dobrze+adv
morfologik ! simple-writer --tags lemma
Returns all base forms for each word.
Ala ma kota i psa.
Al|Ala
mieć|mój
kot|kota
i
pies
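The output lines in both examples follow a simple pattern: alternatives are separated by `|`, and in the lexeme variant each base form is joined to its grammatical class with `+`. A minimal Python sketch for reading such lines is given below. It is a post-processing illustration only, not part of the toolkit; the function name `parse_alternatives` is hypothetical, and it assumes that neither `|` nor `+` occurs inside a base form.

```python
def parse_alternatives(line):
    """Split one output line into (base_form, grammatical_class) pairs.

    The class is None when the line carries base forms only.
    Hypothetical helper; assumes '|' and '+' never occur in a base form.
    """
    result = []
    for alternative in line.strip().split("|"):
        base, sep, tag = alternative.partition("+")
        result.append((base, tag if sep else None))
    return result


lexeme_line = parse_alternatives("dobro+subst|dobry+adv|dobrze+adv")
lemma_line = parse_alternatives("Al|Ala")
```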
Options
Allowed options:
--level arg (=3)  set word processing level 0-3 (0 - do nothing, 1 - return only base forms, 2 - add grammatical class and main attributes, 3 - add detailed attributes)
--dict arg (=morfologik)  set dictionary, one of morfologik, morfeusz, combined
--keep-original  keep Morfologik's original settings, i.e. do not break brief forms
--token-tag arg (=token)  tag to operate on instead of the token tag