A simple lemmatizer. It can be used with predefined lemmatizers in binary format or with text files that contain a full-form lexicon with optional grammatical information. Binary files can also be created from the specified text files for more efficient, repeated use.

By default, the dictionary text file format consists of 2 to 4 tab-separated columns with the following meaning:

  1. word form - (required) the inflected word form, may be repeated,
  2. lemma - (required) the base form, may be repeated,
  3. part-of-speech tag - (optional) a single part-of-speech tag, optionally followed by several tag-related features,
  4. morphological features - (optional) zero or more morphological features.

The simplest format consists of two columns containing only word forms and lemmas (tab as the primary separator):

    Ala\tAl
    Ala\tAla
    Alego\tAl
    Alę\tAla
    Aly\tAla
    ma\tmieć
    ma\tmój
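As an illustration only, the two-column format can be read into a lookup table with a few lines of Python. This is a minimal sketch, not the tool's own loader: the names `SAMPLE` and `load_lexicon` are hypothetical, and the sample text assumes one entry per line with a real tab character between the columns.

```python
from collections import defaultdict

# Sample in the two-column format: word form <TAB> lemma (one entry per line).
SAMPLE = "Ala\tAl\nAla\tAla\nAlego\tAl\nAlę\tAla\nAly\tAla\nma\tmieć\nma\tmój\n"

def load_lexicon(text):
    """Map each word form to the list of its lemmas, in file order.

    Hypothetical helper for illustration; not part of the lemmatizer itself.
    """
    lexicon = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        form, lemma = line.split("\t")[:2]
        if lemma not in lexicon[form]:
            lexicon[form].append(lemma)
    return dict(lexicon)

lexicon = load_lexicon(SAMPLE)
print(lexicon["Ala"])  # ['Al', 'Ala']
print(lexicon["ma"])   # ['mieć', 'mój']
```

Because a word form may be repeated, each form maps to a list of lemmas rather than to a single one.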

The same with part-of-speech tags and some morphological features (single space as secondary separator):

    Ala\tAl\tsubst\tcase=acc gender=m1 number=sg
    Ala\tAla\tsubst\tcase=nom gender=f number=sg
    ma\tmieć\tverb\tnumber=sg aspect=imperf person=ter tense=fin
    ma\tmój\tadj\tcase=nom gender=f number=sg degree=pos
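A four-column entry of this kind could be split into its parts roughly as follows. This is a sketch under stated assumptions: `parse_entry` is a hypothetical name, the third column is assumed to hold only a bare tag (as in the example entries), and every morphological feature is assumed to have the `key=value` shape.

```python
def parse_entry(line):
    """Split one four-column line into form, lemma, tag, and a feature dict.

    Hypothetical helper; assumes tab between columns and single spaces
    between key=value features in the fourth column.
    """
    form, lemma, tag, features = line.rstrip("\n").split("\t")
    # Each feature is split once on "=" into a name and a value.
    feats = dict(item.split("=", 1) for item in features.split(" ") if item)
    return {"form": form, "lemma": lemma, "pos": tag, "features": feats}

entry = parse_entry("ma\tmieć\tverb\tnumber=sg aspect=imperf person=ter tense=fin")
print(entry["lemma"], entry["pos"], entry["features"]["tense"])  # mieć verb fin
```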

If the text file contains part-of-speech and/or morphological information, this must be stated explicitly with --pos and --morpho, respectively, so that the data is included in the analysis or in the construction of a binary version. This information will be saved in the binary version. The --morpho option implies --pos. The default separators (tab for columns, space for features within a column) can be changed with --primary-separator and --secondary-separator, respectively.
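The interplay of these options can be modelled in a short Python sketch. It only mirrors the behaviour described above and is not the lemmatizer's implementation: `read_entry` and its keyword arguments are hypothetical stand-ins for --pos, --morpho, --primary-separator and --secondary-separator, and the `;` and `,` separators in the sample line are arbitrary choices for the example.

```python
def read_entry(line, pos=False, morpho=False, primary="\t", secondary=" "):
    """Keep only the columns that the (hypothetical) flags ask for."""
    pos = pos or morpho  # mirrors "--morpho implies --pos"
    columns = line.split(primary)
    entry = {"form": columns[0], "lemma": columns[1]}
    if pos and len(columns) > 2:
        # First token is the tag; any further tokens are tag-related features.
        tag, *tag_features = columns[2].split(secondary)
        entry["pos"] = tag
        entry["pos_features"] = tag_features
    if morpho and len(columns) > 3:
        entry["morpho"] = [f for f in columns[3].split(secondary) if f]
    return entry

# Same data as above, but with ";" as primary and "," as secondary separator.
line = "Ala;Al;subst;case=acc,gender=m1,number=sg"

# Without the flags, the extra columns are ignored.
print(read_entry(line, primary=";", secondary=","))
# {'form': 'Ala', 'lemma': 'Al'}

# With morpho=True, the part-of-speech column is included as well.
full = read_entry(line, morpho=True, primary=";", secondary=",")
print(full["pos"], full["morpho"])
# subst ['case=acc', 'gender=m1', 'number=sg']
```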

The default morphological dictionary of Polish for the Lammerlemma lemmatizer was created using linguistic data from the SGJP Grammatical Dictionary of Polish.


  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognised otherwise
  --binary-lexicon arg (=%ITSDATA%/%LANG%.bin)
                                        path to the lexicon in the binary 
                                        format
  --level arg (=3)                      set word processing level 0-3 (0 - do 
                                        nothing, 1 - return only base forms, 2 
                                        - add grammatical class and main 
                                        attributes, 3 - add detailed 
                                        attributes)
  --plain-text-lexicon arg              path to the lexicon in the plain text 
                                        format
  --save-binary-lexicon arg             as a side effect the lexicon in the 
                                        binary format is generated

