PSI-Pipe specification

readers

apertium-reader

Apertium-reader reads text in various markup formats, such as HTML documents, RTF files, OpenOffice Writer ODT files and Microsoft Office 2007 formats (docx, xlsx and pptx). The default format for apertium-reader is html. To read text from doc files, use doc-reader.
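The deformatting step amounts to stripping markup and keeping only the text fragments. A minimal sketch of that idea for HTML input, using Python's standard html.parser (an illustration of the concept only, not the actual apertium-reader implementation):

```python
from html.parser import HTMLParser

class FragmentExtractor(HTMLParser):
    """Collect non-empty text fragments, discarding all markup tags."""
    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fragments.append(text)

extractor = FragmentExtractor()
extractor.feed("<html><body><h1>Header</h1>"
               "<p>Text in first paragraph.</p>"
               "<p>Second paragraph.</p></body></html>")
print(extractor.fragments)
# ['Header', 'Text in first paragraph.', 'Second paragraph.']
```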

Examples

apertium-reader --format docx ! simple-writer --tags frag

Reads a DOCX file (selected with the --format docx option) and writes only the text fragments.

in:
/storage/18cb02b23c80441c21e9163057d78a80.UNKNOWN
out:
Przykładowy nagłówek.
Przykładowy tekst pierwszego akapitu.
Tekst w drugim akapicie.
apertium-reader --format rtf ! simple-writer --tags frag

Reads an RTF file (selected with the --format rtf option) and writes only the text fragments, using simple-writer.

in:
/storage/ed33e779887ea8587d9de91e44a34e9e.rtf
out:
Title
Text in first paragraph.
Second paragraph.
apertium-reader ! simple-writer --tags frag

Reads HTML file and outputs only text content.

in:
/storage/778ca8d7d38111cb4efe0cbc2268f3ff.html
out:
Header
Text in first paragraph.
Second paragraph.

Options

Allowed options:
  --format arg (=html)     type of file for deformatting
  --specification-file arg specification file path
  --unzip-data arg (=1)    unzip compressed file formats like .pptx or .xlsx
  --keep-tags              keep formatting tags

guessing-reader

Guessing-reader tries to guess the input format, so that the reader does not have to be specified manually.
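The guessing step can be pictured as a magic-number check on the first block of the input. The checks below are illustrative assumptions, not the actual heuristics of guessing-reader:

```python
def guess_format(data: bytes, block_size: int = 32) -> str:
    """Guess the input format from the first block_size bytes
    (hypothetical magic-number checks for illustration only)."""
    head = data[:block_size].lstrip()
    if head.startswith(b"%PDF"):
        return "pdf"
    if head.startswith(b"PK"):  # ZIP container, e.g. docx/xlsx/pptx/odt
        return "zip-based"
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if head.lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return "txt"  # fall back to plain text

print(guess_format(b"<html><body>Header</body></html>"))  # html
```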

Aliases

guess-format, guess-input

Examples

guess-input ! simple-writer --tags frag

Reads HTML file content without using a dedicated reader.

Options

Allowed options:
  --block-size arg (=32) the size of the input data used to determine the 
                         format

nkjp-reader

NKJP Reader reads texts from the XML files of the Polish National Corpus into the PSI-lattice. You can find more information about NKJP on the project's website: http://nkjp.pl

Papers describing NKJP format and sample files: http://nlp.ipipan.waw.pl/TEI4NKJP/

Aliases

read-nkjp

Examples

nkjp-reader ! simple-writer --tags token --sep / --spec sentence \\n

Read from NKJP (Polish National Corpus) and print segmentation information.

in:
/storage/9fcec560a4db8608f0b0e88f697090d4.xml
out:
Za/swe/publikacje/naukowe/został/7-krotnie/wyróżniony/nagrodą/Rektora/UŁ/.
Był/doradcą/ds/./strategii/marketingowych/.

pdf-reader

PDF Reader reads text from a PDF file into the lattice. To use it, you need to have the poppler-glib library installed on your system.

Aliases

read-pdf

Examples

read-pdf ! write --tags frag

Read text from PDF file.

psi-reader

PSI Reader reads a file in PSI format into the lattice.

Aliases

lattice-reader, read-lattice, read-psi

Examples

psi-reader ! write --tags frag

Read PSI format, write simple text.

in:
/storage/d6f880133a719061199f7417a3a1a0d9.psi
out:
Lorem ipsum dolor sit amet
consectetur
adipiscing elit.

txt-reader

Text Reader reads a file of plain text into the lattice. Each line of the source text (without trailing newline character) becomes a single edge in the lattice (of category FRAG).
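That line-to-edge mapping can be sketched as follows, with each edge modelled as a simple (begin, length, text, category) tuple (a simplified model of the lattice, not the actual implementation):

```python
def lines_to_frag_edges(text: str):
    """Turn each line (without its trailing newline) into a FRAG edge;
    the newline characters themselves become separate one-character spans."""
    edges, pos = [], 0
    for line in text.split("\n"):
        if line:
            edges.append((pos, len(line), line, "FRAG"))
        pos += len(line)
        if pos < len(text):  # the newline separating two lines
            edges.append((pos, 1, "\n", None))
            pos += 1
    return edges

edges = lines_to_frag_edges(
    "Lorem ipsum dolor sit amet\nconsectetur\nadipiscing elit.")
```

The begin and length values produced here match those in the psi-writer example below.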

The options listed below are not implemented yet.

Aliases

read, read-text, read-txt, text-reader

Examples

txt-reader ! psi-writer

Read text into lattice and write its contents in PSI format.

in:
Lorem ipsum dolor sit amet
consectetur
adipiscing elit.
out:
## beg. len.  text         tags                  annot.text  annotations
01 0000 26    Lorem_ipsum_dolor_sit_amet frag,txt-reader Lorem_ipsum_dolor_sit_amet FRAG[]
02 0026 01    \n           ∅                     ∅           ∅
03 0027 11    consectetur  frag,txt-reader       consectetur FRAG[]
04 0038 01    \n           ∅                     ∅           ∅
05 0039 16    adipiscing_elit. frag,txt-reader   adipiscing_elit. FRAG[]

Options

Allowed options:
  --line-by-line           processes line by line
  --whole-text             read the whole text
  --paragraphs             paragraphs are delimited with double newlines
  --discard-comments       discards comments
  --pass-through-comments  marks comments as single markup

utt-reader

UTT Reader reads a file in UTT (UAM Text Tools) format into the lattice.

UTT is a package of language processing tools developed at Adam Mickiewicz University. You can find more information about the project at http://utt.amu.edu.pl.

The detailed description of the UTT file format can be found at http://utt.amu.edu.pl/files/utt.html#UTT-file-format.

Aliases

read-utt

Examples

utt-reader ! psi-writer

Convert UTT format to PSI format.

in:
     0000 00 BOS *
     0000 07 W Piszemy lem:pisać,V
     0007 01 S _
     0008 05 W dobre lem:dobry,ADJ
     0013 01 S _
     0014 08 W progrumy cor:programy lem:program,N
     0022 01 P .
     0023 00 EOS *
     0023 01 S _
     0024 00 BOS *
     0024 11 W Warszawiacy lem:Warszawiak,N
     0035 01 S _
     0036 03 W też
     0039 01 P .
     0040 00 EOS *
out:
## beg. len.  text         tags                  annot.text  annotations
01 0000 07    Piszemy      token                 Piszemy     'Piszemy',lem=pisać,V
02 0007 01    _            token                 _           '_'
03 0008 05    dobre        token                 dobre       'dobre',lem=dobry,ADJ
04 0013 01    _            token                 _           '_'
05 0014 08    progrumy     token                 progrumy    'progrumy',lem=program,N,cor=programy
06 0022 01    .            token                 .           '.'
07 0000 23    Piszem...my. sentence              Piszemy_dobre_progrumy. sen[1-2-3-4-5-6]
08 0023 01    _            token                 _           '_'
09 0024 11    Warszawiacy  token                 Warszawiacy 'Warszawiacy',lem=Warszawiak,N
10 0035 01    _            token                 _           '_'
11 0036 04    też          token                 też         'też'
12 0040 01    .            token                 .           '.'
13 0024 17    Warsza...eż. sentence              Warszawiacy_też. sen[9-10-11-12]

annotators

converters

joiner

Joiner combines all the edges matching one tag mask ("left" mask given with --left-mask option) with all the edges spanning the same pair of vertices, having the same parent edge and matching another tag mask (--right-mask option). In other words, joiner generates a Cartesian product of two sets of edges for the same pair of vertices and the same parent edge.
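The Cartesian-product behaviour can be sketched like this, with edges modelled as plain dicts and tag masks as sets of tags (a simplified illustration; the sample form and valency edges, and their texts, are made up):

```python
from itertools import product

def join_edges(edges, left_mask, right_mask, out_tags):
    """Pair every edge matching left_mask with every edge matching
    right_mask within each (span, parent) group: a Cartesian product."""
    groups = {}
    for e in edges:
        groups.setdefault((e["span"], e["parent"]), []).append(e)
    joined = []
    for (span, parent), group in groups.items():
        lefts = [e for e in group if left_mask <= e["tags"]]
        rights = [e for e in group if right_mask <= e["tags"]]
        for left, right in product(lefts, rights):
            joined.append({"span": span, "parent": parent,
                           "tags": set(out_tags),
                           "text": left["text"]})  # by default the left text
    return joined

edges = [
    {"span": (0, 4), "parent": 1, "tags": {"form"}, "text": "mogę"},
    {"span": (0, 4), "parent": 1, "tags": {"valency"}, "text": "móc [np:acc]"},
]
out = join_edges(edges, {"form"}, {"valency"}, {"parse-terminal"})
```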

Options

Allowed options:
  --lang arg (=guess)      language
  --force-language         force using specified language even if a text was 
                           recognized otherwise
  --left-mask arg          tag mask specification for the "left" mask
  --right-mask arg         tag mask specification for the "right" mask
  --out-tags arg           tags for generated edges
  --take-right-text        take the text field from the "right" edge (by 
                           default, the left one is used)
  --take-right-category    take the category field from the "right" edge (by
                           default, the left one is used)
  --take-left-attributes   take only attributes from the "left" edge (by
                           default the attributes from the "left" one and the
                           "right" one are merged)
  --take-right-attributes  take only attributes from the "right" edge
  --no-outer-join          switch off "outer" join
  --extended-outer-join    switch on "extended" outer join

morfologik-to-gobio-converter

This processor is an alias for tagset-converter with options --lang pl.

A tool for converting Morfologik tags to Gobio tags.

Languages

pl

pl-join-forms-and-valency

This processor is an alias for joiner with options --lang pl --left-mask form,gobio-tagset --right-mask gobio-tagset,valency --out-tags parse-terminal,!parse,!pl --take-right-text --extended-outer-join.

A tool that combines form edges and valency edges. It creates a kind of Cartesian product of edges in question.

Languages

pl

selector

Selects edges tagged with specified tags and creates new edges based on the selected edges.

Examples

iayko ! niema ! morfologik ! selector --fallback-tag iayko ! psi-writer

Basic usage of selector. From edges tagged as "conditional", selector selects the ones whose corresponding "form" edges satisfy the given condition, and tags them as "selected".

in:
niemódz
out:
## beg. len.  text         tags                  annot.text  annotations
01 0000 *@0   nie          !pl,conditional,niema,normalization,token nie T,condition=\*[]
02 0000 *@0   nie          !pl,morfologik,morfologik-tagset,normalization nie term[1]
03 0000 *@0   nie          !pl,lemma,morfologik,morfologik-tagset nie word[2]
04 0000 *@0   nie+qub      !pl,lexeme,morfologik,morfologik-tagset nie+qub qub[3]
05 0000 *@0   nie          !pl,form,morfologik,morfologik-tagset nie qub[4]
06 0000 *@0   on           !pl,lemma,morfologik,morfologik-tagset on word[2]
07 0000 *@0   on+ppron3    !pl,lexeme,morfologik,morfologik-tagset on+ppron3 ppron3[6]
08 0000 *@0   nie          !pl,form,morfologik,morfologik-tagset nie ppron3,case=acc,gender=f,number=pl,person=ter,post-prepositionality=praep[7]
09 0000 *@0   nie          !pl,form,morfologik,morfologik-tagset nie ppron3,case=acc,gender=m2,number=pl,person=ter,post-prepositionality=praep[7]
10 0000 *@0   nie          !pl,form,morfologik,morfologik-tagset nie ppron3,case=acc,gender=m3,number=pl,person=ter,post-prepositionality=praep[7]
11 0000 *@0   nie          !pl,form,morfologik,morfologik-tagset nie ppron3,case=acc,gender=n,number=pl,person=ter,post-prepositionality=praep[7]
12 0000 *@0   nie          !pl,form,morfologik,morfologik-tagset nie ppron3,case=acc,gender=p2,number=pl,person=ter,post-prepositionality=praep[7]
13 0000 *@0   nie          !pl,form,morfologik,morfologik-tagset nie ppron3,case=acc,gender=p3,number=pl,person=ter,post-prepositionality=praep[7]
14 0000 *@0   nie          !pl,form,morfologik,morfologik-tagset nie ppron3,case=acc,gender=n,number=sg,person=ter,post-prepositionality=praep[7]
15 0000 *@0   nie          selected,token        nie         T,condition=\*[]
16 @0   *@1   _            !pl,conditional,niema,normalization,token _ B[]
17 0000 08    niemódz      frag,txt-reader       niemódz     FRAG[]
18 0000 08    niemódz      !pl,lang-guesser,text niemódz     TEXT[17]
19 0000 08    niemódz      !pl,token             niemódz     T
20 0000 08    niemódz      !pl,iayko,normalization,token niemóc T[]
21 @1   *0008 móc          !pl,conditional,niema,normalization,token móc T,condition=verb[]
22 0000 08    niemódz      !pl,conditional,niema,normalization,token niemócy T,condition=verb&aspect=imperf&number=pl&person=pri[]
23 @1   *0008 móc          !pl,morfologik,morfologik-tagset,normalization móc term[21]
24 @1   *0008 móc          !pl,lemma,morfologik,morfologik-tagset móc word[23]
25 @1   *0008 móc+verb     !pl,lexeme,morfologik,morfologik-tagset móc+verb verb,aspect=imperf[24]
26 @1   *0008 móc          !pl,form,morfologik,morfologik-tagset móc verb,aspect=imperf,tense=inf[25]
27 @1   *0008 móc          selected,token        móc         T,condition=verb[]
diachronize --line-by-line

Diachronic normalization, line by line

in:
będziem
chciał
niechciał
niemódz
out:
będziemy
chciał
nie chciał
nie móc
diachronize

Diachronic normalization

in:
będziem chciał niechciał niemódz
out:
będziemy chciał nie chciał nie móc

Options

Allowed options:
  --lang arg (=guess)              language
  --force-language                 force using specified language even if a 
                                   text was recognized otherwise
  --in-tag arg (=conditional)      tag to select when condition succeeds
  --fallback-tag arg (=token)      tag to select when condition fails
  --test-tag arg                   tag to test condition
  --out-tags arg (=selected,token) tags to mark selected edges
  --with-blank                     do not skip edges with whitespace text

tagset-converter

A tag converter specifically geared toward, but not limited to, conversion of morphological tags.

Tagset-converter has some predefined sets of rules for several configurations, but if you want, you can provide your own rule file with the --rules option. In general, there are two kinds of rules: simple substitutions and more complex if-then rules. A detailed tutorial on how to write your own rule sets for tagset-converter is being prepared.
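The simple-substitution kind of rule amounts to a lookup table applied tag by tag. The mapping below is purely hypothetical and is not taken from any real rule file:

```python
def convert_tags(tags, substitutions):
    """Apply simple one-to-one tag substitutions; tags without a rule
    pass through unchanged."""
    return [substitutions.get(tag, tag) for tag in tags]

# hypothetical source-to-target tag mapping
subs = {"subst": "noun", "adj": "adjective"}
print(convert_tags(["subst", "sg", "adj"], subs))
# ['noun', 'sg', 'adjective']
```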

Options

Allowed options:
  --lang arg (=guess)               language
  --force-language                  force using specified language even if a 
                                    text was recognized otherwise
  --rules arg (=%ITSDATA%/%LANG%.u) rules file

lemmatizers

lamerlemma

A simple lemmatizer. It can be used with predefined lemmatizers in binary format or with text files that contain a full-form lexicon with optional grammatical information. It is also possible to create binary files from the specified text files for more efficient, repeated use.

Aliases

lemma-generator, lemmatise, lemmatiser, lemmatize, lemmatizer

Languages

de, en, es, fr, it, pl

Options

  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognized otherwise
  --binary-lexicon arg (=%ITSDATA%/%LANG%.bin)
                                        path to the lexicon in the binary 
                                        format
  --level arg (=3)                      set word processing level 0-3 (0 - do 
                                        nothing, 1 - return only base forms, 2 
                                        - add grammatical class and main 
                                        attributes, 3 - add detailed 
                                        attributes)
  --plain-text-lexicon arg              path to the lexicon in the plain text 
                                        format
  --save-binary-lexicon arg             as a side effect the lexicon in the 
                                        binary format is generated

morfologik

Morfologik is a Polish morphological analyzer and lemmatizer. It returns morphosyntactic information for each token: base forms, grammatical class and attributes.

Values returned by Morfologik are described on the page Znaczniki Morfologika (in Polish). In general, Morfologik's tagset is similar to the tagset of the National Corpus of Polish, so you can also see http://nkjp.pl/poliqarp/help/ense2.html for more details.

Aliases

lemma-generator, lemmatise, lemmatiser, lemmatize, lemmatizer

Languages

pl

Examples

morfologik ! simple-writer --tags lemma

Returns all base forms for each word.

in:
Ala ma kota i psa.
out:
Al|Ala
mieć|mój
kot|kota
i
pies
morfologik ! simple-writer --tags lexeme

Returns all base forms and grammatical classes for each word.

in:
Wszędzie dobrze, ale w domu najlepiej.
out:
wszędzie+adv
dobro+subst|dobry+adv|dobrze+adv
ala+qub|ale+conj
w+prep|wiek+brev
dom+subst
dobrze+adv

Options

Allowed options:
  --level arg (=3)         set word processing level 0-3 (0 - do nothing, 1 - 
                           return only base forms, 2 - add grammatical class 
                           and main attributes, 3 - add detailed attributes)
  --dict arg (=morfologik) set dictionary, one of morfologik, morfeusz, 
                           combined
  --keep-original          keep the original Morfologik settings, i.e. do not
                           break brief forms

lexicons

bilexicon

Simple bilingual lexicon for machine translation.

Languages

de, en, es, fr, it, pl

Examples

tp-tokenizer --lang pl ! morfologik ! bilexicon --lang pl --trg-lang en ! simple-writer --tags bilexicon

Read text, tokenize, produce morphological interpretations of each word, generate translations for all morphological interpretations, and return simplified output, filtered to show only the translations.

in:
powieść o kosmitach
out:
novel+subst|conduct+subst|succeed+verb
about+prep|of+prep|on+prep|for+prep|against+prep|at+prep
spaceman+subst|alien+subst|extraterrestrial+subst

Options

Allowed options:
  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognized otherwise
  --trg-lang arg                        target language
  --binary-lexicon arg (=%ITSDATA%/%LANG%%TRGLANG%.bin)
                                        path to the lexicon in the binary 
                                        format
  --plain-text-lexicon arg              path to the lexicon in the plain text 
                                        format
  --save-binary-lexicon arg             as a side effect the lexicon in the 
                                        binary format is generated

delemma-pl

This processor is an alias for mapper with options --lang pl --in-tags token --out-tags form,inflector --consider-text --add-attributes --binary %ITSDATA%/pl-inflector.bin.

An inflection lexicon for Polish, used to generate inflected forms from lemmas.

Languages

pl

generate-lexdb-forms-en

This processor is an alias for mapper with options --lang pl --in-tags form --out-tags token --consider-text --consider-category --clone-attributes --binary %ITSDATA%/en-lexdb-generation-forms.bin.

A lexicon of lexdb forms for English.

Languages

pl

generate-lexdb-forms-es

This processor is an alias for mapper with options --lang pl --in-tags form --out-tags token --consider-text --consider-category --clone-attributes --binary %ITSDATA%/es-lexdb-generation-forms.bin.

A lexicon of lexdb forms for Spanish.

Languages

pl

mapper

Universal lexicon for generic mapping tasks. Uses the following text file format:

key value_text [value_category [value_feature1,value_feature2,...,value_featureN]]

Whitespace separates fields. The value category and value features are optional, but value features require the presence of a value category. Value features are separated by ,; their number is not restricted. Value features of the form key=value are split accordingly; value features of the form value are treated as value=1.

Complex keys use , as a separator: text precedes category, and category precedes attributes. Attribute lists need to be sorted in a locale-neutral way.
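A parser for this line format might look as follows (a sketch only; the sample entry is made up):

```python
def parse_lexicon_line(line: str):
    """Parse one line of the plain-text lexicon format:
    key value_text [value_category [feat1,feat2,...]]
    Features of the form key=value are split on '='; a bare feature
    f is treated as f=1, as described above."""
    fields = line.split()
    key, value_text = fields[0], fields[1]
    category = fields[2] if len(fields) > 2 else None
    features = {}
    if len(fields) > 3:
        for feature in fields[3].split(","):
            name, _, value = feature.partition("=")
            features[name] = value if value else "1"
    return key, value_text, category, features

entry = parse_lexicon_line("pies dog subst number=sg,animate")
print(entry)
# ('pies', 'dog', 'subst', {'number': 'sg', 'animate': '1'})
```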

Options

Allowed options:
  --lang arg (=guess)          language
  --force-language             force using specified language even if a text 
                               was recognized otherwise
  --in-tags arg                map edges containing all specified tags
  --out-tags arg               add tags to mapped edges
  --consider-text              consider text for key creation
  --consider-category          consider category for key creation
  --consider-attributes        consider attributes (sorted) for key creation
  --clone-text                 clone original text to mapped edge
  --clone-category             clone original category to mapped edge
  --clone-attributes           clone original attributes to mapped edge
  --unknown-clone-text         clone original text to mapped edge for unknown 
                               keys
  --unknown-clone-category     clone original category to mapped edge for 
                               unknown keys
  --unknown-clone-attributes   clone original attributes to mapped edge for 
                               unknown keys
  --set-text arg               set mapped edge text
  --set-category arg           set mapped edge category
  --set-attributes arg         set mapped edge attributes
  --unknown-set-text arg       set mapped edge text for unknown keys
  --unknown-set-category arg   set mapped edge category for unknown keys
  --unknown-set-attributes arg set mapped edge attributes for unknown keys
  --add-attributes             add attributes from lexicon to mapped edge
  --binary-lexicon arg         path to the lexicon in the binary format
  --plain-text-lexicon arg     path to the lexicon in the plain text format
  --save-binary-lexicon arg    as a side effect the lexicon in the binary 
                               format is generated

pl-valency

This processor is an alias for mapper with options --lang pl --in-tags gobio-tagset,lexeme --out-tags gobio-tagset,valency --consider-text --clone-text --clone-category --add-attributes --binary %ITSDATA%/pl-valency.bin.

A valency lexicon for Polish.

[open in single page]

Languages

pl

normalizers

iayko

Iayko is a normalizer for diachronic normalization.

Aliases

fst-normalizer, iayko-normalizer, thrax-normalizer

Languages

pl

Examples

iayko --lang pl --fsts transducers.txt

Normalizes old text to its modern version. The list of transducers is given in the file transducers.txt.

in:
naley wody w puhar i doday iedno iayko
out:
nalej wody w puchar i dodaj iedno iajko
iayko --grm my_grammar.grm --fst MyTransducer --save-far compiled_grammar.far

Performs normalization using the transducer MyTransducer from a text file with a grammar written in Thrax (my_grammar.grm). The grammar is compiled to a FAR archive and saved to the file compiled_grammar.far.

in:
This is a sample sentence.
out:
Thif if a fample fentence.
iayko --lang pl

Basic usage of iayko. Normalizes old text to modern version, using the default set of finite-state rules.

in:
naley wody w puhar i doday iedno iayko
out:
nalej wody w puchar i dodaj jedno jajko
iayko --far my_grammar.far --fst MyTransducer

Performs normalization using the transducer MyTransducer from the FAR archive my_grammar.far.

in:
This is a sample sentence.
out:
Thif if a fample fentence.
iayko --lang pl --fst Rule001_09

Normalize old text to modern version, using only the transducer Rule001_09 (Feliński’s rule „y/i→j”) from the default set of rules.

in:
naley wody w puhar i doday iedno iayko
out:
nalej wody w puhar i dodaj iedno iajko
iayko --lang pl --spec %ITSDATA%/%LANG%/all.far Rule001_09 %ITSDATA%/%LANG%/all.far Rule100_02

Normalize old text to modern version, using two transducers from the default set of rules: Rule001_09 (Feliński’s rule „y/i→j”) and Rule100_02 („puhar→puchar”).

in:
naley wody w puhar i doday iedno iayko
out:
nalej wody w puchar i dodaj iedno iajko

Options

Allowed options:
  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognized otherwise
  --far arg (=%ITSDATA%/%LANG%/all.far) far archive with rules
  --fst arg                             fst name inside far
  --fsts arg (=%ITSDATA%/%LANG%/rules.txt)
                                        file with fst names to be used as a 
                                        cascade
  --spec arg                            specification of more far:fst pairs to 
                                        be used as cascade
  --grm arg                             text file with rules written in Thrax
  --md arg                              text file with rules written in Thrax 
                                        and their description in Markdown
  --save-far arg                        where to save the far archive compiled 
                                        from grm file
  --bypass-exceptions                   bypass exceptions
  --exceptions arg (=%ITSDATA%/%LANG%/exceptions.lst)
                                        a text file with list of exceptions

niema

Niema is a conditional normalizer for diachronic normalization.

Aliases

fst-normalizer, niema-normalizer, thrax-normalizer

Languages

pl

Options

Allowed options:
  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognized otherwise
  --far arg (=%ITSDATA%/%LANG%/all.far) far archive with rules
  --fst arg                             fst name inside far
  --condition arg                       condition for fst
  --conditions arg (=%ITSDATA%/%LANG%/conditions.txt)
                                        file with conditions
  --spec arg                            specification of more far:fst pairs to 
                                        be used as cascade
  --grm arg                             text file with rules written in Thrax
  --md arg                              text file with rules written in Thrax 
                                        and their description in Markdown
  --save-far arg                        where to save the far archive compiled 
                                        from grm file
  --save-conditions arg                 where to save the conditions to file
  --bypass-exceptions                   bypass exceptions
  --exceptions arg (=%ITSDATA%/%LANG%/exceptions.lst)
                                        a text file with list of exceptions

simplenorm-normalizer

Normalizes tokens according to rules given in TSV files.

Aliases

simple-normalizer, simplenorm

Languages

en, pl

Options

Allowed options:
  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognized otherwise
  --rules arg (=%ITSDATA%/%LANG%/normalization.tsv)
                                        rule file

parsers

gobio

A deep parser based on the parser used in Translatica machine translation system.

Gobio operates on morphologically annotated text.

Gobio has some predefined sets of rules for several languages, but if you want, you can provide your own rule file with the --rules option. The rules for gobio are, in general, context-free grammar rules. A tutorial on how to write your own rule sets for gobio is being prepared.

Aliases

parse, parse-generator, parser

Languages

de, pl, test

Examples

--line-by-line gobio --lang pl --terminal-tag parse-terminal ! bracketing-writer --disamb --tags parse --opening-bracket %c[

Parse Polish sentences line by line and print simplified constituent tree for each sentence.

in:
Komputer czyta zdania.
Każde zdanie ma swoje drzewo składniowe.
Zrobiłem już trzy zdania.
out:
FR[R[Komputer]] fin[czyta] FR[R[zdania]].
FR[ZP[Każde] R[zdanie]] FP[P[ma]] FR[ZP[swoje] R[drzewo] FP[P[składniowe]]].
praet[Zrobiłem] FPS[PS[już]] FR[LG[trzy] R[zdania]].

Options

Allowed options:
  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognized otherwise
  --edge-number-limit arg (=-1)         maximal number of edges inserted 
                                        between each two vertices
  --rules arg (=%ITSDATA%/%LANG%/rules.g)
                                        file with rules in text format
  --terminal-tag arg (=parse-terminal)  tag for terminal

puddle

A shallow parser based on the Spejd shallow parser originally developed at IPI PAN (http://zil.ipipan.waw.pl/Spejd/). For input, Puddle requires morphologically annotated text, as produced, for instance, by the morfologik processor. It may also serve as a disambiguation tool itself, or be chained with a POS tagger (e.g. the metagger processor).

Note that text needs to be annotated morphologically before passing it to puddle.

Currently, rules and tagsets are available for Polish only and are used by default if not specified otherwise. The Polish parsing rules are for demonstration purposes only and are by no means complete.

Aliases

parse, parse-generator, parser

Languages

fr, pl

Options

Allowed options:
  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognized otherwise
  --tagset arg (=%ITSDATA%/%LANG%/tagset.%LANG%.cfg)
                                        tagset file
  --rules arg (=%ITSDATA%/%LANG%/rules.%LANG%)
                                        rules file

segmenters

srx-segmenter

Splits texts into segments (i.e. sentences) according to rules defined in an SRX (Segmentation Rules Exchange) file. In terms of PSI-Toolkit lattices, segment edges are extracted from frag edges.
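An SRX rule pairs a before-break pattern with an after-break pattern, and non-breaking exceptions are checked before breaking rules. A minimal regex sketch of that idea, with two hand-written rules instead of a real SRX file:

```python
import re

# (before_pattern, after_pattern, is_break), checked in order:
# the non-breaking abbreviation exception first, then the breaking rule.
RULES = [
    (re.compile(r"\b(m\.in|e\.g|i\.e)\.$"), re.compile(r"\s"), False),
    (re.compile(r"[.!?]$"), re.compile(r"\s"), True),
]

def segment(text: str):
    """Split text where the first matching rule is a breaking one
    (a much simplified model of SRX semantics)."""
    segments, start = [], 0
    for i in range(1, len(text)):
        before, after = text[:i], text[i:]
        for before_re, after_re, is_break in RULES:
            if before_re.search(before) and after_re.match(after):
                if is_break:
                    segments.append(text[start:i])
                    start = i
                break
    segments.append(text[start:])
    return segments

print(segment("Zwiedziłem wiele krajów, m.in. Niemcy, Francję, Kanadę."
              " Uwielbiam podróżować!"))
```

Here the non-breaking rule prevents a split after the abbreviation m.in., while the breaking rule fires after the full stop following Kanadę.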

Aliases

segment, segment-generator, segmenter

Languages

de, en, es, fi, fr, it, pl, ru, tr, xx

Examples

segment --lang pl ! write-simple --tags segment

Splits a Polish text into sentences.

in:
Zwiedziłem wiele krajów, m.in. Niemcy, Francję, Kanadę. Uwielbiam podróżować!
out:
Zwiedziłem wiele krajów, m.in. Niemcy, Francję, Kanadę.
 Uwielbiam podróżować!
segment --lang en ! write-simple --tags segment

Splits an English text into sentences.

in:
I've been to many countries, e.g. Germany, France, Canada. I enjoy travelling.
out:
I've been to many countries, e.g. Germany, France, Canada.
 I enjoy travelling.

Options

Allowed options:
  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognized otherwise
  --rules arg (=%ITSDATA%/%LANG%/segmentation.srx)
                                        rule file
  --cascade                             force cascade mode
  --sentence-length-hard-limit arg (=1000)
                                        maximum length (in bytes, not in 
                                        characters) of a sentence (if, 
                                        according to rules, a sentence of a 
                                        greater length would be generated, a 
                                        sentence break is forced), zero turns 
                                        the limit off
  --sentence-length-soft-limit arg (=600)
                                        soft limit on the length (in bytes) of 
                                        a sentence (sentence break is forced 
                                        only on spaces), zero turns the limit 
                                        off

spell checkers

aspell

Runs the aspell spellchecker on tokenized text. This is a wrapper for the original aspell library. Not all aspell options are mirrored in PSI-Toolkit (the available options are listed below).

For more about aspell, see http://aspell.net/.

Aliases

spell, spell-check, spell-checker, spellcheck, spellchecker

Languages

af, am, ar, ast, az, be, bg, bn, br, ca, cs, csb, cy, da, de, de-alt, el, en, eo, es, et, fa, fi, fo, fr, fy, ga, gd, gl, grc, gu, gv, he, hi, hil, hr, hsb, hu, hus, hy, ia, id, is, it, kn, ku, ky, la, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, nb, nds, nl, nn, ny, or, pa, pl, pt_BR, pt_PT, qu, ro, ru, rw, sc, sk, sl, sr, sv, sw, ta, te, tet, tk, tl, tn, tr, uk, uz, vi, wa, yi, zu

Examples

tokenize ! aspell --lang en

Basic usage of aspell. Shows tokens with their corrected forms (if any).

in:
I enjoy travleling.
out:
I
enjoy
travleling|travelling|traveling|travailing|travellings|ravelling
.

Options

Allowed options:
  --lang arg (=guess)   language
  --force-language      force using specified language even if a text was 
                        recognized otherwise
  --limit arg (=5)      display limited number of correction prompts; if set to
                        zero, display all
  --size arg            the preferred size of the word list; this consists of 
                        a two-character code describing the size of the list, 
                        with typical values of: 10=tiny, 20=really small, 
                        30=small, 40=med-small, 50=med, 60=med-large, 
                        70=large, 80=huge, 90=insane
  --personal arg        personal word list file name (precede with ./ to use 
                        the current directory)
  --repl arg            replacements list file name (precede with ./ to use 
                        the current directory)
  --ignore arg          ignore words with N characters or less
  --keyboard arg        the base name of the keyboard definition file to use
  --sug-mode arg        suggestion mode = `ultra' | `fast' | `normal' | `slow' 
                        | `bad-spellers'.
  --ignore-case         ignore case when checking words

taggers

inflector

Inflector is a small postprocessing tool that turns lemmatized text into inflected forms based on context and other features. To turn a sequence of lemmas into their inflected forms, use for instance:

delemma-pl ! inflector --lang pl

Try for instance the phrase "do dom kobieta z Wrocław", which should be turned into "do domu kobiety z Wrocławia". Unused forms carry the "discarded" flag; forms without this flag are the chosen inflected versions. The current inflection model has been built on automatically lemmatized text; better results could be achieved with a bigger, part-of-speech-tagged or manually annotated training text.

Options

Allowed options:
  --lang arg (=guess)                 language
  --force-language                    force using specified language even if a 
                                      text was recognised otherwise
  --model arg (=%ITSDATA%/%LANG%.blm) model file
  --iterations arg (=50)              number of iterations
  --unknown-pos arg (=ign)            unknown part of speech label
  --cardinal-number-pos arg (=card)   cardinal number part of speech label
  --proper-noun-pos arg (=name)       proper noun part of speech label
  --open-class-labels arg             open class labels
  --train                             training mode
  --save-text-model-files             saves text model files in training model

lang-guesser

The language identification tool uses previously created language bigram models to guess the language of the input text. If the examined text is shorter than 24 characters, the language is guessed from the occurrences of non-standard letters in each defined language.
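
The bigram-model idea can be sketched as follows — a minimal illustration with tiny inline "models" (the real tool ships pre-built models and uses the short-text letter heuristic described above):

```python
from collections import Counter

def bigrams(text: str) -> Counter:
    """Count overlapping two-character sequences."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

# Hypothetical tiny bigram models built from one sample sentence each.
MODELS = {
    "en": bigrams("the quick brown fox jumps over the lazy dog"),
    "de": bigrams("der schnelle braune fuchs springt ueber den faulen hund"),
}

def guess(text: str) -> str:
    """Score the input's bigrams against each model; highest overlap wins."""
    counts = bigrams(text.lower())
    def score(model: Counter) -> int:
        return sum(min(n, model[bg]) for bg, n in counts.items())
    return max(MODELS, key=lambda lang: score(MODELS[lang]))
```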

Aliases

guess-lang, guess-language

Examples

guess-language ! simple-writer --tags !en

Selects sentences in English from multi-language text.

in:
Die Familie Grimm war in Hanau beheimatet. 
Jacob Ludwig Carl Grimm, born on 4 January 1785, was 13 months older than his brother Wilhelm Carl Grimm.
Obaj bracia byli członkami Akademii Nauk w Berlinie i uczonymi (językoznawcami), o znacznym dorobku.
out:
Jacob Ludwig Carl Grimm, born on 4 January 1785, was 13 months older than his brother Wilhelm Carl Grimm.

Options

Allowed options:
  --default-language arg (=xx) Language code to be used for unrecognized 
                               strings, use 'none' to turn off putting a 
                               default language code
  --force                      All frags must be marked as text in some 
                               language
  --only-langs arg             Guesses language only from the given list of 
                               languages

metagger

Metagger (Maximum Entropy Tagger) is a simplistic part-of-speech tagger that can be easily custom-trained. For the tagger to work, it is necessary to include any morphological analyzer in the pipeline before the tagger is used.

Currently no pretrained part-of-speech models are available, which renders the tagger unusable unless you provide your own models.

Options

Allowed options:
  --lang arg (=guess)                 language
  --force-language                    force using specified language even if a 
                                      text was recognised otherwise
  --model arg (=%ITSDATA%/%LANG%.blm) model file
  --iterations arg (=50)              number of iterations
  --unknown-pos arg (=ign)            unknown part of speech label
  --cardinal-number-pos arg (=card)   cardinal number part of speech label
  --proper-noun-pos arg (=name)       proper noun part of speech label
  --open-class-labels arg             open class labels
  --train                             training mode
  --save-text-model-files             saves text model files in training model

tokenizers

detok

A simple detokenizer, usually used at the end of the translation process. It composes the final text from the token edges of a given language. Tokens are put in the order induced by the values of the SurfacePosition attribute (not their original order in the lattice!). Tokens are usually joined with spaces, except around some punctuation marks (e.g. no space is inserted before a comma or after an opening bracket).
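
The joining rule can be sketched as follows; the (position, text) pair shape is invented for illustration (only the SurfacePosition attribute name comes from the description above):

```python
# Punctuation that suppresses the default space on one side.
NO_SPACE_BEFORE = {",", ".", "!", "?", ";", ":", ")"}
NO_SPACE_AFTER = {"("}

def detokenize(tokens: list[tuple[int, str]]) -> str:
    """tokens: (surface_position, text) pairs, possibly out of order.
    Sort by position, then join with spaces except around punctuation."""
    out = ""
    prev = None
    for _, tok in sorted(tokens):
        if prev is not None and tok not in NO_SPACE_BEFORE and prev not in NO_SPACE_AFTER:
            out += " "
        out += tok
        prev = tok
    return out
```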

Options

Allowed options:
  --lang arg (=guess)   language
  --force-language      force using specified language even if a text was 
                        recognised otherwise

tp-tokenizer

Splits texts into tokens (i.e. word-like units) according to SRX rules. Each rule specifies a regular expression and a token type that will be assigned to a sequence of characters matching the given regexp.

By default, tokenization rules from the Translatica machine translation system are used (for Polish, English, Russian, German, French and Italian). A different SRX file can be specified with the --rules option.

Maximum token length can be set with --token-length-hard-limit and --token-length-soft-limit.
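
The regexp-to-token-type mechanism can be sketched like this; the rule set below is invented for illustration (the real rules come from SRX files) and is far simpler than the shipped ones:

```python
import re

# Hypothetical rule set: each entry maps a regular expression to the token
# type assigned to the matching run of characters.
RULES = [
    (re.compile(r"[A-Za-z]+(?:\.[A-Za-z]+)*\.?"), "T"),  # words, abbreviations
    (re.compile(r"\d+"), "N"),                           # numbers
    (re.compile(r"\s+"), "B"),                           # blanks
    (re.compile(r"[^\w\s]"), "I"),                       # punctuation
]

def tokenize(text: str) -> list[tuple[str, str]]:
    """At each position, apply the first rule whose regexp matches."""
    tokens, pos = [], 0
    while pos < len(text):
        for rx, ttype in RULES:
            m = rx.match(text, pos)
            if m:
                tokens.append((m.group(), ttype))
                pos = m.end()
                break
        else:
            pos += 1  # skip an unmatchable character
    return tokens
```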

Aliases

token-generator, tokenise, tokeniser, tokenize, tokenizer, tp-tokenise, tp-tokeniser, tp-tokenize

Languages

de, en, es, fi, fr, it, pl, ru, se, tr, xx

Examples

tokenize --lang en

Tokenizes an English text.

in:
I saw 134 things, e.g. a mirror and a maze.
out:
I
saw
134
things
,
e.g.
a
mirror
and
a
maze
.

Options

Allowed options:
  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognised otherwise
  --rules arg (=%ITSDATA%/%LANG%/%LANG%.rgx)
                                        rule file
  --mapping arg (=common=%ITSDATA%/common.rgx;abbrev_%LANG%=%ITSDATA%/%LANG%/abbrev.rgx)
                                        mapping between include names and files
  --token-length-hard-limit arg (=1000) maximum length (in bytes, not in 
                                        characters) of a token (if, according 
                                        to rules, a token of a greater length 
                                        would be generated, a token break is 
                                        forced), zero turns the limit off
  --token-length-soft-limit arg (=950)  soft limit on the length (in bytes) of 
                                        a token (token break is forced only on 
                                        spaces), zero turns the limit off

translators

bonsai

Bonsai is a tree-to-string decoder for statistical machine translation. Its input sentences must be parsed before translation. For the predefined translation rules, it is best to use the default options without modification, as the weights have already been optimized.

You can run a Polish-to-English toy translation model using the following pipe:

gobio --lang pl ! bonsai --lang pl --trg-lang en

Options

Allowed options:
  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognised otherwise
  --trg-lang arg                        target language
  --config arg (=%ITSDATA%/%LANG%%TRGLANG%/%LANG%%TRGLANG%.cfg)
                                        Path to configuration
  --rs arg                              Paths to translation rule sets
  --lm arg                              Paths to language models
  --stacksize arg (=20)                 Node translation stack size
  --max_trans arg (=20)                 Maximal number of transformations per 
                                        hyper edge
  --max_hyper arg (=20)                 Maximal number of hyper edges per 
                                        symbol
  --eps arg (=-1)                       Allowed transformation cost factor
  --nbest arg (=1)                      Display n best translations
  --verbose arg (=0)                    Level of verbosity: 0, 1, 2
  --pedantic                            Pedantic cost calculation (for 
                                        debugging)
  --mert                                Output for MERT (combine with nbest)
  --tm_weight arg                       Weights for translation model 
                                        parameters
  --rs_weight arg                       Weights for different translation rules
                                        sets
  --lm_weight arg                       Weights for different language models
  --word_penalty arg                    Weight for word penalty

transferer

A rule-based machine translation system. It uses rules expressed in a special Perl-like imperative programming language dedicated to the manipulation of syntax trees.

Languages

pl

Examples

translate-ples

Translates a Polish text into Spanish (with an alias).

in:
żółte kapelusze
out:
sombreros amarillos
gobio --lang pl ! bilexicon --lang pl --trg-lang en ! transferer --lang pl --trg-lang en

Translates a Polish text into English (full pipeline).

in:
na stadionie miejskim zaśpiewają półprofesjonalni muzycy
out:
the metropolitan stadium semi-professional musicians
translate-plen

Translates a Polish text into English (with an alias).

in:
na stadionie miejskim zaśpiewają półprofesjonalni muzycy
out:
the metropolitan stadium semi-professional musicians
gobio --lang pl ! bilexicon --lang pl --trg-lang es ! transferer --lang pl --trg-lang es

Translates a Polish text into Spanish (full pipeline).

in:
żółte kapelusze
out:
sombreros amarillos

Options

Allowed options:
  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognised otherwise
  --trg-lang arg                        target language
  --rules arg (=%ITSDATA%/%LANG%%TRGLANG%.mti)
                                        rules file

writers

bracketing-writer

Tags the input text with the language units (e.g. parses) marked with various types of "brackets", e.g. with square brackets with the category prepended (NP[AP[very large] house]) or with XML tags (<np><ap>very large</ap> house</np>).
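
The two output styles can be sketched with a toy parse tree; the tuple-based tree shape here is invented for illustration and is not the PSI-Toolkit lattice API:

```python
# A node is either a plain string (a leaf) or (category, children).
def brackets(tree) -> str:
    """Render category-prefixed square brackets: NP[AP[very large] house]."""
    if isinstance(tree, str):
        return tree
    cat, children = tree
    return f"{cat}[{' '.join(brackets(c) for c in children)}]"

def xml(tree) -> str:
    """Render XML-style tags: <np><ap>very large</ap> house</np>."""
    if isinstance(tree, str):
        return tree
    cat, children = tree
    inner = " ".join(xml(c) for c in children)
    return f"<{cat.lower()}>{inner}</{cat.lower()}>"

tree = ("NP", [("AP", ["very", "large"]), "house"])
```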

Options

Allowed options:
  --opening-bracket arg (=[)    the actual format of opening brackets
  --closing-bracket arg (=])    the actual format of closing brackets
  --tag-separator arg (=,)      separates tags
  --show-only-tags arg          limits the tag names that will appear in `%T` 
                                substitutions
  --tags arg                    filters the edges by tags
  --av-pairs-separator arg (=,) separates the attribute-value pairs
  --av-separator arg (==)       separates the attribute and its value
  --show-attributes arg         the attributes to be shown
  --skip-symbol-edges           skip symbol edges
  --with-blank                  do not skip edges with whitespace text
  --no-collapse                 do not collapse duplicate edge labels
  --disambig                    choose only one partition

dot-writer

DOT Writer presents the results as a directed graph, described in the DOT language used by the GraphViz software. The same effect can be achieved with gv-writer --format canon, but dot-writer does not use the GraphViz library to generate the output.

Options for dot-writer are similar to gv-writer's. The main differences:

  • dot-writer has no option for specifying the output format, since it cannot generate output in formats other than DOT;
  • for clarity, dot-writer produces non-aligned, monochromatic output by default; the --align option forces nodes to be aligned left to right, and the --color option turns on edge coloring.
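
The shape of the emitted DOT can be sketched as follows — a minimal illustration assuming edges are given as byte-offset spans with a text and a type, as in the example for this writer (not the actual implementation):

```python
def to_dot(edges: list[tuple[int, int, str, str]], align: bool = False) -> str:
    """Emit DOT: one edge per lattice edge, nodes are byte offsets,
    labels are the edge text (spaces shown as _) plus the edge type."""
    lines = ["digraph G {"]
    if align:
        lines.append("rankdir=LR")  # lay nodes out left to right
    for beg, end, text, tag in edges:
        label = text.replace(" ", "_") or "_"
        lines.append(f'{beg} -> {end} [label="{label} {tag}"]')
    lines.append("}")
    return "\n".join(lines)
```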

Aliases

write-dot

Examples

tokenize --lang en ! dot-writer

Tokenize English text and show the result as a graph described in DOT language.

in:
I've read “Fahrenheit 451”
out:
digraph G {
rankdir=LR
0 -> 4 [label="I've T"]
4 -> 5 [label="_ B"]
5 -> 9 [label="read T"]
9 -> 10 [label="_ B"]
10 -> 13 [label="“ I"]
13 -> 23 [label="Fahrenheit T"]
23 -> 24 [label="_ B"]
24 -> 27 [label="451 X"]
0 -> 30 [label="I've_read_“Fahrenheit_451” FRAG"]
0 -> 30 [label="I've_read_“Fahrenheit_451” TEXT"]
27 -> 30 [label="” I"]
}

Options

Allowed options:
  --align               force aligning nodes left to right
  --color               assign different colors to edges with different tags
  --show-symbol-edges   show symbol edges
  --show-tags           print edges' layer tags
  --tags arg            filter edges by specified tags
  --tree                show dependencies between edges instead of the content 
                        of the lattice

gv-writer

GV Writer presents the results in a simple graphical form. For this purpose, it uses the GraphViz library, a library for creating graphs (hence the GV acronym).

If gv-writer is not available, you can obtain the graph representation of the lattice in DOT format using dot-writer.

Aliases

chart-writer, draw, graph, graph-writer, write-chart, write-graph

Options

Allowed options:
  --disambig            choose only one partition
  --format arg (=svg)   choose output format
  --no-align            allow nodes to be not aligned left to right
  --no-color            make output monochromatic (black on white)
  --show-tags           print edges' layer tags
  --show-symbol-edges   show symbol edges
  --tags arg            show only edges tagged with specified tags
  --tree                show dependencies between edges instead of the content 
                        of the lattice

json-simple-writer

Writes output of a pipe in JSON format. See simple-writer for details.

Aliases

write-simple-json

Examples

tokenize --lang en ! json-simple-writer

Simple JSON output for tokenized text.

in:
I enjoy travelling.
out:
["I","enjoy","travelling","."]
tokenize --lang en ! segmenter ! json-simple-writer --tags token --spec segment

Simple JSON output for tokenized and segmented text, with both segments and tokens.

in:
I've been to many countries, e.g. Germany, France, Canada. I enjoy travelling.
out:
[["I've","been","to","many","countries",",","e.g.","Germany",",","France",",","Canada","."],["I","enjoy","travelling","."]]

Options

Allowed options:
  --fallback-tags arg   tags that should be printed out if basic tags not found
  --linear              skip cross-edges
  --no-alts             skip alternative edges
  --with-blank          do not skip edges with whitespace text
  --tags arg (=token)   basic tag or tags separated by commas (conjunction) or 
                        semicolons (alternative)
  --spec arg            specification of higher-order tags
  --with-args           if set, returns text with annotation as a hash element

perl-simple-writer

Creates a Perl array. Used in the Perl bindings.

Options

Allowed options:
  --fallback-tags arg   tags that should be printed out if basic tags not found
  --linear              skips cross-edges
  --no-alts             skips alternative edges
  --with-blank          does not skip edges with whitespace text
  --tags arg (=token)   basic tag or tags separated by commas (conjunction) or 
                        semicolons (alternative)
  --spec arg            specification of higher-order tags
  --with-args           if set, then returns text with annotation as a hash 
                        element

psi-writer

PSI Writer prints the content of the lattice in PSI format.

By default, the first line contains the column description (as a comment). It can be skipped with the --no-header option.

Aliases

write-lattice, write-psi

Examples

txt-reader ! psi-writer

Read text into lattice and write its contents in PSI format.

in:
Lorem ipsum dolor sit amet
consectetur
adipiscing elit.
out:
## beg. len. text                       tags            annot.text                 annotations
01 0000 26   Lorem_ipsum_dolor_sit_amet frag,txt-reader Lorem_ipsum_dolor_sit_amet FRAG[]
02 0026 01   \n                         ∅               ∅                          ∅
03 0027 11   consectetur                frag,txt-reader consectetur                FRAG[]
04 0038 01   \n                         ∅               ∅                          ∅
05 0039 16   adipiscing_elit.           frag,txt-reader adipiscing_elit.           FRAG[]

Options

Allowed options:
  --no-header           do not print the column description

simple-writer

Simple Writer prints the content of the lattice in a simple, human-readable way. By default, it writes token segments separated by newline symbols, skipping blank segments.
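
The relationship between the basic tag, higher-order tags (--spec) and their separators can be sketched as nested joins; the list-of-lists data shape below is invented for illustration (segments contain tokens, tokens contain symbols, as in the first example for this writer):

```python
def render(segments, sym_sep="/", tok_sep="||", seg_sep="\n"):
    """Join each level of units with its own separator:
    symbols with sym_sep, tokens with tok_sep, segments with seg_sep."""
    return seg_sep.join(
        tok_sep.join(sym_sep.join(token) for token in segment)
        for segment in segments
    )
```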

Aliases

write, write-simple

Examples

segment ! tokenize --lang en ! simple-writer --tags symbol --sep / --spec token || segment \\n

Write text's symbols, tokens and segments separated by slashes, double vertical bars and newlines respectively.

in:
I am. You are. We are.
out:
I||a/m||.
||Y/o/u||a/r/e||.
||W/e||a/r/e||.
spell-check --lang en ! simple-writer --alt-sep /

Print spelling correction suggestions separated by /.

in:
Paast Perphect Continous
out:
Paast/Past/Pasta/Paste/Pasty/Psst/Pluperfect/Postponed/Stupefied/Postmarked/Postcode
Perphect/Perfect/Perfecta/Prefect/Perfects/Perfecter
Continous/Continuous/Continues/Contains/Continua/Contiguous
tokenize --lang en ! simple-writer

Tokenize English text.

in:
It's his 15th birthday.
out:
It's
his
15th
birthday
.

Options

Allowed options:
  --alt-sep arg (=|)    alternative edges separator
  --fallback-tags arg   tags that should be printed out if basic tags not found
  --linear              skip cross-edges
  --no-alts             skip alternative edges
  --with-blank          do not skip edges with whitespace text
  --sep arg (=\n)       basic tag separator (default: newline)
  --spec arg            specification of higher-order tags and their separators
  --tags arg (=token)   basic tag or tags separated by commas (conjunction) or 
                        semicolons (alternative)

Other help resources