Tutorial

What is PSI-Toolkit?

PSI-Toolkit is an open-source set of linguistic tools for processing text written in various formats and languages. In particular, it deals with:

  • text pre-processing
  • sentence splitting
  • tokenization
  • lexical and morphological analysis
  • syntactic/semantic parsing
  • machine translation

The PSI tools mainly, but not exclusively, process texts written in Polish. Other currently supported languages are English, French, Turkish, German, Russian, Italian and Finnish.

How to use the PSI-Toolkit?

The easiest way is to use our web service, PSI-Server. You may also download the Linux packages, install them on your computer and use them locally with the command-line program psi-pipe.

The installation process is described in the Installation guide. Note that after installing PSI-Toolkit from the precompiled packages you will also get your own instance of PSI-Server (identical to this one). You can launch the server locally by simply typing the psi-server command. The client will then be available at http://localhost:3000 in your web browser.

Usage basics

A typical use case consists of supplying text for processing and specifying what should be done with it, e.g. tokenization or lemmatization. The text is first read, then processed (annotated) and finally displayed or written to a file. A typical PSI command is therefore a sequence of three tasks, read, annotate and write, separated by the ! character (note that the spaces around the ! character are obligatory):

a reader ! a sequence of processors ! a writer

The list of all available readers, processors and writers is given in Documentation. You may find example command pipelines in Getting started.
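
As a minimal sketch, a command of the reader ! processors ! writer shape can be assembled programmatically. The helper name build_pipeline is hypothetical and not part of PSI-Toolkit; it only joins task names with " ! ", which preserves the obligatory spaces. The task names in the usage line are the ones used in this tutorial's examples.

```python
def build_pipeline(reader, processors, writer):
    """Join a reader, a list of processors and a writer into one PSI command.

    Hypothetical helper, not part of PSI-Toolkit. Tasks are separated by
    " ! " because the spaces around the ! character are obligatory.
    """
    tasks = [reader, *processors, writer]
    return " ! ".join(task.strip() for task in tasks)


command = build_pipeline("read-text", ["tokenize"], "write-simple")
print(command)
```

The function does not check that the order of tasks is logically correct; that remains the caller's responsibility.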

Example

Suppose you need to tokenize the sentence Piękna gra trwała długo. It can be done with the command:

read-text ! tokenize ! write-simple

The sequence of tasks (processors) determines the order of text processing and should be logically correct (reading at the beginning, writing at the end, lemmatization placed after tokenization, etc.).

The expected result is:

Piękna
gra
trwała
długo
.

To lemmatize the same sentence and display all lemmas for each word, the following pipeline is used:

read-text ! tokenize ! lemmatize ! write-simple --tags lemma

The result is:

piękno|piękny
gra|grać
trwać|trwała|trwały
długo

Note that in order to display information about lemmas you need to pass the --tags lemma switch to the write-simple command.
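
If you want to post-process this output in a script, the sketch below splits it into lists of candidate lemmas. It assumes, based only on the sample output above, that each line corresponds to one word and that alternatives are separated by the | character; the function name parse_lemma_output is hypothetical.

```python
def parse_lemma_output(output):
    """Turn lemma output into a list of candidate-lemma lists.

    Assumption (taken from the sample output only): one word per line,
    alternatives separated by "|". Blank lines are skipped.
    """
    return [line.strip().split("|") for line in output.splitlines() if line.strip()]


sample = "piękno|piękny\ngra|grać\ntrwać|trwała|trwały\ndługo\n"
for lemmas in parse_lemma_output(sample):
    print(lemmas)
```

A word whose lemma itself contains a | character would be split incorrectly by this sketch.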

Processor switches

As shown in the previous example, the behaviour of a PSI processor may be customized by means of switches (options). Most options require a value, as in the case of the previously mentioned --tags option (whose value was lemma).

You may find all available options for each processor in Documentation.

Auto completion

You do not actually have to specify all processors in your command. If you "forget" a processor, the pipeline is supplemented with the default one. For example, the simple command:

tokenize

returns exactly the same results as the more explicit one:

read-text ! tokenize ! write-simple

as the missing tasks are filled in automatically with default processors. Likewise, the following two pipelines produce the same output:

  • read-text ! tokenize ! lemmatize ! write-simple --tags lemma
  • lemmatize ! write-simple --tags lemma (a default reader and a default tokenizer are added automatically).
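
The effect of auto-completion on these examples can be mimicked with a toy model. This is only an illustration and not PSI-Toolkit's actual completion logic: the sets READERS, WRITERS and NEEDS_TOKENS and the function autocomplete are assumptions that cover just the task names used in this tutorial.

```python
# Toy model only: these sets are assumptions drawn from this tutorial's
# examples, not PSI-Toolkit's real auto-completion rules.
READERS = {"read-text"}
WRITERS = {"write-simple"}
NEEDS_TOKENS = {"lemmatize"}


def autocomplete(command):
    """Add a default reader, tokenizer and writer where they are missing."""
    tasks = [t.strip() for t in command.split(" ! ") if t.strip()]
    names = [t.split()[0] for t in tasks]  # task name without its options

    if not names or names[0] not in READERS:
        tasks.insert(0, "read-text")
        names.insert(0, "read-text")
    if names[-1] not in WRITERS:
        tasks.append("write-simple")
        names.append("write-simple")

    # Insert a tokenizer before the first task that needs tokens.
    for i, name in enumerate(names):
        if name in NEEDS_TOKENS and "tokenize" not in names[:i]:
            tasks.insert(i, "tokenize")
            break
    return " ! ".join(tasks)


print(autocomplete("tokenize"))
print(autocomplete("lemmatize ! write-simple --tags lemma"))
```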

Aliases

Most of the tasks can be expressed in a few equivalent ways:

  • as a verb (e.g. tokenize)
  • as a noun (e.g. tokenizer)
  • as a name of a specific tool (e.g. tp-tokenizer).

Alternative names for processors are called aliases. Aliases for each processor are listed in Documentation and in List of aliases.

For example, the following pipelines are equivalent:

  • segment ! write --tags segment
  • srx-segmenter ! simple-writer --tags segment
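
To illustrate how aliases relate to tool names, the sketch below rewrites a pipeline so that every task uses its tool name. The ALIASES table is an assumption built only from the examples in this section (it is not the full List of aliases), and normalize is a hypothetical helper.

```python
# Alias -> tool name, inferred from the examples in this section only.
ALIASES = {
    "tokenize": "tp-tokenizer",
    "tokenizer": "tp-tokenizer",
    "segment": "srx-segmenter",
    "write": "simple-writer",
}


def normalize(command):
    """Replace the leading alias of each task with its tool name.

    Only the first word of a task is looked up, so option values such as
    the "segment" in "--tags segment" are left untouched.
    """
    tasks = []
    for task in command.split(" ! "):
        parts = task.strip().split(None, 1)
        if not parts:
            continue
        parts[0] = ALIASES.get(parts[0], parts[0])
        tasks.append(" ".join(parts))
    return " ! ".join(tasks)


print(normalize("segment ! write --tags segment"))
```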

More examples

You may try more examples of processing pipelines with the random example button located under the command input field on the main PSI-Server page. The same examples can be found in Documentation, along with descriptions of each processor.

Other features

PSI-readers

Readers take input either from the keyboard or from a file. The respective PSI-Toolkit readers handle various formats: txt, html, doc, docx, xlsx, pptx and pdf; for example, pdf-reader reads pdf files. You may also try guessing-reader to recognize the input file format automatically.

Working with PSI-lattice

Processed data are stored in an internal data structure called a PSI-lattice. To see what a PSI-lattice looks like, try this command (use a short input text, otherwise the structure becomes too big):

tokenize ! graph --format png

You can also display a PSI-lattice in a textual format with the psi-writer tool (alias: write-psi):

tokenize ! write-psi

For the input Piękna gra trwała długo. it returns:

## beg. len.  text         tags                  annot.text  annotations
01 0000 07    Piękna       !pl,token             Piękna      T
02 0007 01    _            !pl,token             _           B
03 0008 03    gra          !pl,token             gra         T
04 0011 01    _            !pl,token             _           B
05 0012 07    trwała       !pl,token             trwała      T
06 0019 01    _            !pl,token             _           B
07 0020 06    długo        !pl,token             długo       T
08 0000 27    Piękna_gra_trwała_długo. frag,txt-reader Piękna_gra_trwała_długo. FRAG[]
09 0000 27    Piękna...go. !pl,lang-guesser,text Piękna_gra_trwała_długo. TEXT[8]
10 0026 01    .            !pl,token             .           I

For a detailed description of the PSI format, see PSI-format.
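
The listing above can be read back into a script with a few lines of code. The sketch below is not an official parser: it assumes that every row has the seven whitespace-separated columns seen in this sample and that spaces in the text are shown as _. The names Edge and parse_psi_lines are hypothetical. In this sample the beg. and len. columns match UTF-8 byte offsets into the input, which the last lines check; that is an observation about this output, not a documented guarantee.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    number: int
    begin: int
    length: int
    text: str
    tags: List[str]
    annot_text: str
    annotations: str


def parse_psi_lines(output):
    """Parse rows of the textual lattice listing; the ## header is skipped."""
    edges = []
    for line in output.splitlines():
        if not line.strip() or line.startswith("##"):
            continue
        # Assumes exactly the seven columns shown in the tutorial's sample.
        number, begin, length, text, tags, annot_text, annotations = line.split(None, 6)
        edges.append(Edge(int(number), int(begin), int(length), text,
                          tags.split(","), annot_text, annotations.strip()))
    return edges


sample = """\
## beg. len.  text         tags                  annot.text  annotations
01 0000 07    Piękna       !pl,token             Piękna      T
02 0007 01    _            !pl,token             _           B
03 0008 03    gra          !pl,token             gra         T
05 0012 07    trwała       !pl,token             trwała      T
07 0020 06    długo        !pl,token             długo       T
10 0026 01    .            !pl,token             .           I
"""

edges = parse_psi_lines(sample)
raw = "Piękna gra trwała długo.".encode("utf-8")
for edge in edges:
    # Slice the input by byte offsets and compare with the text column.
    piece = raw[edge.begin:edge.begin + edge.length].decode("utf-8")
    print(edge.number, edge.text, piece.replace(" ", "_") == edge.text)
```

A row with fewer than seven columns raises ValueError in this sketch, so real output may need more careful handling.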

Remember that you can filter the display of the PSI-lattice with the --tags option of simple-writer. Here are two examples:

  • tokenize ! write-simple --tags token
  • lemmatize ! write-simple --tags lemma

Command-line usage

After successful installation of the PSI-Toolkit package, you may use a command-line program called psi-pipe.

You can pass any command pipeline as arguments to the psi-pipe program. The input text is read from standard input; one way to supply it is to use the echo command and pipe its output into psi-pipe with the vertical bar |:

$ echo "Piękna gra trwała długo." | psi-pipe lemmatize ! write-simple --tags lemma

If you want to pass a file instead of raw text, you may use the cat command:

$ cat ../path-to-your/file.docx | psi-pipe lemmatize ! write-simple --tags lemma
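
The same calls can be made from a script. The sketch below wraps psi-pipe with Python's standard subprocess module; build_argv and run_pipeline are hypothetical helper names, and decoding the output as UTF-8 is an assumption. The pipeline is passed as separate arguments, just as the shell does in the commands above, and the input goes to standard input.

```python
import shlex
import shutil
import subprocess


def build_argv(pipeline):
    """Split a pipeline string into the argument list for psi-pipe."""
    return ["psi-pipe", *shlex.split(pipeline)]


def run_pipeline(pipeline, data):
    """Send bytes to psi-pipe on standard input and return its output.

    Assumes psi-pipe is installed and that its output is UTF-8 text.
    """
    result = subprocess.run(build_argv(pipeline), input=data,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8")


# Runs only when psi-pipe is actually available on this machine.
if shutil.which("psi-pipe"):
    text = "Piękna gra trwała długo.\n".encode("utf-8")
    print(run_pipeline("lemmatize ! write-simple --tags lemma", text))
```

To process a file instead of raw text, read it in binary mode (for example with open(path, "rb").read()) and pass the resulting bytes as data.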

Other help resources