tp-tokenizer

Splits text into tokens (i.e. word-like units) according to SRX rules. Each rule specifies a regular expression and the token type that will be assigned to any sequence of characters matching that expression.
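
For illustration only, the general mechanism can be sketched in a few lines of Python (the rule list below is invented for this example and is not the actual rule set or file format used by tp-tokenizer): the rules are tried in order at every position, and the first matching regular expression determines both where the token ends and which type it gets.

import re

# Invented rules, for illustration only: each rule pairs a token type with a
# regular expression; tp-tokenizer itself reads its rules from an SRX file.
RULES = [
    ("abbrev", re.compile(r"e\.g\.|i\.e\.|etc\.")),
    ("number", re.compile(r"\d+")),
    ("word",   re.compile(r"\w+")),
    ("punct",  re.compile(r"[^\w\s]")),
]

def tokenize(text):
    # Try the rules in order at each position; the first regular expression
    # that matches decides both the token's extent and its type.
    pos = 0
    while pos < len(text):
        if text[pos].isspace():          # whitespace separates tokens
            pos += 1
            continue
        for token_type, pattern in RULES:
            match = pattern.match(text, pos)
            if match:
                yield token_type, match.group(0)
                pos = match.end()
                break
        else:                            # no rule matched: emit one character
            yield "unknown", text[pos]
            pos += 1

print([t for _, t in tokenize("I saw 134 things, e.g. a mirror and a maze.")])
# -> ['I', 'saw', '134', 'things', ',', 'e.g.', 'a', 'mirror', 'and', 'a', 'maze', '.']

Placing the more specific abbreviation rule before the general word and punctuation rules is what keeps "e.g." together as a single token, as in the example further down.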

By default, the tokenization rules of the Translatica machine translation system are used (for Polish, English, Russian, German, French and Italian). A different SRX file can be specified with the --rules option.

The maximum token length can be set with the --token-length-hard-limit and --token-length-soft-limit options.
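
As the Options section below spells out, the hard limit forces a token break wherever the limit would otherwise be exceeded, while the soft limit forces a break only on spaces. The following Python sketch illustrates that distinction under simplifying assumptions (it is not the actual implementation): lengths are counted in bytes of the UTF-8 encoding, because both options measure length in bytes, and a value of 0 disables the corresponding limit.

def enforce_length_limits(token, soft_limit=950, hard_limit=1000):
    # Illustrative sketch only: split an overlong candidate token into pieces.
    pieces = []
    current = b""
    for ch in token:
        encoded = ch.encode("utf-8")
        if hard_limit and len(current) + len(encoded) > hard_limit:
            # Hard limit: force a break even in the middle of a word.
            pieces.append(current.decode("utf-8"))
            current = b""
        elif soft_limit and len(current) >= soft_limit and ch == " ":
            # Soft limit: prefer to break on a space once past the threshold.
            pieces.append(current.decode("utf-8"))
            current = b""
            continue                     # the separating space is dropped
        current += encoded
    if current:
        pieces.append(current.decode("utf-8"))
    return pieces

print(enforce_length_limits("abc def ghi", soft_limit=3, hard_limit=6))
# -> ['abc', 'def', 'ghi']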

Aliases

token-generator, tokenise, tokeniser, tokenize, tokenizer, tp-tokenise, tp-tokeniser, tp-tokenize

Languages

de, en, es, fi, fr, it, pl, ru, se, tr, xx

Examples

tokenize --lang en

Tokenizes an English text.

in:
I saw 134 things, e.g. a mirror and a maze.
out:
I
saw
134
things
,
e.g.
a
mirror
and
a
maze
.

Options

Allowed options:
  --lang arg (=guess)                   language
  --force-language                      force using the specified language even
                                        if the text was recognised otherwise
  --rules arg (=%ITSDATA%/%LANG%/%LANG%.rgx)
                                        rule file
  --mapping arg (=common=%ITSDATA%/common.rgx;abbrev_%LANG%=%ITSDATA%/%LANG%/abbrev.rgx)
                                        mapping between include names and files
  --token-length-hard-limit arg (=1000) maximum length (in bytes, not in 
                                        characters) of a token (if, according 
                                        to rules, a token of a greater length 
                                        would be generated, a token break is 
                                        forced), zero turns the limit off
  --token-length-soft-limit arg (=950)  soft limit on the length (in bytes) of 
                                        a token (token break is forced only on 
                                        spaces), zero turns the limit off