Specification for annotator > tokenizer > tp-tokenizer
Splits texts into tokens (i.e. word-like units) according to SRX rules. Each rule specifies a regular expression and a token type that will be assigned to a sequence of characters matching the given regexp.
By default, tokenization rules from the Translatica machine translation system are used (for Polish, English, Russian, German, French and Italian). Another SRX rule file can be specified with the --rules option. The maximum token length can be set with the --token-length-hard-limit and --token-length-soft-limit options (see Allowed options below).
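For instance, a custom rule file and a stricter length limit could be combined as follows (my_rules.rgx is a hypothetical file name; both options are described under Allowed options below):
tokenize --lang pl --rules my_rules.rgx --token-length-hard-limit 500
Tokenizes a Polish text with the rules from my_rules.rgx, forcing a token break whenever a token would exceed 500 bytes.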
Aliases: token-generator, tokenise, tokeniser, tokenize, tokenizer, tp-tokenise, tp-tokeniser, tp-tokenize
Languages: de, en, es, fi, fr, it, pl, ru, se, tr, xx
tokenize --lang en
Tokenizes an English text.
Input:  I saw 134 things, e.g. a mirror and a maze.
Output: I saw 134 things , e.g. a mirror and a maze .
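Since the default for --lang is guess, the language is normally detected automatically; to override the detection, --force-language can presumably be combined with an explicit --lang value, e.g.:
tokenize --lang pl --force-language
Applies the Polish tokenization rules even if the text is recognised as being in another language.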
Allowed options:
--lang arg (=guess)  language
--force-language  force using the specified language even if the text was recognised as another language
--rules arg (=%ITSDATA%/%LANG%/%LANG%.rgx)  rule file
--mapping arg (=common=%ITSDATA%/common.rgx;abbrev_%LANG%=%ITSDATA%/%LANG%/abbrev.rgx)  mapping between include names and files
--token-length-hard-limit arg (=1000)  maximum length (in bytes, not characters) of a token; if, according to the rules, a longer token would be generated, a token break is forced; zero turns the limit off
--token-length-soft-limit arg (=950)  soft limit on the length (in bytes) of a token; a token break is forced only on spaces; zero turns the limit off
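As a sketch of how the two length limits interact (the values are arbitrary and chosen only for illustration):
tokenize --lang en --token-length-soft-limit 200 --token-length-hard-limit 300
Breaks overlong tokens at spaces once they exceed 200 bytes, and forces a break, even mid-word, at 300 bytes.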