Specification for reader > apertium-reader


Apertium-reader reads text in various markup formats: HTML documents, RTF files, OpenOffice Writer (odt) files, and the Microsoft Office 2007 formats docx, xlsx and pptx. The default format for apertium-reader is html. To read text from doc files, use doc-reader.

Apertium-reader uses format handling rules specified in XML and based on regular expressions. The current rule files come from the Apertium platform. It is possible to write new handling rules for any XML format, see

Note about license: the XML rule files come from the Apertium platform and are licensed under the GNU General Public License.
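
The idea of regular-expression-based format handling can be illustrated with a minimal sketch. This is not the Apertium rule format or the apertium-reader implementation: the two patterns and the function name extract_fragments are hypothetical and only show how regex rules can separate markup from text fragments.

```python
import re
from typing import List

# Hypothetical rules for illustration only; the real rules are defined
# in the Apertium XML specification files, not in Python.
BLOCK_TAG = re.compile(r"</?(?:p|div|br|h[1-6]|li|tr)\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def extract_fragments(markup: str) -> List[str]:
    """Return the text fragments of a markup string, one per block element."""
    # Block-level tags end a fragment, so turn them into line breaks.
    text = BLOCK_TAG.sub("\n", markup)
    # Every remaining (inline) tag is simply dropped.
    text = ANY_TAG.sub("", text)
    # Normalize whitespace and skip empty lines.
    return [" ".join(line.split()) for line in text.split("\n") if line.strip()]


html = "<html><body><p>Text in <b>first</b> paragraph.</p><p>Second paragraph.</p></body></html>"
for fragment in extract_fragments(html):
    print(fragment)
```

Block-level tags become fragment boundaries while inline tags are removed, which is roughly the distinction a deformatter has to draw when it outputs only text fragments.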


apertium-reader --format rtf ! simple-writer --tags frag

Reads an RTF file with the --format rtf option and writes only text fragments using simple-writer.

Text in first paragraph.
Second paragraph.

apertium-reader ! simple-writer --tags frag

Reads an HTML file and outputs only the text content.

Text in first paragraph.
Second paragraph.

apertium-reader --format docx ! simple-writer --tags frag

Reads a DOCX file with the --format docx option and writes only text fragments.

Przykładowy nagłówek.
Przykładowy tekst pierwszego akapitu.
Tekst w drugim akapicie.


Allowed options:
  --format arg (=html)     type of file for deformatting
  --specification-file arg specification file path
  --unzip-data arg (=1)    unzip compressed file formats like .pptx or .xlsx
  --keep-tags              keep formatting tags
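
The --unzip-data option exists because formats such as docx, xlsx and pptx are ZIP containers that hold XML parts. The sketch below shows what unpacking such a container involves, using Python's standard zipfile module. It is not how apertium-reader is implemented. The function name read_main_part is hypothetical, and the part name word/document.xml is the conventional location of the main document in a DOCX file, assumed here for illustration.

```python
import io
import zipfile


def read_main_part(docx_bytes: bytes, part: str = "word/document.xml") -> str:
    """Return the XML of one part stored inside a DOCX (ZIP) container."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read(part).decode("utf-8")


# Build a tiny stand-in archive in memory so the sketch runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "word/document.xml",
        "<w:document><w:p><w:t>Second paragraph.</w:t></w:p></w:document>",
    )

xml = read_main_part(buffer.getvalue())
print(xml)
```

Only after this unpacking step can XML-based format handling rules be applied to the document content.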

