Taking a look at (some) tokenizers

submitted by
Style Pass
2024-06-18 11:30:05

Recently I have been working on writing a tokenizer from scratch in Rust. In the process, I wanted to really understand the implementation of some commonly used tokenizers. Fun, I know.

If you already know Moses, you know. Silent nod. For everyone else, Moses is an NLP framework focused on statistical machine translation, with a decoder written in C++ and a large collection of supporting scripts written in Perl. It is super easy to install anywhere, and everyone loves it for that. Aside from language models and statistical machine translation utilities, it also includes tools to clean text: punctuation normalization, tokenization, and corpus cleaning by limiting sentence length. Of these, I am concerned with the normalize-punctuation script and the tokenizer itself, because I don’t remember ever using the tokenizer without punctuation normalization.

The normalize-punctuation script (normalize-punctuation.perl) removes carriage return characters (\r) and normalizes spacing around parentheses and punctuation marks. After that, it normalizes the punctuation marks themselves (e.g. different types of quotation marks, hyphens and pseudo-spaces). There are a few special cases, like « » being the de facto quotes for French, as indicated in the code, which are normalized to " ". This is also the case for Spanish, where English quotation marks are only supposed to be used in nested quotations («Gritó "NOOOOOOO" cuando le pidieron instalar Moses», dijo su antiguo supervisor).
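To make that description a bit more concrete, here is a minimal sketch in Rust of what such a normalization pass could look like. It is not a port of normalize-punctuation.perl: the function name `normalize_punctuation` and the exact set of characters handled are my own assumptions, and the real script has many more rules (including language-specific ones). It only drops carriage returns, maps a handful of quotation marks, dashes and pseudo-spaces to ASCII, and collapses repeated spaces.

```rust
/// Simplified, illustrative punctuation normalization.
/// The character set below is an assumption made for this sketch and is
/// much smaller than the rule set in Moses' normalize-punctuation.perl.
fn normalize_punctuation(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        let mapped = match ch {
            // Drop carriage returns entirely.
            '\r' => continue,
            // Guillemets and curly/low double quotes become a plain ASCII quote.
            '\u{00AB}' | '\u{00BB}' | '\u{201C}' | '\u{201D}' | '\u{201E}' => '"',
            // Curly/low single quotes become an ASCII apostrophe.
            '\u{2018}' | '\u{2019}' | '\u{201A}' => '\'',
            // En dash and em dash become an ASCII hyphen.
            '\u{2013}' | '\u{2014}' => '-',
            // Pseudo-spaces (no-break, thin, narrow no-break) become a regular space.
            '\u{00A0}' | '\u{2009}' | '\u{202F}' => ' ',
            other => other,
        };
        // Collapse runs of spaces into a single space.
        if mapped == ' ' && out.ends_with(' ') {
            continue;
        }
        out.push(mapped);
    }
    out
}
```

Calling `normalize_punctuation("«Hola»\r\n")` returns `"Hola"` wrapped in plain ASCII double quotes and followed by a newline. A single pass over the characters keeps the sketch free of dependencies; a closer port of the Perl script would more likely be an ordered list of regex substitutions.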
