Taking a look at (some) tokenizers

submitted by

Style Pass

2024-06-18 11:30:05

Recently I have been working on writing a tokenizer from scratch in Rust. In the process, I wanted to really understand the implementation of some commonly used tokenizers. Fun, I know.

If you already know Moses, you know. Silent nod. For everyone else, Moses is an NLP framework written in Perl, focused on statistical machine translation. It is super easy to install anywhere and everyone loves it because of it. Aside from language models and statistical machine translation utilities, it also includes tools to clean text, namely punctuation normalization, tokenization, and corpus cleaning by limiting sentence length. Of these, I am concerned with the normalize-punctuation script and the tokenizer itself, because I don’t remember ever using the tokenizer without punctuation normalization.

The normalize-punctuation script (normalize-punctuation.perl) removes carriage return characters (\r) and normalizes spacing around parentheses and punctuation marks. After that, it normalizes the punctuation marks themselves (e.g. different types of quotation marks, hyphens and pseudo-spaces). There are a few special cases, like « » being the de facto quotes for French, as indicated in the code, which are normalized to " ". This is also the case for Spanish, where English quotation marks are only supposed to be used in nested quotations («Gritó "NOOOOOOO" cuando le pidieron instalar Moses», dijo su antiguo supervisor).
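To make those steps concrete, here is a minimal Python sketch of the same idea. It is not a port of normalize-punctuation.perl: the function name `normalize_punctuation`, the `CHAR_MAP` table and the exact regexes are my own simplifications, and the real script has many more rules plus language-specific branches.

```python
import re

# Illustrative subset of character replacements (an assumption, not the
# full Moses rule set). Unicode escapes are used for the dashes and the
# non-breaking space so the table is unambiguous to read.
CHAR_MAP = {
    "«": '"', "»": '"',            # French/Spanish guillemets
    "“": '"', "”": '"', "„": '"',  # curly and low double quotes
    "‘": "'", "’": "'",            # curly single quotes
    "\u2013": "-",                 # en dash
    "\u2014": " - ",               # em dash
    "\u00a0": " ",                 # non-breaking space (a "pseudo-space")
}


def normalize_punctuation(text: str) -> str:
    # 1. Drop carriage returns.
    text = text.replace("\r", "")

    # 2. Normalize spacing around parentheses: one space outside, none inside.
    text = re.sub(r"\(", " (", text)
    text = re.sub(r"\)", ") ", text)
    text = re.sub(r"\( +", "(", text)
    text = re.sub(r" +\)", ")", text)
    # No space between a closing parenthesis and following punctuation.
    text = re.sub(r"\) +([.!:?;,])", r")\1", text)

    # 3. Map typographic variants to plain ASCII punctuation.
    for src, dst in CHAR_MAP.items():
        text = text.replace(src, dst)

    # 4. Collapse runs of spaces introduced by the steps above.
    text = re.sub(r" +", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(normalize_punctuation("«Hola»\r"))
    print(normalize_punctuation("foo(bar)baz"))
```

Note that mapping « » and nested "…" to the same character loses the nesting information in the Spanish example above, which is exactly the kind of trade-off a normalizer makes before tokenization.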
