We’re excited to announce the release of the `pg_tiktoken` extension on Neon. This new Postgres extension provides fast and efficient tokenization using the BPE (Byte Pair Encoding) algorithm. `pg_tiktoken` is a wrapper around tiktoken, OpenAI’s tokenizer, which is known for its speed.
`pg_tiktoken` lets you tokenize text data directly within a Postgres database. The `tiktoken_encode` function takes a text input and returns its tokens, making it easier to analyze and process text data for various applications. The `tiktoken_count` function returns the number of tokens in a text, which is useful for checking text against length limits, like those imposed by OpenAI’s language models.
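
As a minimal sketch of how the two functions might be called, the queries below assume that the extension is available on your Postgres instance and that each function takes a tiktoken encoding name (here `cl100k_base`) as its first argument and the text as its second. The sample sentence is only an illustration.

```sql
-- Enable the extension (assumes it is installed on the server).
CREATE EXTENSION IF NOT EXISTS pg_tiktoken;

-- Tokenize a piece of text.
-- Assumption: the first argument names the tiktoken encoding to use.
SELECT tiktoken_encode('cl100k_base', 'A long time ago in a galaxy far, far away');

-- Count the tokens in the same text, for example to check it
-- against a model's length limit before sending it.
SELECT tiktoken_count('cl100k_base', 'A long time ago in a galaxy far, far away');
```

The count depends on the encoding, so use the encoding that matches the model you plan to call.
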
Language models process text in chunks known as tokens. In English, a token can be as short as a single character or as long as a complete word, such as “a” or “apple.” In certain languages, tokens can even be shorter than one character or longer than one word.