
submitted by
Style Pass
2024-10-18 08:30:04

Use ugrapheme to make your Python and Cython code see strings as a sequence of grapheme characters, so that the length of 👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi is 4 instead of 13.
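The 13 is the number of Unicode code points, which is what Python's built-in len() counts. The sketch below rebuilds the example from escapes so the invisible code points show up in the source. It assumes the example string is a woman scientist with a medium skin tone, the flag of Scotland, and "Hi". The ugrapheme part is guarded because the package is third-party, and the use of its graphemes class here is an assumption about how it is typically called, not a full account of its API.

```python
# Woman + skin-tone modifier + zero-width joiner + microscope: 4 code points.
scientist = "\U0001F469\U0001F3FD\u200D\U0001F52C"

# Black flag + tag letters g, b, s, c, t + cancel tag: 7 code points.
flag = "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F"

text = scientist + flag + "Hi"

# Built-in len() counts code points, not what a reader sees as characters.
code_points = len(text)
print(code_points)  # 13

# Assumption: ugrapheme is installed and exports a `graphemes` class.
try:
    from ugrapheme import graphemes
    grapheme_count = len(graphemes(text))
except ImportError:
    grapheme_count = None  # package not available; skip the comparison

print(grapheme_count)
```

With ugrapheme installed, the second print should show 4: the scientist, the flag, "H", and "i".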

Trivial operations like reversing a string, getting the first and last character, etc. become easy not just for Latin and Emojis, but Devanagari, Hangul, Tamil, Bengali, Arabic, etc. Centering and justifying Emojis and non-Latin text in terminal output becomes easy again, as ugrapheme uses uwcwidth under the hood.
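To see why reversing is not trivial on plain strings, here is a deliberately simplified sketch using only the standard library. The helper simple_clusters is hypothetical and is not how ugrapheme works: it only attaches combining marks to the preceding character, so it ignores the UAX #29 rules for emoji sequences, Hangul, Devanagari spacing marks, and so on. It is just enough to show the difference between reversing code points and reversing grapheme-like units.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    A hypothetical, simplified stand-in for real grapheme segmentation.
    """
    clusters = []
    for ch in text:
        # A nonzero canonical combining class marks a combining character.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "cafe\u0301"  # "café" written as 'e' followed by a combining acute accent

# Slicing reverses code points, so the accent ends up in front of the 'e'.
naive_reversed = word[::-1]

# Reversing whole clusters keeps the accent attached to its base letter.
cluster_reversed = "".join(reversed(simple_clusters(word)))

print(len(word), len(simple_clusters(word)))  # 5 4
print(naive_reversed == "\u0301efac")         # True
print(cluster_reversed == "e\u0301fac")       # True
```

A full implementation has to apply the complete Extended Grapheme Cluster rules, which is the part a library such as ugrapheme takes care of.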

ugrapheme exposes an interface that's almost identical to Python's native strings and maintains a similar performance envelope, processing strings at hundreds of megabytes or even gigabytes per second.

Aside from passing the Unicode 16.0 UAX #29 Extended Grapheme Clusters grapheme break tests, ugrapheme correctly parses many difficult cases that break other libraries in Python and other languages.

As of this writing (October 2024), ugrapheme is among the fastest and probably among the more correct implementations across all programming languages and operating systems.
