2024-10-18 08:30:04

Use ugrapheme to make your Python and Cython code see strings as a sequence of grapheme characters, so that the length of 👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi is 4 instead of 13.
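To see where the 13 comes from, here is a minimal sketch using only the Python standard library. It rebuilds the example string from explicit code point escapes (the variable names are illustrative) and shows that the built-in `len` counts code points, not the four characters a reader sees.

```python
# Rebuild the example string from explicit escapes so each code point is visible.
scientist = (
    "\U0001F469"   # WOMAN
    "\U0001F3FD"   # EMOJI MODIFIER FITZPATRICK TYPE-4 (skin tone)
    "\u200D"       # ZERO WIDTH JOINER
    "\U0001F52C"   # MICROSCOPE
)
flag = (
    "\U0001F3F4"   # WAVING BLACK FLAG
    "\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074"  # tag letters g, b, s, c, t
    "\U000E007F"   # CANCEL TAG
)
s = scientist + flag + "Hi"

# Python's len() counts Unicode code points.
print(len(scientist))  # 4 code points, one visible character
print(len(flag))       # 7 code points, one visible character
print(len(s))          # 13 code points, four visible characters
```

A grapheme-aware length treats each emoji sequence as a single unit, which is how the same string comes out at 4.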

Routine operations such as reversing a string or getting its first and last character become easy not just for Latin text and emoji, but also for Devanagari, Hangul, Tamil, Bengali, Arabic, and other scripts. Centering and justifying emoji and non-Latin text in terminal output becomes easy again, because ugrapheme uses uwcwidth under the hood.
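The reversal problem can be illustrated without any third-party library. The sketch below is a deliberately simplified stand-in, not ugrapheme's algorithm and not full UAX #29 segmentation: the hypothetical helper `split_combining` only attaches combining marks to the preceding base character, which is enough to show why reversing by code point goes wrong.

```python
import unicodedata

def split_combining(s):
    """Group each base character with the combining marks that follow it.

    Simplified on purpose: handles combining marks only, not emoji
    sequences, Hangul jamo, or the other UAX #29 rules.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the previous base character
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

word = "cafe\u0301"  # "café" written as e + COMBINING ACUTE ACCENT

# Reversing by code point detaches the accent from its base letter.
naive = word[::-1]

# Reversing by cluster keeps the accent attached to the "e".
clustered = "".join(reversed(split_combining(word)))

print(len(word))                   # 5 code points
print(len(split_combining(word)))  # 4 clusters
print(naive == "\u0301efac")       # True: the accent now leads the string
print(clustered == "e\u0301fac")   # True: the accented "e" stays intact
```

Scripts such as Devanagari or Hangul, and emoji sequences joined with zero width joiners, need the full set of segmentation rules, which is the part a dedicated library handles.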

ugrapheme exposes an interface that's almost identical to Python's native strings and maintains a similar performance envelope, processing strings at hundreds of megabytes or even gigabytes per second.
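As a rough sketch of what a string-like interface might look like in use: the import name `graphemes` and its support for `len()` and indexing are assumptions not confirmed by the text above, so check the project documentation before relying on them. The code guards the import and falls back to plain `str` behaviour when the package is not installed.

```python
text = "\U0001F469\U0001F3FD\u200D\U0001F52C" + "Hi"  # scientist emoji followed by "Hi"

# Built-in str: length and indexing work on code points.
code_points = len(text)
print(code_points)      # 6
print(text[0] == "\U0001F469")  # True: only the first code point of the emoji

# Assumed ugrapheme usage; the name and methods below are not confirmed here.
try:
    from ugrapheme import graphemes
except ImportError:
    graphemes = None  # package not installed; skip the grapheme-aware part

if graphemes is not None:
    g = graphemes(text)
    print(len(g))  # grapheme count rather than code point count
    print(g[0])    # the first grapheme as a whole
```

The point of the comparison is that the calling code keeps the same shape (`len`, indexing) while the unit of work changes from code points to graphemes.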

Aside from passing the Unicode 16.0 UAX #29 Extended Grapheme Clusters grapheme break tests, ugrapheme correctly parses many difficult cases that break other libraries in Python and other languages.

As of this writing (October 2024), ugrapheme is among the fastest and probably among the more correct implementations across all programming languages and operating systems.