Make sure you know which Unicode version is supported by your programming language version · m's blog


2021-07-16 14:00:06

While enhancing CATS, I recently added a feature that sends requests containing single and multi code point emojis. This is a single code point emoji: 🥶, which can be represented in Java as the string \uD83E\uDD76. The test case is simple: inject emojis into strings and expect the REST endpoint to sanitize the input and remove them entirely. (I appreciate this might not be valid for all APIs, which is why the behaviour is configurable in CATS, but that is not the focus of this article.)
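To make the single versus multi code point distinction concrete, here is a small sketch that uses only standard String methods. The class name EmojiCodePoints and the second emoji (a woman emoji followed by a skin tone modifier) are illustrative assumptions, not something taken from CATS.

```java
// Illustrative only: EmojiCodePoints is a hypothetical class, not part of CATS.
class EmojiCodePoints {

    // U+1F976 (cold face) lies outside the Basic Multilingual Plane, so Java
    // stores it as two UTF-16 chars (a surrogate pair) forming one code point.
    static final String COLD_FACE = "\uD83E\uDD76";

    // U+1F469 (woman) followed by U+1F3FE (skin tone modifier):
    // two code points that are rendered as a single emoji.
    static final String WOMAN_WITH_SKIN_TONE = "\uD83D\uDC69\uD83C\uDFFE";

    // Counts Unicode code points rather than UTF-16 chars.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(COLD_FACE.length() + " chars, "
                + codePoints(COLD_FACE) + " code point(s)");
        System.out.println(WOMAN_WITH_SKIN_TONE.length() + " chars, "
                + codePoints(WOMAN_WITH_SKIN_TONE) + " code point(s)");
    }
}
```

Counting code points instead of chars does not depend on the Unicode version of the runtime, so this part behaves the same on any JRE.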

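I usually recommend that any REST endpoint sanitize its input before validating it, removing special characters. A typical regex for this is [\p{C}\p{Z}\p{So}]+ (although you should refine it to allow spaces between words), which matches any run of characters from three Unicode general categories: \p{C} covers invisible control and format characters, \p{Z} covers separators such as spaces and line separators, and \p{So} covers other symbols, which is where emojis such as 🥶 are classified.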

I have a test service I use for testing new CATS fuzzers. The idea was to simply use the String’s replaceAll() method to remove all these characters from the String.
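A minimal sketch of that idea follows. InputSanitizer and sanitize() are hypothetical names, not the actual code of the test service. A precompiled Pattern is used here, which should behave the same as calling input.replaceAll(regex, "") on each request. The first example uses U+1F600 (grinning face), chosen as an assumption because it predates Unicode 6.2 and is therefore known even to Java 8.

```java
import java.util.regex.Pattern;

// Illustrative only: InputSanitizer is a hypothetical helper,
// not the author's actual test service.
class InputSanitizer {

    // C = control and format characters, Z = separators (including plain
    // spaces), So = other symbols. Note that this also strips ordinary spaces.
    private static final Pattern SPECIAL_CHARS =
            Pattern.compile("[\\p{C}\\p{Z}\\p{So}]+");

    static String sanitize(String input) {
        return SPECIAL_CHARS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // U+1F600 (grinning face) is classified as So on Java 8 and later.
        System.out.println(sanitize("some\uD83D\uDE00text"));
        // U+1F976 (cold face) is part of Unicode 11, so the result here
        // depends on the Unicode tables of the JRE running the code.
        System.out.println(sanitize("some\uD83E\uDD76text"));
    }
}
```

The second call is deliberately left without an expected output, since what it prints is exactly the runtime-dependent part.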

Even though CATS is compiled for Java 8, I mainly use JDK 11+ for development. At some point I had CATS running in a CD pipeline with JRE 8. The emoji test cases generated by the CATS fuzzers started to fail there, even though they passed on my local box (and in other CD pipelines). I went through the log files: the request payloads were initially constructed and displayed correctly, with the emoji properly printed, but after some pattern matching had run on the string, the result was printed as sometext?andanother. The ? is where the emoji was supposed to be. Further investigation led to the conclusion that the JRE version was what caused the emoji to be mishandled (which might be obvious to 99.999% of Java devs out there). This is actually expected, as Java 8 supports Unicode 6.2, while 🥶 is part of Unicode 11.
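One way to spot this kind of mismatch early is to ask the running JRE how it classifies a given code point. The sketch below is only a diagnostic idea under that assumption. UnicodeSupportCheck and describe() are hypothetical names, and U+1F600 is included purely as a baseline that predates Unicode 6.2.

```java
// Illustrative only: UnicodeSupportCheck is a hypothetical diagnostic,
// not part of CATS.
class UnicodeSupportCheck {

    // Reports whether the running JRE knows the code point at all, and
    // whether it classifies it as an "other symbol" (the So category).
    static String describe(int codePoint) {
        boolean assigned = Character.isDefined(codePoint);
        boolean otherSymbol =
                Character.getType(codePoint) == Character.OTHER_SYMBOL;
        return String.format("U+%X assigned=%b otherSymbol=%b",
                codePoint, assigned, otherSymbol);
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        // Baseline: U+1F600 (grinning face) predates Unicode 6.2.
        System.out.println(describe(0x1F600));
        // U+1F976 (cold face) is Unicode 11; the answer depends on the JRE.
        System.out.println(describe(0x1F976));
    }
}
```

Running this in each pipeline, next to the local JDK, shows quickly whether the two environments agree on the emojis the fuzzers inject.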