Unicode security attacks and test cases – Normalization expansion for buffer overflows
03 Apr 2009
Normalization, like casing operations, can change the number of characters and bytes in a string. In testing software, I want to know how to get the most bang for my buck – in other words, what's the minimal input I can provide to cause the maximum character and byte expansion?
First step: Figure out what normalization operation your input is going through – NFC, NFD, NFKC, or NFKD.
Next step: Find the right input.
For example, if I pass in a character like U+2177 SMALL ROMAN NUMERAL EIGHT (ⅷ), I’ve passed in a single ‘character’ that takes three bytes [E2, 85, B7] to encode in UTF-8. If that character passes through one of the compatibility normalization forms, NFKC or NFKD, its compatibility mapping takes it from one code point to four: U+0076 U+0069 U+0069 U+0069 (“viii”). Those are all ASCII characters, so bytewise I didn’t expand much – just one byte, from three to four – but I gained three extra characters.
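Here’s a minimal sketch of that check in Python, using the standard library’s unicodedata module to push one character through all four forms and count the damage:

```python
import unicodedata

ch = "\u2177"  # U+2177 SMALL ROMAN NUMERAL EIGHT (ⅷ)
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    out = unicodedata.normalize(form, ch)
    print(f"{form}: {len(ch)} -> {len(out)} code points, "
          f"{len(ch.encode('utf-8'))} -> {len(out.encode('utf-8'))} UTF-8 bytes: {out!r}")
```

NFC and NFD leave it alone, since U+2177 has no canonical decomposition; both compatibility forms blow it up to ‘viii’.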
Well, there may be better cases than this one – take a look at the maximum expansion factor table, courtesy of the Unicode Normalization FAQ (https://unicode.org/faq/normalization.html).
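If you’d rather derive the worst cases than trust a table, here’s a brute-force sketch that scans every code point and reports the biggest single-character expansion for each form. (The repertoire it covers depends on the Unicode version your Python build ships with.) In recent Unicode versions, the champion for the compatibility forms is U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM, which decomposes to eighteen code points.

```python
import sys
import unicodedata

def worst_cases(form):
    """Return the biggest single-character expansion under `form`,
    measured in code points and in UTF-8 bytes."""
    best_cp = (0, None)       # (code points out, code point in)
    best_bytes = (0.0, None)  # (byte expansion factor, code point in)
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not valid Unicode scalar values
        ch = chr(cp)
        out = unicodedata.normalize(form, ch)
        if len(out) > best_cp[0]:
            best_cp = (len(out), cp)
        factor = len(out.encode("utf-8")) / len(ch.encode("utf-8"))
        if factor > best_bytes[0]:
            best_bytes = (factor, cp)
    return best_cp, best_bytes

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    (n, cp1), (f, cp2) = worst_cases(form)
    print(f"{form}: {n} code points out (U+{cp1:04X}), "
          f"{f:.1f}x UTF-8 bytes (U+{cp2:04X})")
```

Once you know the winning character, the attack string writes itself: repeat it enough times that the post-normalization output overflows whatever buffer was sized for the input.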