Unicode security attacks and test cases: character mappings and normalization for testing
12 Mar 2009
Point: Normalizing strings after validation is dangerous
Impact: filter evasion, enabling code execution
Are you testing a Web or other application in an attempt to bypass restrictions on domain names? For example, what if you were testing a phishing filter and looking for ways to bypass its URL/IRI restrictions?
Browsers today normalize a URL/IRI using NFKC, in line with the IDNA and Nameprep specifications. NFKC is compatibility decomposition followed by canonical composition: certain characters get mapped to other characters by recursively applying both the canonical and the compatibility mappings. Following string decomposition, a canonical reordering and then a canonical recomposition are applied. Ya, nothing to it.
In one sense, this can seem almost like a best-fit mapping, and it’s why certain characters like the horizontal ellipsis … (highlight it, there’s just one character there), will get mapped to three periods.
That is, U+2026 == U+002E U+002E U+002E.
It’s why the FULLWIDTH SOLIDUS ‘／’ gets mapped to a SOLIDUS ‘/’: U+FF0F == U+002F
Unicode provides the mapping tables for these operations. There’s no magic, and there shouldn’t be anything vendor- or platform-specific happening to change those tables.
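To see those two mappings for yourself, here is a minimal sketch using Python's standard unicodedata module. The helper name codepoints is made up for this example, and the results depend on the Unicode tables bundled with your Python version, though these two mappings are long-standing.

```python
import unicodedata


def codepoints(s: str) -> str:
    """Render a string as space-separated U+XXXX code points (helper for this example)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# U+2026 HORIZONTAL ELLIPSIS and U+FF0F FULLWIDTH SOLIDUS, written as escapes
samples = ["\u2026", "\uff0f"]

for ch in samples:
    # Compatibility normalization applies the compatibility mappings
    nfkc = unicodedata.normalize("NFKC", ch)
    # Canonical normalization alone leaves these two characters untouched
    nfc = unicodedata.normalize("NFC", ch)
    print(codepoints(ch), unicodedata.name(ch))
    print("  NFKC:", codepoints(nfkc))
    print("  NFC: ", codepoints(nfc))
```

Note that only the compatibility forms change these two characters; NFC and NFD leave them as they are.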
Some Unicode tools are available to help generate these characters.
If you’re testing an application to verify it works as expected, try some of these characters in place of your typical ASCII characters. If the app normalizes with a compatibility form (NFKC or NFKD), then these will get reduced to their ASCII equivalents somewhere along the way.
U+1D400 MATHEMATICAL BOLD CAPITAL A U+0041
U+1D401 MATHEMATICAL BOLD CAPITAL B U+0042
U+1D402 MATHEMATICAL BOLD CAPITAL C U+0043
U+1D403 MATHEMATICAL BOLD CAPITAL D U+0044
U+1D404 MATHEMATICAL BOLD CAPITAL E U+0045
U+1D405 MATHEMATICAL BOLD CAPITAL F U+0046
U+1D406 MATHEMATICAL BOLD CAPITAL G U+0047
U+1D407 MATHEMATICAL BOLD CAPITAL H U+0048
U+1D408 MATHEMATICAL BOLD CAPITAL I U+0049
U+1D409 MATHEMATICAL BOLD CAPITAL J U+004A
U+1D40A MATHEMATICAL BOLD CAPITAL K U+004B
U+1D40B MATHEMATICAL BOLD CAPITAL L U+004C
U+1D40C MATHEMATICAL BOLD CAPITAL M U+004D
U+1D40D MATHEMATICAL BOLD CAPITAL N U+004E
U+1D40E MATHEMATICAL BOLD CAPITAL O U+004F
U+1D40F MATHEMATICAL BOLD CAPITAL P U+0050
U+1D410 MATHEMATICAL BOLD CAPITAL Q U+0051
U+1D411 MATHEMATICAL BOLD CAPITAL R U+0052
U+1D412 MATHEMATICAL BOLD CAPITAL S U+0053
U+1D413 MATHEMATICAL BOLD CAPITAL T U+0054
U+1D414 MATHEMATICAL BOLD CAPITAL U U+0055
U+1D415 MATHEMATICAL BOLD CAPITAL V U+0056
U+1D416 MATHEMATICAL BOLD CAPITAL W U+0057
U+1D417 MATHEMATICAL BOLD CAPITAL X U+0058
U+1D418 MATHEMATICAL BOLD CAPITAL Y U+0059
U+1D419 MATHEMATICAL BOLD CAPITAL Z U+005A
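The table above can be turned into a quick test-string generator. The sketch below also shows why normalizing after validation is dangerous, using a made-up filter and a made-up processing step; bold_capitals, naive_filter_allows, and process are hypothetical names for illustration only, not anything from a real product.

```python
import unicodedata


def bold_capitals(text: str) -> str:
    """Swap ASCII A-Z for MATHEMATICAL BOLD CAPITAL A-Z (U+1D400..U+1D419)."""
    return "".join(
        chr(0x1D400 + ord(c) - ord("A")) if "A" <= c <= "Z" else c
        for c in text
    )


def naive_filter_allows(value: str, blocked: str = "SCRIPT") -> bool:
    """Hypothetical filter: inspects only the raw, un-normalized input."""
    return blocked not in value


def process(value: str) -> str:
    """Hypothetical downstream step that normalizes AFTER validation."""
    return unicodedata.normalize("NFKC", value)


payload = bold_capitals("SCRIPT")

print(naive_filter_allows("SCRIPT"))   # the plain ASCII string is blocked
print(naive_filter_allows(payload))    # the bold variant passes the filter
print(process(payload) == "SCRIPT")    # ...and then normalizes back to ASCII
```

If the normalization ran before the filter instead of after it, the filter would see the ASCII string and block it.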
Bonus character!: Sometimes compatibility mappings aren’t good enough for testing; maybe the app performs an NFC or NFD normalization instead. Here’s a character you can try that maps canonically to the ASCII/Latin letter ‘K’ in any of the four Unicode normalization forms:
U+212A KELVIN SIGN U+004B
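A quick way to confirm the KELVIN SIGN behavior is to run it through all four forms. This is a small sketch with Python's standard unicodedata module; the variable names are just for this example.

```python
import unicodedata

KELVIN = "\u212a"  # U+212A KELVIN SIGN

# Normalize the single character under each of the four normalization forms
results = {
    form: unicodedata.normalize(form, KELVIN)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, out in results.items():
    print(form, f"U+{ord(out):04X}")
```

Each form should report U+004B, since KELVIN SIGN has a canonical (not just compatibility) mapping to ‘K’.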