Unicode root-cause security issues for generating test cases
09 Sep 2008
When it comes to Unicode implementations, there's a rich set of test cases to perform. Realizing it is the start. Automating it is the next step.
Most Unicode-related security bugs can be categorized into the following root causes:
Canonicalization
- Interpreting non-shortest form (e.g. UTF-8 encoding trickery)
- Other decoding issues
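As a starting point for automating the non-shortest-form case, here is a minimal sketch in Python. The helper name `accepts_overlong` and the list `OVERLONG_SLASH` are made up for illustration; the idea is to feed overlong encodings of "/" to whatever decoder is under test and report the ones it wrongly accepts.

```python
# Overlong (non-shortest form) encodings of "/" (U+002F).
# The only valid UTF-8 encoding of "/" is the single byte 0x2F.
OVERLONG_SLASH = [
    b"\xc0\xaf",          # 2-byte overlong form
    b"\xe0\x80\xaf",      # 3-byte overlong form
    b"\xf0\x80\x80\xaf",  # 4-byte overlong form
]


def accepts_overlong(decode, sequences=OVERLONG_SLASH):
    """Return the sequences that `decode` wrongly turns into '/'."""
    accepted = []
    for seq in sequences:
        try:
            if "/" in decode(seq):
                accepted.append(seq)
        except UnicodeDecodeError:
            pass  # rejecting the sequence is the safe behavior
    return accepted


def strict_utf8(data):
    # Python's built-in UTF-8 decoder rejects overlong forms.
    return data.decode("utf-8")


print(accepts_overlong(strict_utf8))
```

To test another decoder, pass in any callable that takes bytes and returns a string.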
Absorption (over-consumption)
- Over-consuming invalid byte sequences or correcting rather than failing
- When <41 C2 C3 B1 42> becomes <41 42>
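The byte sequence above makes a handy test vector. The sketch below shows how Python's own UTF-8 decoder treats it under its three error modes. It is only a reference point for comparison, not a statement about how any other decoder behaves.

```python
# The test vector from the list above: <41 C2 C3 B1 42>.
# C2 is a lead byte with no continuation byte; C3 B1 is a valid "ñ".
data = bytes([0x41, 0xC2, 0xC3, 0xB1, 0x42])


def decode_strict(raw):
    """Return the decoded string, or None if the input is rejected."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


strict = decode_strict(data)                       # fails outright
replaced = data.decode("utf-8", errors="replace")  # marks the bad byte
ignored = data.decode("utf-8", errors="ignore")    # silently drops C2

# An over-consuming decoder would produce "AB", swallowing the valid "ñ".
for label, result in [("strict", strict), ("replace", replaced), ("ignore", ignored)]:
    print(label, repr(result), "over-consumed" if result == "AB" else "ok")
```

Failing or inserting U+FFFD keeps the damage visible. The "ignore" mode still keeps "ñ", but it hides the fact that a byte was dropped.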
Character deletion and swallowing
- “deletion of noncharacters” (UTR-36)
- <scr[U+FEFF]ipt> becomes <script>
- Use replacement characters instead!
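A minimal sketch of why deletion is dangerous and replacement is safer. The set `STRIPPED` and the function names are hypothetical, and the filter is a deliberately naive blocklist check.

```python
# Hypothetical sample of code points that some layer silently removes.
STRIPPED = {"\ufeff", "\ufffe", "\uffff"}


def blocklist_allows(s):
    """A naive filter that only looks for the literal '<script'."""
    return "<script" not in s.lower()


def delete_chars(s):
    """Unsafe: drops the characters, joining the text around them."""
    return "".join(c for c in s if c not in STRIPPED)


def replace_chars(s):
    """Safer: substitutes U+FFFD so the text around them stays apart."""
    return "".join("\ufffd" if c in STRIPPED else c for c in s)


payload = "<scr\ufeffipt>"
print(blocklist_allows(payload))     # the filter sees nothing suspicious
print(repr(delete_chars(payload)))   # deletion rebuilds the tag
print(repr(replace_chars(payload)))  # replacement keeps it broken
```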
Interpreting Syntax replacements
- white space and line feeds
- E.g. when U+180E acts like U+0020
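One way to build a candidate list for this category is to ask the runtime which non-ASCII code points it treats as white space. This is a sketch, assuming Python's `str.isspace()` is a reasonable stand-in for the parser under test, which it may not be. Whether U+180E appears depends on the Unicode version the runtime ships with, since its classification changed between Unicode versions.

```python
import sys
import unicodedata


def non_ascii_whitespace():
    """List every non-ASCII code point that str.isspace() accepts."""
    return [
        chr(cp)
        for cp in range(0x80, sys.maxunicode + 1)
        if chr(cp).isspace()
    ]


# Each entry is a candidate to inject wherever a parser expects U+0020.
for ch in non_ascii_whitespace():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```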
Best-fit mappings
- When σ becomes s
- When ′ (U+2032) becomes ' (U+0027)
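Python does not apply best-fit mappings on its own, so the sketch below fakes one with a two-entry table covering only the two examples above. The table `BEST_FIT`, the assumption that the prime maps to an ASCII apostrophe, and the function names are all illustrative. Real best-fit tables are specific to the code page and much larger.

```python
# Illustrative only: the two mappings named in the list above.
BEST_FIT = {
    "\u03c3": "s",  # GREEK SMALL LETTER SIGMA -> s
    "\u2032": "'",  # PRIME -> apostrophe (assumed target)
}


def best_fit(s):
    """Simulate a lossy conversion that applies best-fit mappings."""
    return "".join(BEST_FIT.get(c, c) for c in s)


def contains_quote(s):
    """A naive filter that looks for the ASCII single quote."""
    return "'" in s


payload = "x\u2032 OR 1=1"
print(contains_quote(payload))            # False: the filter passes it
print(contains_quote(best_fit(payload)))  # True: the quote appears later
```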
Buffer overruns
- Incorrect assumptions about string sizes (chars vs. bytes)
- Improper width calculations
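The chars-versus-bytes mismatch is easy to show. The helper `sizes` below is a made-up name; it reports three counts for the same string that code often treats as interchangeable.

```python
def sizes(s):
    """Report three different 'lengths' of the same string."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
    }


# A buffer sized from one of these counts and filled using another
# is the classic setup for an overrun.
for sample in ["s", "\u03c3", "\u20ac", "\U0001F600"]:
    print(repr(sample), sizes(sample))
```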
Timing issues
- handling Unicode after security gates
- Sometimes handling Unicode before a gate can be a problem too! E.g. BOM handling
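A small sketch of the "after the gate" case, using NFKC normalization as the Unicode handling step. The function `gate` is a hypothetical, deliberately simple filter, and a real application may normalize differently or not at all.

```python
import unicodedata


def gate(s):
    """A naive security gate: allow input unless it contains '<script'."""
    return "<script" not in s.lower()


# Fullwidth "<" and ">" (U+FF1C, U+FF1E) fold to ASCII under NFKC.
payload = "\uff1cscript\uff1e"

passed_gate = gate(payload)                            # checked first
normalized = unicodedata.normalize("NFKC", payload)    # handled afterwards

print(passed_gate, repr(normalized))
print(gate(normalized))  # checking after normalization catches it
```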
So, what do you think about multibyte characters? Say that, in SQL, you inject a multibyte character whose byte sequence includes a single quote ('), and that byte is interpreted as the closing quote of the SQL string even though it evaded the filter. Do you think this case can be grouped into the list above as well?
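For test-case purposes, a closely related and well-known variant can be sketched as follows. Here the escaping backslash, rather than the quote itself, gets absorbed into a multibyte character. The sketch assumes the database connection interprets bytes as GBK, and `naive_escape` is a hypothetical byte-level escaper.

```python
def naive_escape(raw: bytes) -> bytes:
    """Escape backslashes and single quotes byte by byte, ignoring charset."""
    return raw.replace(b"\\", b"\\\\").replace(b"'", b"\\'")


raw = b"\xbf' OR 1=1 -- "
escaped = naive_escape(raw)      # now begins with BF 5C 27
decoded = escaped.decode("gbk")  # BF 5C decodes as one GBK character

# The backslash has been absorbed into the multibyte character,
# so the quote that follows it is no longer escaped.
print(escaped[:3].hex(" "))
print("\\" in decoded, decoded[1] == "'")
```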