Table 3.1B from Corrigendum #1: UTF-8 Shortest Form provides the basis for some interesting test cases. Hopefully I'll have something to report about this soon. In the meantime, John Hernandez and I are structuring tests across all browsers to look for new XSS vectors through character absorption, swallowing, and exclusion.

Table 3.1B. Legal UTF-8 Byte Sequences

| Code Points | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+FFFF | E1..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
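
To make the table easier to turn into test cases, here is a minimal Python sketch that checks a byte string against the rows of Table 3.1B exactly as quoted above. The names `TABLE_3_1B` and `is_legal_utf8` are my own, not from the corrigendum, and this is only an illustration of the table, not a description of how any browser actually decodes input.

```python
# Rows of Table 3.1B as quoted above (hypothetical encoding of the table):
# (range of the 1st byte, [allowed range for each following byte]).
TABLE_3_1B = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_legal_utf8(data: bytes) -> bool:
    """Return True if every sequence in `data` matches a row of the table."""
    i = 0
    while i < len(data):
        # Find the row whose 1st-byte range contains the current byte.
        for (lo, hi), trail in TABLE_3_1B:
            if lo <= data[i] <= hi:
                break
        else:
            return False  # 80..C1 and F5..FF never start a legal sequence

        following = data[i + 1 : i + 1 + len(trail)]
        if len(following) != len(trail):
            return False  # sequence is cut off before its last byte
        for byte, (tlo, thi) in zip(following, trail):
            if not (tlo <= byte <= thi):
                return False  # trailing byte outside the range the row allows
        i += 1 + len(trail)
    return True
```

For example, `is_legal_utf8(b"<")` is true, while the overlong encodings of `<`, `b"\xC0\xBC"` and `b"\xE0\x80\xBC"`, are rejected because `C0` is not a legal first byte and `E0` requires a second byte in `A0..BF`. Note that the rows as quoted here do not single out `ED`, so this sketch accepts `b"\xED\xA0\x80"`; later revisions of the Unicode Standard narrow that row to exclude surrogates.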