Browser user-agents and variable-width UTF-8 encoding issues

18 May 2008

Table 3.1B from Corrigendum #1: UTF-8 Shortest Form provides the basis for some interesting test cases. Hopefully I'll have something to report about this soon. In the meantime, John Hernandez and I are structuring tests across all browsers to look for new XSS vectors through character absorption, swallowing, and exclusion.

**Table 3.1B. Legal UTF-8 Byte Sequences**

| Code Points | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
| --- | --- | --- | --- | --- |
| `U+0000..U+007F` | `00..7F` | | | |
| `U+0080..U+07FF` | `C2..DF` | `80..BF` | | |
| `U+0800..U+0FFF` | `E0` | `A0..BF` | `80..BF` | |
| `U+1000..U+FFFF` | `E1..EF` | `80..BF` | `80..BF` | |
| `U+10000..U+3FFFF` | `F0` | `90..BF` | `80..BF` | `80..BF` |
| `U+40000..U+FFFFF` | `F1..F3` | `80..BF` | `80..BF` | `80..BF` |
| `U+100000..U+10FFFF` | `F4` | `80..8F` | `80..BF` | `80..BF` |
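
To make the table concrete, here is a rough Python sketch that checks a byte string against the rows exactly as listed above. The names `LEGAL_SEQUENCES`, `match_legal_sequence`, and `is_shortest_form` are made up for illustration; they are not part of any browser, library, or of the tests mentioned in this post. It only shows how the table separates legal sequences from overlong ones, and is not a complete decoder.

```python
# Each row of the table: the allowed (low, high) range for every byte position.
LEGAL_SEQUENCES = [
    [(0x00, 0x7F)],
    [(0xC2, 0xDF), (0x80, 0xBF)],
    [(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    [(0xE1, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
]


def match_legal_sequence(data, start=0):
    """Return the length of the legal sequence beginning at `start`, or 0."""
    for row in LEGAL_SEQUENCES:
        chunk = data[start:start + len(row)]
        # A row matches only if every byte falls inside its column's range.
        if len(chunk) == len(row) and all(
            low <= byte <= high for byte, (low, high) in zip(chunk, row)
        ):
            return len(row)
    return 0


def is_shortest_form(data):
    """True if `data` is made up only of sequences listed in the table."""
    position = 0
    while position < len(data):
        length = match_legal_sequence(data, position)
        if length == 0:
            return False
        position += length
    return True
```

For example, `is_shortest_form(b"/")` is true, while the two-byte overlong form of the same character, `is_shortest_form(b"\xc0\xaf")`, is false because `C0` never appears as a legal first byte. Sequences like that second one are the kind of input a user-agent should reject rather than quietly decode.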

lookout.net
