Unicode attacks and test cases - Visual Spoofing, IDN homograph attacks, and the Single Script Confusables
03 Dec 2008
More on lookalikes, confusables, IDN homograph attacks, and other fun stuff, continued from the previous post.
The Confusables
These types of visual attacks are attributed to what are known as 'the confusables', which are documented in Unicode Technical Report 36 (TR36) and TR39. 'Confusables' is the name given to characters and strings that look essentially alike. The Unicode Consortium defines three main classes of confusable strings:
- Single-script
- Mixed-script
- Whole-script
I want to investigate each one in turn. Because I'm simplifying things here, I may not be accurate in my use of the terms script, alphabet, letter, and so on. Linguists understand these distinctions better than I do, but for the rest of us, the term 'script' refers to:
A collection of letters and other written signs used to represent textual information in one or more writing systems. For example, Russian is written with a subset of the Cyrillic script; Ukrainian is written with a different subset. The Japanese writing system uses several scripts.
Single-script confusables
These occur when letters from the same alphabet, or script, are used to produce the same visual appearance. I would extend this definition to say that they also occur when letters from the same script are combined with characters from the Inherited or Common scripts. For example, the following two sequences of Latin letters look identical:
- so̷s
- søs
If you take these apart, there's a big difference. While the letter 's' is the same in each, the 'o̷' and 'ø' are different. The first uses the Basic Latin 'o' followed by a combining diacritical mark named COMBINING SHORT SOLIDUS OVERLAY, whose script is Inherited. To put it a different way, we have two separate Unicode code points here, which together give the effect of a single character or letter. The second uses the single character LATIN SMALL LETTER O WITH STROKE. Let's look at the Unicode code point values for each.
- so̷s == \u0073\u006F\u0337\u0073
- søs == \u0073\u00F8\u0073
As you can see, the first 'o̷' is formed from two Unicode code points, U+006F and U+0337. If you copy and paste that word into a text editor that supports Unicode (e.g. Notepad) and press Backspace, you'll see that the first press removes the combining diacritical mark and the second removes the 'o'. Continuing with the example, the second 'ø' is a single Unicode code point, U+00F8, from the Latin-1 Supplement block. At a lower level, because different code points and bytes produce the same visual effect, we have a case of the confusables.
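If you'd rather not trust your eyes or your text editor, the same comparison can be sketched in a few lines of Python. This is only an illustration using the standard unicodedata module; the variable and function names are my own.

```python
import unicodedata

combining = "so\u0337s"    # s, o, COMBINING SHORT SOLIDUS OVERLAY, s
precomposed = "s\u00F8s"   # s, LATIN SMALL LETTER O WITH STROKE, s

def describe(text):
    """Return a (code point, character name) pair for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for word in (combining, precomposed):
    # Show the length in code points and the UTF-8 bytes as hex.
    print(len(word), word.encode("utf-8").hex())
    for code_point, name in describe(word):
        print("   ", code_point, name)

# The two strings are different, and NFC normalization does not reconcile
# them, because U+00F8 has no canonical decomposition into o + U+0337.
print(combining == precomposed)
print(unicodedata.normalize("NFC", combining)
      == unicodedata.normalize("NFC", precomposed))
```

Both comparisons at the end print False, which is the point: normalizing alone will not tell you that these two words look the same.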
Let's take a closer look at what qualifies as a single-script confusable for the Latin capital letter 'A' - taken from the confusables table at http://unicode.org/reports/tr39/data/confusables.txt.
FF21 ; 0041 ; SA # ( Ａ → A ) FULLWIDTH LATIN CAPITAL LETTER A → LATIN CAPITAL LETTER A
1D400 ; 0041 ; SA # ( 𝐀 → A ) MATHEMATICAL BOLD CAPITAL A → LATIN CAPITAL LETTER A # {nfkc:119809}
1D434 ; 0041 ; SA # ( 𝐴 → A ) MATHEMATICAL ITALIC CAPITAL A → LATIN CAPITAL LETTER A # {nfkc:119861}
Update: I just realized that some of the characters broke WordPress, so I've converted them all to NCRs. In the above you can see three characters that all look visually similar to the Latin capital letter 'A'. The first number is the code point of the confusable, the second number, 0041, is the code point of 'A', the third field, SA, is the table type (single-script, any case), and everything after the # is descriptive text.
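For anyone who wants to work with the table programmatically, here is a rough sketch of a parser for lines in the format shown above. It assumes the three semicolon-separated fields seen in these entries (source, target, type) with an optional # comment; the function name and the SAMPLE constant are my own, and the real file may contain lines this sketch doesn't anticipate.

```python
# Three entries in the format quoted above, with the comments shortened.
SAMPLE = """\
FF21 ; 0041 ; SA # FULLWIDTH LATIN CAPITAL LETTER A
1D400 ; 0041 ; SA # MATHEMATICAL BOLD CAPITAL A
1D434 ; 0041 ; SA # MATHEMATICAL ITALIC CAPITAL A
"""

def parse_confusables(text):
    """Parse 'source ; target ; type # comment' lines into tuples of strings."""
    entries = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue                           # blank or comment-only line
        source, target, kind = (field.strip() for field in data.split(";"))
        entries.append((
            # A field may hold several space-separated hex code points.
            "".join(chr(int(cp, 16)) for cp in source.split()),
            "".join(chr(int(cp, 16)) for cp in target.split()),
            kind,
        ))
    return entries

for source, target, kind in parse_confusables(SAMPLE):
    print(f"U+{ord(source):04X} looks like {target!r} (table {kind})")
```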
The 'Mathematical' characters are considered single-script confusables because their script property is Common.
Other scripts have their own characters that are confusable with the Latin 'A', but those are considered mixed-script, which I'll go over in another post. For now I'll leave you with a list of test cases for single-script confusables. Some are more obvious than others, and it all depends on the font:
- Microsoft → Micros𝗈ft
- Apple → Ap𝗉le
- Google → Google
- IBM → IBM
- Oracle → O𝗿𝗮cle
- Intel → Int𝗲𝗹
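One cheap way to flag test cases like these is to compare a string against its NFKC-normalized form, since the Mathematical and fullwidth characters carry compatibility mappings back to plain Latin letters. The sketch below is just that heuristic, not the skeleton algorithm from TR39, and it will not catch every confusable (the 'so̷s' versus 'søs' pair above slips through). The code points in CASES are my assumption about which look-alike letters are in play, and the function names are hypothetical.

```python
import unicodedata

# Assumed inputs: brand names with one letter swapped for a
# MATHEMATICAL SANS-SERIF look-alike.
CASES = {
    "Microsoft": "Micros\U0001D5C8ft",  # MATHEMATICAL SANS-SERIF SMALL O
    "Apple": "Ap\U0001D5C9le",          # MATHEMATICAL SANS-SERIF SMALL P
}

def suspicious_code_points(text):
    """List each non-ASCII code point with its name and its NFKC folding."""
    return [
        (f"U+{ord(ch):04X}",
         unicodedata.name(ch, "<unnamed>"),
         unicodedata.normalize("NFKC", ch))
        for ch in text
        if ord(ch) > 0x7F
    ]

def looks_like(candidate, brand):
    """True if candidate is not the brand itself but folds to it under NFKC."""
    return (candidate != brand
            and unicodedata.normalize("NFKC", candidate) == brand)

for brand, spoof in CASES.items():
    print(brand, looks_like(spoof, brand), suspicious_code_points(spoof))
```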
Let me ask some questions.
FYI, the Combining Diacritical Mark does not affect FF2 and Safari2 (SF + ver) on Mac. It affects FF3, SF3, OP9, IE7, Chrome0.4 (CH + ver).
- Are these visual appearances of the characters OS-level or App-level? (I would guesstimate it is App-level though.)
- If it is not OS-level, aren't the browsers breaking some normal functionality (it's just text after all, and someone will want to print it, right?)? How is that remedied?
===
Next, for the previous post. =)
FYI: for Punycode, FF2 and SF2 on Mac are not affected. On Windows, FF3, SF3, IE7, and CH0.4 are not affected, but OP9 is.
This goes back to my question on your original post. So, with security in mind, is IDN basically inapplicable (if DNS is still pure ASCII, or if IDN depends only on ASCII)?
Moreover, this suggests to me that the answer to the question above is app-level, since only OP9 is affected.
===
What do you think?
BTW, REALLY great posts, I'm looking forward to the rest of 'em!
I think the browsers you say were not affected just didn't have Unicode implementations as good as the ones they have today - funny that, eh.
As far as Punycode goes, it's up to the browsers to handle it and to decide what to display to the user. Yes, in the end IDN seems to be mostly a convenience. Although it represents a huge change to domain names on the surface, underneath it's still ASCII.
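To make the "underneath it's still ASCII" point concrete, here is a small sketch using Python's built-in punycode and idna codecs (the latter implements the older IDNA 2003 rules). The .example hostname is made up for illustration, and what any given browser chooses to display is a separate matter.

```python
precomposed = "s\u00F8s"   # uses LATIN SMALL LETTER O WITH STROKE
combining = "so\u0337s"    # uses o + COMBINING SHORT SOLIDUS OVERLAY

# Raw Punycode turns each Unicode label into an ASCII-only byte string.
ascii_precomposed = precomposed.encode("punycode")
ascii_combining = combining.encode("punycode")
print(ascii_precomposed, ascii_combining)

# The idna codec additionally applies the "xn--" prefix per label.
hostname = (precomposed + ".example").encode("idna")
print(hostname)

# Decoding recovers the Unicode form that a browser might show.
print(hostname.decode("idna"))
```

The two look-alike labels end up as different ASCII strings, so the DNS sees them as different names even though a person may not.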