For Developers‎ > ‎Design Documents‎ > ‎

IDN in Google Chrome

Background

Back in the day, hostnames could only consist of the letters A to Z, digits, and a few other characters. Internationalized Domain Names (IDNs) were devised to support arbitrary Unicode characters in hostnames in a backward-compatible way. This works by having user agents transform a hostname containing Unicode characters to one fitting the traditional mold, which can then be sent on to DNS servers. For example, http://öbb.at is transformed to http://xn--bb-eka.at. The transformed form is called punycode.

Ideally, user agents could always display the Unicode version of a hostname. However, different characters from different languages can look very similar, and this can make phishing attacks possible. For example, the Latin "a" looks a lot like the Cyrillic "а", so someone could register http://ebаy.com (http://xn--eby-7cd.com/), which would easily be mistaken for http://ebay.com. This is called a homograph attack.

In a perfect world, domain registrars would not allow such nefarious domain names to be registered. Some TLD registrars do exactly that, mostly by restricting the characters allowed, but many do not. For some TLDs that are meant to be international, this would be nontrivial to do (e.g., .com).

As a result, all browsers try to protect against homograph attacks by displaying punycode instead of the original IDN if the hostname does not fulfill certain properties. They try to do this in a way that allows IDN to be shown for valid hostnames, but protects against phishing.

Google Chrome's IDN policy

Google Chrome decides if it should show IDN or punycode for each component of a hostname separately. To decide if a component should be shown in IDN form, Google Chrome uses an algorithm that depends on the languages that the user claims to understand.  On all platforms, these languages can be configured in the Google Chrome's Fonts and Languages dialog. On Mac OS X, they used to be derived from the system language, but  not any more. The algorithm is:
  • Build the set of characters in the component.
  • If that set contains at least one blacklisted character, punycode is displayed.
  • If the language list is empty, the component is displayed in IDN form exactly if it contains only characters used in a single language and the following steps are skipped.
  • The characters 0-9, +, -, [, ], _, and the space character are removed from the set.
  • If all the remaining characters belong to a single of the languages in the language list as described below, the IDN form is displayed. Otherwise punycode is shown.
(This is implemented by IsIDNComponentSafe() in net/base/net_util.cc.)

The characters belonging to a given language are specified by the Unicode Consortium's CLDR dataset (schema; you can also view the exemplary characters of every language online, here's for example the set for Japanese). In Google Chrome, this is implemented by the ICU function ulocdata_getExemplarSet(), with the characters a-z added for whitelisted languages whose glyphs can't be confused with a-z. The whitelisted languages currently are Chinese (zh), Japanese (ja), and Korean (ko).

Consequences / Examples

Google Chrome will display IDN for components of a hostname consisting solely of characters that belong to one of the languages selected in the language settings—even on .com and .net domains, not only in domains native to that language. For example, http://россия.net will be displayed in IDN form if you claim to speak Russian or another language written in Cyrillic, and as punycode otherwise. Likewise, http://私の団体も.jp/ will be shown in IDN form only if you claim to speak Japanese in Google Chrome's options.

Google Chrome will always display punycode for components of a hostname that contains characters not in the main exemplary character set of any language. For example, http://☃.net/ will always be displayed as punycode in Google Chrome.

Google Chrome will always display punycode for components that mix letters from multiple languages. For example, there is not a single language that contains all characters found in http://şøñđëřżēıċħęŋđőmæîņĭśŧşũþėŗ.de, so this will be shown as punycode. Likewise, http://ebаy.com (with a Cyrillic "а") will always be shown as punycode, even if both English and Russian are in the accepted languages. This is true even if the domain is below a TLD whose registry takes care to protect against homograph attacks.

There are cases where confusion is still possible with this algorithm.  For example, the first component of http://сахар.ru is entirely Cyrillic, while the first component of http://caxap.ru is entirely Latin.  If you claim to speak Russian or another Cyrillic language, both of these will be shown in IDN form.

Behavior of other browsers

IE

IE displays URLs in IDN form if every component contains only characters of one of the languages configured in "Languages" on the "General" tab of "Internet Options", similar to what Google Chrome does.

Firefox

Firefox uses a script mixing detection algorithm based on the "Moderately Restrictive" profile of Unicode Technical Report 39. Domains of any single script, any single script + Latin, or a small whitelist of other combinations are displayed as Unicode; everything else is Punycode.

Opera

Like Firefox, Opera has a whitelist of TLDs and shows IDN only for these whitelisted TLDs.

Safari

Safari has a whitelist of scripts that do not contain confusable characters, and only shows the IDN form for whitelisted scripts. The whitelist does not include Cyrillic and Greek (they are confusable with Latin characters), so Safari will always show punycode for Russian and Greek URLs.


Comments