Domain names containing special characters and umlauts (like “ü” in “grün.info”)
can be registered under more and more top-level domains.
In the following, we want to provide an insight into the technical foundations
of “Internationalized Domain Names” (IDN).
All you need to use these domain names is a modern program,
such as the browsers Mozilla 1.4, Netscape 7.1 or Opera 7.2.
They are fully preconfigured and ready to go.
Some browsers still have to evolve
Your browser has to support the conversion into Punycode in order to
use domain names with local language characters. (See below for an
explanation of Punycode.) In contrast to the browsers listed above,
Microsoft's Internet Explorer currently doesn't offer support for
IDN. To become “IDN-aware”, it has to be extended with
plug-ins, the best known being VeriSign's I-Nav plug-in.
Use this link to test whether your system is configured to use IDN.
The link takes you to our test page grün.knipp.de.
If a new browser window opens showing a green (German: grün) page, you are
ready to use Internationalized Domain Names.
If your browser cannot display the test page, you can convert the domain
name “grün.knipp.de” into Punycode yourself using our conversion tool.
You can then load the page by copying the resulting character string
into the address bar of your browser.
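If no conversion tool is at hand, the same conversion can be sketched in a few lines of Python. This is only an illustration, assuming Python's built-in `idna` codec, which follows the original IDNA rules; the helper name `to_ascii` is our own choice.

```python
def to_ascii(domain: str) -> str:
    """Convert a domain name with local characters to its ASCII form."""
    # The built-in "idna" codec handles each dot-separated part separately
    # and leaves plain ASCII parts unchanged.
    return domain.encode("idna").decode("ascii")


# The printed string can be pasted into the address bar of the browser.
print(to_ascii("grün.knipp.de"))
```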
The introduction of German umlauts and other local
language characters discussed here concerns domain names only. The contents of a
website have been able to contain such characters from the very beginning.
For example, Knipp once created a Japanese website called
“Germany Shop”, which was used to sell typical German products in Japan.
The Internet first spread in the United States of America. The
English language uses hardly any special characters. Therefore,
the entire technical infrastructure, including the domain name system, was
based on the letters “a” to “z”, the digits “0” to “9”
and the hyphen. Such domain names are also called LDH names
(Letters, Digits, Hyphen).
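The LDH restriction can be expressed as a simple check. The following is a minimal sketch, assuming that only the character set matters; further rules of the domain name system, such as length limits, are not checked, and the function name `is_ldh_name` is hypothetical.

```python
import re

# Letters, digits and hyphen only. re.IGNORECASE is avoided on purpose,
# because in Python it would let a few non-ASCII letters slip through.
LDH_LABEL = re.compile(r"[A-Za-z0-9-]+")


def is_ldh_name(domain: str) -> bool:
    """Return True if every dot-separated part uses only LDH characters."""
    return all(LDH_LABEL.fullmatch(label) for label in domain.split("."))


print(is_ldh_name("knipp.de"))       # only letters and a dot separator
print(is_ldh_name("grün.knipp.de"))  # contains the umlaut "ü"
```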
To lift this restriction, it would be necessary to
completely replace equipment and software, including
all e-mail relays, web proxies, firewalls, etc. This
would be so complex and expensive that it is virtually impossible.
Power On for Extensions
Scientists, among them engineers, computer scientists and linguists,
have devised an alternative to replacing the complete
infrastructure. Only the end devices or, to be more exact, the
“end software” have to be changed. In other words, only the browsers and
the e-mail programs have to understand special characters.
Software with this capability is called “IDN-aware”. It unambiguously maps
every domain name containing a local language character to a new name, which
in turn contains only the letters “a” to “z”, the digits “0” to “9”
and the hyphen.
It is the task of the Unicode Consortium
to define which characters can be represented throughout the world.
At present, about 70,000 characters are defined. Each registry can choose
which subset of this huge set of characters it wants to allow for its
top-level domain. Afilias, for example, the registry for .info domains,
initially decided to allow only the umlauts “ä”, “ö” and
“ü” as well as the German character “ß”.
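Such a registry policy amounts to a per-domain list of additional characters. The sketch below assumes a hypothetical policy table `EXTRA_CHARS`; only the `.info` entry is taken from the text above, and the function name is our own.

```python
import string

# Basic LDH characters (letters, digits, hyphen).
LDH_CHARS = set(string.ascii_lowercase + string.digits + "-")

# Hypothetical policy table; the "info" entry follows the text above.
EXTRA_CHARS = {"info": set("äöüß")}


def allowed_by_registry(label: str, tld: str) -> bool:
    """Check a name against the character subset chosen for a top-level domain."""
    allowed = LDH_CHARS | EXTRA_CHARS.get(tld, set())
    return all(ch in allowed for ch in label.lower())


print(allowed_by_registry("grün", "info"))    # "ü" is in the .info subset
print(allowed_by_registry("genève", "info"))  # "è" is not
```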
Using the rear exit to avoid problems
The conversion used by the end software is called Punycode conversion.
It is defined in RFC 3492, which serves as a kind of industry standard.
One design consideration was that the converted name should give a
reasonable idea of what the original name sounds like. Example:
You can use our conversion site
to try additional conversions yourself. You can also use it to
convert Punycode back to the original notation.
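The reverse direction can be tried out locally as well. This is a small sketch, again assuming Python's built-in `idna` codec; `to_unicode` is a name chosen for this example.

```python
def to_unicode(domain: str) -> str:
    """Convert a Punycode domain name back to its original notation."""
    # Decoding with the "idna" codec reverses the conversion part by part;
    # parts without the "xn--" prefix are returned unchanged.
    return domain.encode("ascii").decode("idna")


print(to_unicode("xn--mller-kva.info"))
```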
By the way, each character string that is separated by dots (a label) is
converted individually. For example, the subdomain
“käse.müller.info” is converted to “xn--kse-qla.xn--mller-kva.info”.
This is called individual label conversion.
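Individual label conversion can be made explicit by splitting the name at the dots. The sketch below uses Python's raw `punycode` codec and adds the prefix by hand; it deliberately skips the preparation step (for example lowercasing) that real IDN-aware software performs first, and the function names are hypothetical.

```python
ACE_PREFIX = "xn--"


def convert_label(label: str) -> str:
    """Punycode-encode one label; plain ASCII labels stay untouched."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def convert_domain(domain: str) -> str:
    """Convert each dot-separated label on its own and rejoin them."""
    return ".".join(convert_label(label) for label in domain.split("."))


print(convert_domain("käse.müller.info"))  # xn--kse-qla.xn--mller-kva.info
```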
Every Punycode label consists of up to 3 parts:

| Part | Example | Explanation |
|---|---|---|
| prefix | xn-- | The prefix always consists of this character string. It indicates that the label is in Punycode format. For this reason, many registries have disallowed the registration of ordinary domain names that begin with this character string (which would probably be of little use in everyday life anyway). |
| root | mller | These are all characters of the label that remain after deleting the special characters. If no conventional characters are used in the original name, this part remains empty, as shown in the example above. |
| encoding | -kva | The encoding defines which special characters occur at which position of the original name. It is based on a very complex formula. If the root is empty, even the “-” character that usually separates root and encoding is omitted. |
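The three parts can be separated mechanically, because the last hyphen of a label marks the boundary between root and encoding. This is a minimal sketch with hypothetical names; unlike the table, it returns the encoding without the separating hyphen.

```python
from typing import NamedTuple, Optional


class AceParts(NamedTuple):
    prefix: str
    root: str
    encoding: str


def split_ace_label(label: str) -> Optional[AceParts]:
    """Split a Punycode label into prefix, root and encoding."""
    if not label.lower().startswith("xn--"):
        return None  # not a Punycode label
    rest = label[4:]
    # Without a hyphen, rpartition yields an empty root, which matches
    # the case where no conventional characters are present.
    root, _, encoding = rest.rpartition("-")
    return AceParts(label[:4], root, encoding)


print(split_ace_label("xn--mller-kva"))
```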
Die grünen Äpfel = the green apples
The technical rules of the domain name system specify that each label
can be at most 63 characters long. In practice, this restriction
has rarely mattered so far.
To avoid exceeding the limit of 63 characters when choosing a domain name,
you should keep in mind that the Punycode form of the name is
usually longer than the original. Depending on the number of special
characters used, the conversion can easily double the length of the name.
The resulting length is not easy to predict, as the following examples show. Even if the
original names are of the same length, different Punycode lengths may result:
| Original | Length | Punycode | Length |
|---|---|---|---|
| mücke | 5 | xn--mcke-0ra | 12 |
| äpfel | 5 | xn--pfel-koa | 12 |
| ölscheich | 9 | xn--lscheich-m4a | 16 |
| übeltäter | 9 | xn--beltter-8wa5s | 17 |
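Since the converted length is hard to predict, it is easiest to compute it. The sketch below reproduces the lengths from the table using Python's raw `punycode` codec and checks them against the 63-character limit; the function names are our own, and lowercase input is assumed.

```python
MAX_LABEL_LENGTH = 63  # maximum length of a single label


def ace_length(label: str) -> int:
    """Length of one label after Punycode conversion, prefix included."""
    if label.isascii():
        return len(label)
    return len("xn--") + len(label.encode("punycode"))


def fits_limit(label: str) -> bool:
    """Return True if the converted label stays within the limit."""
    return ace_length(label) <= MAX_LABEL_LENGTH


for name in ("mücke", "äpfel", "ölscheich", "übeltäter"):
    print(f"{name}: {len(name)} -> {ace_length(name)}")
```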
Strange characters
The letter “ß” plays a special role in the conversion.
It does not lead to a name that starts with the prefix
“xn--”. Instead, an “ß” is converted into a double s, “ss”.
Have a look at the example in the table above.
The reason is that, from a linguistic point of view, “ß” is not a special
character but a ligature. Ligatures are characters that were created by
merging two other characters. “ß” is composed of “s” and “z”. Less well-known
ligatures, rarely used today, are “fi” and “ffi”.
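The special treatment of “ß” can be observed with Python's built-in `idna` codec, which implements the original IDNA rules described here. The name “straße.info” is a made-up example, and other libraries or newer rule sets may treat “ß” differently.

```python
def to_ascii(domain: str) -> str:
    """Convert a domain name to its ASCII form using the built-in codec."""
    # The name preparation step replaces "ß" with "ss" before any Punycode
    # encoding takes place, so no "xn--" prefix is produced for this name.
    return domain.encode("idna").decode("ascii")


print(to_ascii("straße.info"))  # strasse.info
```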
Some fruit are closely connected: Bundles
Besides the conversion of the names, another problem has to be solved.
Different spellings of a word can mean the same thing.
If, for example, a plain character like “e” is treated as a variant of “è”, then in French
the domain name for the Swiss city of Geneva can be written in two different
ways:
geneve.ch
genève.ch
That is why the respective registry has to determine how to handle
this kind of variant. In principle, there are the following possibilities:
- A registrant who has registered one variant also holds all other variants at the same time.
- If a registrant has registered one variant, all other variants are reserved for that registrant, who nevertheless has to pay individually for each of the other variants in order to actually use them.
- Only the variant exactly as registered is owned by the registrant. Other variants can be registered by different registrants.
The automatic registration of further variants when registering one variant is
called a “bundle”.
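How a bundle could be derived from a registered name can be sketched as follows. The variant groups are purely hypothetical and contain only the “e”/“è” pair from the example; real registries define their own tables, so this is an illustration rather than any registry's actual procedure.

```python
from itertools import product

# Hypothetical variant groups: all characters within one string are
# treated as interchangeable spellings.
VARIANT_GROUPS = ["eè"]

# Look-up table from each character to its complete group.
GROUP_OF = {ch: group for group in VARIANT_GROUPS for ch in group}


def bundle(label: str) -> set:
    """Enumerate every spelling of a label under the variant groups."""
    choices = [GROUP_OF.get(ch, ch) for ch in label]
    return {"".join(combo) for combo in product(*choices)}


variants = bundle("geneve")
print(len(variants), "genève" in variants)
```

Because every “e” can be spelled in two ways, the number of variants grows quickly with the length of the name, which is one reason why registries have to decide on a policy.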
Since the composition of a bundle strongly depends on the language,
international rules require that the language always be
specified when registering Internationalized Domain Names.
There will be no bundles for .de domains, however. For that reason,
the language field will automatically be set to “ger” or “de”, respectively.
For domains from the so-called CJK area (China, Japan, Korea), managing bundles
is a rather complex subject because of the large number of characters that are
used simultaneously yet differently in the different languages, and it regularly
leads to legal conflicts.