UTF-8

See also http://en.wikipedia.org/wiki/UTF-8

UTF-8 is described at http://www.unicode.org/ - there is a PDF file at http://www.unicode.org/versions/bookmarks.html that will tell you all you need to know about it.

UTF-8 is probably the best overall character encoding system around these days, because it combines these features:

  • Fully backwards compatible with ASCII - when you see a byte in the 0..127 range, it always represents the equivalent ASCII character.
  • Can represent any character needed for any major language (and a lot of minor languages and symbol sets).
  • No need to pick your character encoding based on language.
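
The ASCII compatibility described in the list above can be checked with a small sketch. It assumes Python 3 and its built-in codecs; the sample strings and variable names are made up for illustration.

```python
# Illustrative check of UTF-8's ASCII compatibility.
# The sample strings are assumptions chosen for this sketch.
ascii_text = "B[pd];W[dp]"
# Two CJK ideographs, some ASCII, and an accented Latin letter.
mixed_text = "\u56f2\u7881 (Go) caf\u00e9"

# Pure ASCII text encodes to exactly the same bytes in ASCII and in UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# In UTF-8, every byte below 128 is an ASCII character standing for itself;
# all bytes that belong to multi-byte characters are 128 or above.
encoded = mixed_text.encode("utf-8")
low_bytes = bytes(b for b in encoded if b < 128).decode("ascii")
ascii_chars = "".join(ch for ch in mixed_text if ord(ch) < 128)

print(same_bytes)
print(low_bytes == ascii_chars)
```

Both checks should print True: picking out the bytes below 128 recovers exactly the ASCII characters of the text, in order.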

It does have drawbacks, but if you need the ability to represent a large number of languages in a stream, it is quite good.

The primary drawback of UTF-8, as opposed to fixed-length character systems, is that a text with many non-Latin characters can take up more space in UTF-8 than in, say, UCS-2. UTF-8 uses between one and four bytes for each character (the original design allowed up to six), as opposed to the fixed two bytes per character in UCS-2. However, UCS-2 (what most people think of when they hear "Unicode") isn't backward compatible with ASCII. -- BenjaminGeiger

This is not correct. UTF-16, which has superseded UCS-2, takes either two or four bytes to represent a character, so it is not fixed width either, and whether UTF-16 or UTF-8 is "shorter" depends entirely on the characters used: several groups of non-Latin characters take two bytes per character in both, all ASCII is shorter in UTF-8, and some non-ASCII characters are shorter in UTF-16. Personally, I think UTF-16 is pretty useless. Originally it was supposed to be two bytes per character, so it was going to be simpler in some ways than UTF-8, but that didn't work out (too many characters), so now it has all the complexity, none of the backward-compatibility advantages, and only sometimes saves a few bytes in the encoded form. Ugh. If you really need fixed bytes-per-character then UTF-32 is the way to go, but that is almost always bigger overall than a UTF-8 stream and of course has zero backward compatibility.
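
The size comparison above can be tried out with a short sketch. It assumes Python 3 and its built-in codecs; the sample characters and the helper name `encoded_sizes` are chosen here purely for illustration.

```python
# Bytes needed per character in UTF-8, UTF-16 and UTF-32.
# The sample characters are assumptions chosen for this sketch.
samples = {
    "A": "ASCII letter",
    "\u00e9": "accented Latin letter",
    "\u044f": "Cyrillic letter",
    "\u7881": "CJK ideograph",
    "\U0001F600": "character outside the Basic Multilingual Plane",
}

def encoded_sizes(ch):
    # The "-le" codec variants are used so that no byte order mark is counted.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for ch, label in samples.items():
    utf8, utf16, utf32 = encoded_sizes(ch)
    print(f"{label}: UTF-8 {utf8}, UTF-16 {utf16}, UTF-32 {utf32}")
```

With these samples, the ASCII letter is shortest in UTF-8, the accented Latin and Cyrillic letters take two bytes in both UTF-8 and UTF-16, the CJK ideograph is shorter in UTF-16, and the character outside the Basic Multilingual Plane takes four bytes in all three.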

KGS and CGoban 2 both use UTF-8 whenever they write SGF files.



This is a copy of the living page "UTF-8" at Sensei's Library.
(OC) 2005 the Authors, published under the OpenContent License V1.0.