UTF-8
See also http://en.wikipedia.org/wiki/UTF-8
UTF-8 is described at http://www.unicode.org/ - there is a PDF file at
http://www.unicode.org/versions/bookmarks.html that will tell you all you need to know about it.
UTF-8 is probably the best overall character encoding around these days, because it combines these features:
- Fully backwards compatible with ASCII - when you see a byte in the 0..127 range, it always represents the equivalent ASCII character.
- Can represent any character needed for any major language (and a lot of minor languages and symbol sets).
- No need to pick your character encoding based on language.
It does have drawbacks, but if you need the ability to represent a large number of languages in a stream, it is quite good.
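The ASCII compatibility claimed above can be checked with a short Python sketch. The helper names (utf8_bytes, ascii_only) and the sample strings are made up for illustration. The sketch relies on the fact that every byte of a multi-byte UTF-8 sequence is 128 or higher, so any byte in the 0..127 range really is an ASCII character.

```python
def utf8_bytes(text: str) -> bytes:
    """Encode text as UTF-8 (hypothetical helper for this illustration)."""
    return text.encode("utf-8")


def ascii_only(data: bytes) -> str:
    """Keep only the bytes in the 0..127 range and decode them as ASCII."""
    return bytes(b for b in data if b < 128).decode("ascii")


# Pure ASCII text produces exactly the same bytes in ASCII and in UTF-8.
assert utf8_bytes("Go") == "Go".encode("ascii")

# In mixed text, the bytes below 128 are exactly the ASCII characters;
# the two kanji contribute only bytes of 128 and above.
mixed = "Go = 囲碁"
print(ascii_only(utf8_bytes(mixed)))  # prints "Go = " (with a trailing space)
```

This is also why old ASCII-only tools can usually search UTF-8 text for ASCII strings without hitting false matches inside multi-byte characters.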
The primary drawback of UTF-8, as opposed to fixed-length character systems, is that a text with many non-Latin characters will take up much more space in UTF-8 than in, say, UCS-2. UTF-8 can use anywhere between one and four bytes for each character (the original design allowed up to six), as opposed to the fixed two-bytes-per-character ratio in UCS-2. However, UCS-2 (what most people think of when they hear "Unicode") isn't backward compatible with ASCII. -- BenjaminGeiger
This is not correct. UCS-2 has been superseded by UTF-16, which takes either two or four bytes to represent a character, so it is not fixed width either. Whether UTF-16 or UTF-8 is "shorter" depends totally on the characters used: there are several groups of non-Latin characters that are two bytes per character in both, all ASCII is shorter in UTF-8, and some non-ASCII characters are shorter in UTF-16. Personally, I think UTF-16 is pretty useless. Originally it was supposed to be two bytes per character, so it was going to be simpler in some ways than UTF-8, but that didn't work out (too many characters). Now it has all the complexity, none of the backward-compatibility advantages, and only sometimes saves a few bytes in the encoded form. Ugh. If you really need fixed bytes-per-character then UTF-32 is the way to go, but that is almost always bigger overall than a UTF-8 stream and of course has zero backward compatibility with ASCII.
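The size comparison above is easy to check for yourself. Here is a minimal Python sketch. The function name encoded_sizes and the sample characters are assumptions chosen for illustration, and the little-endian codec variants are used only so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes text needs in each encoding.

    The "-le" codecs are used so that Python does not prepend a
    byte order mark, which would inflate the UTF-16/UTF-32 counts.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


samples = [
    "A",            # ASCII: shortest in UTF-8
    "Я",            # Cyrillic: two bytes in both UTF-8 and UTF-16
    "碁",           # CJK: three bytes in UTF-8, two in UTF-16
    "\U0001F600",   # outside the 16-bit range: four bytes in all three
]

for ch in samples:
    print(repr(ch), encoded_sizes(ch))
```

The last sample is the case that makes UTF-16 variable width: a character outside the 16-bit range needs a surrogate pair, i.e. four bytes.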
KGS and CGoban 2 both use UTF-8 whenever they write SGF files.
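If you write SGF files from your own program and want to do the same, a rough Python sketch could look like this. It is not how KGS or CGoban 2 do it internally. The function names, file name and player names are hypothetical, and the sketch assumes the SGF FF[4] root property CA is used to declare the charset. It does not escape "]" or "\" inside property values, which a real writer would have to do.

```python
import os
import tempfile


def write_sgf_utf8(path: str, black: str, white: str) -> None:
    """Write a minimal SGF game record as UTF-8 (illustrative only)."""
    # CA[UTF-8] declares the charset; no escaping of "]" or "\" is done here.
    sgf = f"(;FF[4]GM[1]CA[UTF-8]PB[{black}]PW[{white}])"
    with open(path, "w", encoding="utf-8") as f:
        f.write(sgf)


def read_sgf_utf8(path: str) -> str:
    """Read the file back, decoding it explicitly as UTF-8."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "game.sgf")  # hypothetical file name
        write_sgf_utf8(path, "本因坊", "Honinbo")
        print(read_sgf_utf8(path))
```

Passing encoding="utf-8" explicitly matters, because otherwise Python falls back to a platform-dependent default encoding.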