Somebody put up this page with a question to "Describe UTF8 here". No idea why, but here goes:
That's the default text when someone creates a new page. If someone edits a page that doesn't exist, then saves it, it'll read "Describe PageName here." -- BenjaminGeiger
Thanks.
UTF8 is described at http://www.unicode.org/ - there is a PDF file at
http://www.unicode.org/versions/bookmarks.html that will tell you all you need to know about it.
UTF8 is probably the best overall character encoding system around these days, because it combines these features:
It does have drawbacks, but if you need the ability to represent a large number of languages in a stream, it is quite good.
The primary drawback of UTF-8, as opposed to fixed-length character systems, is that a text with many non-Latin characters will take up much more space in UTF-8 than in, say, UCS-2. UTF-8 can use anywhere between one and four bytes for each character (the original design allowed up to six), as opposed to the fixed two bytes per character in UCS-2. However, UCS-2 (what most people think of when they hear "Unicode") isn't backward compatible with ASCII. -- BenjaminGeiger
This is not correct. UTF-16, which has superseded UCS-2, takes either two or four bytes to represent a character, so it is not fixed width either, and whether UTF-16 or UTF-8 is "shorter" depends entirely on the characters used: several groups of non-Latin characters take two bytes per character in both, all ASCII is shorter in UTF-8, and some non-ASCII characters are shorter in UTF-16. Personally, I think UTF-16 is pretty useless; originally it was supposed to be two bytes per character, so it was going to be simpler in some ways than UTF-8, but that didn't work out (too many characters), so now it has all the complexity, none of the backward compatibility, and only sometimes saves a few bytes in the representation. Ugh. If you really need a fixed number of bytes per character then UTF-32 is the way to go, but that is almost always bigger overall than a UTF-8 stream and of course has zero backward compatibility.
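For anyone who wants to check the size claims rather than take them on faith, here is a minimal Python sketch. The helper name encoded_sizes and the sample strings are my own picks for illustration, and the little-endian codecs are used only so that no byte order mark gets counted.

```python
def encoded_sizes(text):
    """Return the byte count of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" codecs fix the byte order, so Python adds no byte order mark.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# Hypothetical samples: plain ASCII, Greek, Japanese, and a character
# outside the Basic Multilingual Plane.
samples = {
    "ASCII": "go",
    "Greek": "αβ",
    "Japanese": "囲碁",
    "Emoji": "😀",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))

# Backward compatibility: pure ASCII text is byte-for-byte identical in UTF-8.
print("go".encode("utf-8") == "go".encode("ascii"))
```

With these samples, ASCII comes out shorter in UTF-8, Greek ties with UTF-16 at two bytes per character, Japanese is shorter in UTF-16 (two bytes against three), and the emoji takes four bytes in all three encodings.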
KGS and CGoban 2 both use UTF-8 whenever they write SGF files.
(Sebastian:) This page was apparently created by accident. The text "Describe UTF8 here" is the default text for creating a new page. It was apparently created when someone clicked on the SGF code in the referring page. This is another argument for my wish in GuineaPigsFeedback that SL should be able to store non-wiki text files.
I propose we should delete this page and replace it with a link to http://en.wikipedia.org/wiki/UTF-8. Before that, it would be great if the contributors compared that page with this one and if there's any information missing in WikiPedia, either edit their page directly or write it as a comment in this page, and I or someone else will add it to Wikipedia.
wms: Heh, so Sensei's Library has a built-in trolling feature to find people too eager to be helpful, and I took the bait. :-) I think that the Wikipedia page says just about everything that is said here, and then some, so I propose that we get rid of this. And yes, it would be nice to be able to have non-wiki text files embedded in Sensei's.
(Sebastian:) Actually, I have to apologize. I am ashamed to admit that I didn't even look at what you had written - it reminded me too much of work. Now that I have, I find this page more concise than the Wikipedia article, above all because you dare to express your opinion here. Comparing these two articles is interesting when one wants to understand how wiki wiki webs work. Can it be that NPOV has its drawbacks? Is SL's less neutral point of view even an advantage?
I am only unsure what to do with the UTF-8 text. Maybe we should reserve a place for it here, after all?