Experience with new rating system [#740]
18.104.22.168: Experience with new rating system
(2006-11-12 09:08) [#2382]
As "sum", from roughly mid-October to mid-November I made a serious attempt to promote from 4d to 5d. I played ca. 10 games per day on average, with an estimated winning percentage of ca. 75%. The effect has been a rating increase of ca. 15% of one rank in the upper 4d range. On one day, I played 11:2 and my rating changed as you would expect for playing 2:11. On another day, I played 8:4 and my rating changed as you would expect for playing 4:8. On the remaining days, my rating changes looked fairly normal for the winning ratios I achieved. My games were mostly against KGS regulars.
What does this mean? At this pace it takes ca. 7 months to gain one rank in the middle or upper dan ranks. Besides, the rating system is flawed, or it would not sometimes produce rating changes apparently contrary to one's win-loss ratio.
This means that the rating system is seriously flawed. 1) It ought never to show exceptional, unpredictable behaviour contrary to every intuition. 2) The rating changes are still far too slow. One cannot know one's true rating unless a) one plays several games every day for literally years or b) one enters the server under a new user name, because ratings change much faster for players with only a few rated games.
22.214.171.124: Re: Experience with new rating system
(2006-11-23 10:14) [#2441]
Now I have more precise statistics. Where necessary, I have interpreted the ratings graph of "sum" by visual perception using a graphics editor's ruler. The numbers of games are counted. Here are the results: From October 25 2006 to November 22 including, i.e. during 29 days, I have played on 22 days more than a tiny or zero number of games. On these active days, I have played almost 18 games per day. My start rating was ca. 4,757d, my end rating ca. 5,000d. This equals an improvement of ca. 0,243 ranks. I played 392 rated games altogether. Of these 5 were jigos. This leaves 387 rated games with wins or losses. Of them there are 262 wins to 125 losses. This gives 67,7% wins to 32,3% losses. My opponents were between 3d and 7d. Almost all games against 4d were even, i.e. mostly 7 komi under New Zealand Rules, 7.5 under Chinese, 6.5 under Japanese. My games against 3d were mostly played on even, i.e. with komi. My games against 5d-7d were mostly played with handicap. A tiny number of games were ended by internet trouble and / or escaping; this may be ignored. More relevant is the handicap choice in my favour, i.e. the 67,7% wins is not as impressive as it might appear at first.
However, having to play 392 rated games to climb the top 0,243 ranks of 4d is clearly far too many. This is possible only over a long period of time or if you play on KGS all day and night (unless you win a few titles in real-world tournaments in between...). Having to play so much every day was a severe disadvantage because of fatigue. On a number of days, I would start 6:0 to 9:0 in the morning and then begin losing during the afternoon.
To conclude, the rating system's parameters ought to be changed a) to allow much faster rating changes, b) to weigh the history of old games not at all or at least much less, and c) to avoid the frequent players' unfair disadvantage of having played more old games than infrequent players and thus having to play more new games to achieve the same rating improvement.
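For reference, the figures quoted in this post can be rechecked with a few lines of arithmetic. This is a minimal sketch in Python that uses only the numbers stated above; the variable names are illustrative.

```python
# Numbers quoted in the post above (decimal commas written as points).
rated_games = 392
jigos = 5
wins = 262
losses = 125
active_days = 22
start_rating = 4.757
end_rating = 5.000

decided_games = rated_games - jigos            # games with a winner
win_pct = 100.0 * wins / decided_games         # share of decided games won
games_per_active_day = rated_games / active_days
rating_gain = end_rating - start_rating        # in ranks

print("decided games:", decided_games)
print("win percentage:", round(win_pct, 1))
print("games per active day:", round(games_per_active_day, 1))
print("rating gain in ranks:", round(rating_gain, 3))
```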
(2006-11-15 09:38) [#2402]
I think you totally misunderstand certain aspects of the rank system. It is working as designed. You are just upset because it isn't working as you want it to.
1) The rank system assumes that players are playing their best over time, and is tuned to properly rank such players. Your account had a lot of games over a long period of time. Suddenly you started making "a serious attempt" to get a new rank, and started winning 75% of your games (you won 50% before). 75% means a little bit more than 1 stone stronger than your rank, and dan players simply do not become more than a stone stronger overnight. You switched from playing casually to taking your games very seriously, so you appear to be a stronger player than you just were, but the rank system is not tuned for such players, so it is taking a long time to get you to the proper rank. I see that not as a weakness of the rank system, but as a massive oversight in the way that you are evaluating the rank system.
2) You get upset when you win a certain number of games, then the next day your rank isn't adjusted appropriately. The KGS system makes no attempt to "reward" you for good play. It attempts to constantly evaluate your performance over the past 6 months and find a rank for you that fits this history. A single day's results matter very little here.
There are other issues I have, but these are the main ones. You can call the goals of the KGS rank system flawed, but from the data you have here, the implementation cannot be called flawed, only the testing methods.
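The claim in point 1 that 75% corresponds to "a little bit more than 1 stone" can be checked under an assumed logistic win model. The sketch below is not taken from the KGS source: the form of the model and the constant k = 0.8 (the value yoyoma quotes elsewhere in this thread) are assumptions, and the function names are made up for illustration.

```python
import math

K = 0.8  # assumed constant; the value quoted by yoyoma elsewhere in this thread


def win_probability(rank_diff, k=K):
    """Assumed logistic model: chance to win an even game when
    `rank_diff` stones stronger than the opponent."""
    return 1.0 / (1.0 + math.exp(-k * rank_diff))


def rank_diff_for(win_rate, k=K):
    """Inverse of win_probability: the strength difference in stones
    implied by a long-run win rate in even games."""
    return math.log(win_rate / (1.0 - win_rate)) / k


print("win rate when 1 stone stronger:", round(win_probability(1.0), 2))
print("stones implied by 75% wins:", round(rank_diff_for(0.75), 2))
```

Under these assumptions one stone corresponds to roughly 69% wins, so 75% implies somewhat more than one stone.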
126.96.36.199: Re: Response
(2006-11-16 00:06) [#2406]
Any rating system should behave in a way that can be understood intuitively at all times. A rating system that reevaluates old games escapes intuition because a player does not have a long-term overview of all old game results and exact opponent strengths. The KGS rating system may be working as designed - but players cannot judge the quality of the ratings it calculates because they cannot perceive enough information about the relation between game results and rating changes. Indeed, the rating system is thus not working as I want. As a player, I want to be able to understand intuitively why a rating changes by roughly a particular amount. Every player should have this desire. Otherwise the rating system is nothing but a black box that is to be trusted by definition. A rating system is not good because its maths makes it look like a rating system in theory - it can be good because it produces rating changes of which every observer would say: This is what I expected! Ratings must be predictable on the surface, not just deterministic in the hidden calculations.
A rating system should not have the assumption that players are playing their best over time because not playing best for some time and then playing best is, from a neutral point of view, nothing else than a quick strength improvement. A rating system must be able to deal well with sudden strength improvements because it sometimes does happen that players become stronger suddenly and rapidly.
The previous day's results matter quite a lot: On most days, the rating change correlates strongly with the previous day's winning percentage.
188.8.131.52: Re: Response
(2006-11-18 00:58) [#2421]
I like the KGS rating system, I understand how it operates, and it reflects the changes in my strength. The new version also allows me to play handicap games, which is great.
Rapid changes in strength at 5d are abnormal. The EGF system also changes ratings more slowly around this rank than at 15k. 19 games out of several hundred carry little weight on KGS; what's strange about that?
: Re: Response
(2007-07-02 14:14) [#3515]
I'm with sum on this subject. Intuitive understanding of a rating system is important. How could you justify that my strength a couple of months ago has anything to do with my strength NOW? Of course the number of games won over a couple of days counts. If you think it doesn't, then I can only infer that you want to make each person's rating more reliable. Yet at the same time you don't trust anyone's current rating, because a 4d winning 80% over 2-3 days supposedly doesn't mean he's any stronger than his opponents. How does that work? Faster promotion has a good side: it encourages people. If the system is really reliable, then when a 4d gets promoted to 5d and he's not really 5d, he'll start losing games and be demoted back to 4d pretty soon. People will get a real feel for their strength. And it stops people from signing up for multiple accounts, which I can only see as a benefit.
: Re: Response
(2006-11-22 03:23) [#2438]
I think you totally misunderstand certain aspects of the rank system. It is working as designed. You are just upset because it isn't working as you want it to.
I also cannot quite follow sum's "a rating system should be understood intuitively all the time" argument. But I think what he means is that a lot of wins should at least have a visible effect. If the effect comes only in the far future, it's difficult to keep up the motivation for good play, or for playing at all.
And if you can win 11 Games in a row as a 4,95 Dan and don't promote that's just hilarious.
75% means a little bit more than 1 stone stronger than your rank, and dan players simply do not become more than a stone stronger overnight. You switched from playing casually to taking your games very seriously, so you appear to be a stronger player than you just were, but the rank system is not tuned for such players, so it is taking a long time to get you to the proper rank.
We are not talking about one stone difference, we talk about 0,05 Stones and I cannot see why it should be that difficult to take that hurdle.
You rescaled the rank system, which was quite a good idea. But maybe you should also reconsider the other parameters, since this change probably had hidden implications.
At least it would be great if you can make the rank system more transparent. As I understand it, even the calculations of yoyoma are only guesses on hidden parameters.
184.108.40.206: ((no subject))
(2006-11-15 18:55) [#2405]
sum, I did some math a long time ago regarding this at: http://senseis.xmp.net/?KGSRatingMath#toc4
Those tables were done with the old constants of k=0.8 and halflife=45. I think these are the constants being used now in KGS3 also (I emailed kgs admin, got a reply, I'm not quite sure if there is a "minimum probability of win" factor anymore?). So in short, if you suddenly go from winning 50% of even games to 69%, and keep playing games at the same rate, it will take 45 days to promote to a weak 5d.
Your estimation of 7 months assumes the increase would be linear, but in fact it's not linear due to the exponential decay of the weighting of old games.
Please note that half of that page I pointed to still has the previous constants, sometime I might update the page to reflect the current status.
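To illustrate why the rise is not linear, here is a rough sketch of how much of the total weight sits on recent games under exponential decay. It assumes the weight of a game halves every 45 days (the halflife quoted above) and that exactly one game is played per day over a long history; both the functional form and the helper names are assumptions for illustration, not the KGS implementation.

```python
HALFLIFE_DAYS = 45  # halflife quoted in the post above


def weight(age_days, halflife=HALFLIFE_DAYS):
    """Assumed exponential decay: the weight halves every `halflife` days."""
    return 0.5 ** (age_days / halflife)


def recent_share(days_since_change, history_days=720, halflife=HALFLIFE_DAYS):
    """Share of the total weight held by games from the last
    `days_since_change` days, assuming one game per day."""
    total = sum(weight(age, halflife) for age in range(history_days))
    recent = sum(weight(age, halflife) for age in range(days_since_change))
    return recent / total


for days in (15, 45, 90, 180):
    print(days, "days:", round(recent_share(days), 2))
```

In this simplified setting the games played since a sudden change in strength hold about half of the total weight after one halflife, and the share keeps growing more slowly after that.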
220.127.116.11: Rank vs. Rating
(2006-11-16 11:58) [#2409]
I think you have to differentiate between ranking and rating systems.
A rating system assigns a number to each player, adding or subtracting points after each game, possibly depending on the number of the opponent. The most common types of rating are some derivative of ELO, e.g. the EGF rating list.
A ranking system assigns a rank to each player, computing it anew whenever the need arises, from the results against and ranks of the respective opponents. This is what players do when they assign ranks to themselves based on their results. This is done in a more defined way by the KGS ranking system.
I find the mathematical concept of the KGS ranking system and the notion that "you get a rank assigned that maximizes the probability of your results" quite graspable.
What sum seems to want is either a rating system or a drastically reduced halflife of the games' weights in the KGS ranking system.
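The notion of "a rank that maximizes the probability of your results" can be sketched for a single player as follows. This is a simplified illustration, not the KGS algorithm: it assumes a logistic win model with k = 0.8, keeps the opponents' ranks fixed (the real system rebalances all players together), and uses a made-up continuous rank scale and function names.

```python
import math

K = 0.8  # assumed constant of the logistic win model


def log_likelihood(rank, games, k=K):
    """games: list of (opponent_rank, won, weight) tuples."""
    total = 0.0
    for opponent_rank, won, game_weight in games:
        p_win = 1.0 / (1.0 + math.exp(-k * (rank - opponent_rank)))
        total += game_weight * math.log(p_win if won else 1.0 - p_win)
    return total


def best_rank(games, lo=-10.0, hi=15.0, steps=100):
    """Ternary search for the rank with the highest log-likelihood.
    The log-likelihood is concave in `rank`, so this converges; if the
    results are all wins or all losses it runs to the search boundary."""
    for _ in range(steps):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if log_likelihood(a, games) < log_likelihood(b, games):
            lo = a
        else:
            hi = b
    return (lo + hi) / 2.0


# Hypothetical history: 69 wins and 31 losses against opponents at 4.5,
# all games weighted equally.
history = [(4.5, True, 1.0)] * 69 + [(4.5, False, 1.0)] * 31
print("most likely rank:", round(best_rank(history), 2))
```

Lowering the weights of old games in `history` is what a shorter halflife would amount to in this picture.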
18.104.22.168: Re: Rank vs. Rating
(2006-11-16 12:19) [#2410]
KGS uses a rating system and derives ranks from it. This is useful in principle.
A significantly reduced halflife would be a great improvement; no halflife at all is what I want, i.e. only the newly played games are taken into account for new rating changes. (This also avoids the side-effect of being affected seriously and unpredictably by inverse sandbaggers.)
However, first of all I want a rating system that has an intuitively predictable behaviour so that every player can immediately appreciate the quality of the system.
"you get a rank assigned that maximizes the probability of your results" is quite graspable in theory, but cannot be understood for a particular player at a particular time just by one's intuition.
22.214.171.124: Re: Rank vs. Rating
(2006-11-17 15:27) [#2417]
No. There is no "hidden rating" from which ranks are derived. Every game newly entered into the system affects the ranks of all players since the ranks are continuously balanced to maximize the probability of the results. The ranks have an internal numerical representation for computational ease, different from the displayed rank, but that is irrelevant for the notion of "rating" or "ranking" by my definition.
Reducing the halflife to zero is nonsense of course, because then your rank would switch back to [?] within the second you played each game. Reducing the halflife to 1 day (as an example) results in your rank becoming [xx?], then [?] again within days or few weeks of not playing rated games.
: Re: Rank vs. Rating
(2006-11-17 18:38) [#2418]
Actually, that's neither here nor there. If you want to, you can keep uncertainty statistics (and optionally display them as a [?] or in some other form) and still weight only the most recent game. With a zero half-life and no other modifications your rating would jump around a lot, which annoys a lot of players. The current IGS system solves this by having different promotion/demotion thresholds. It's more intuitive than the KGS approach.
The number of wins in a row to promote has always seemed high to me. How many games in a row would you lose to a club player without offering to alter the handicap? Probably less than 15.
: Re: Rank vs. Rating
(2006-11-17 20:37) [#2419]
15 wins in a row only applies if you played 1 game a day for the last 180 days. For most people who don't play so much it will be less. I think it should be linear, although I didn't actually do the math... (if you played a game every other day, you probably need 7.5 wins in a row to go from 2.5d to 3.0d).
When Robert suggested zero half-life, he probably meant some system that takes your current rating and the result of the game you just played (result, handicap/komi, opponent's rating vs your rating) and calculates a new rating from this. Like Elo or Glicko for example.
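For comparison, an incremental update of the kind described here might look like the following. This is a sketch of the classic Elo formula on its usual 400-point scale, not the EGF formula or any Go server's implementation; the K-factor of 24 is an arbitrary illustrative value, and handicap and komi are not modelled.

```python
def expected_score(rating, opponent_rating):
    """Classic Elo expectation on the 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def elo_update(rating, opponent_rating, score, k_factor=24.0):
    """New rating after one game; `score` is 1 for a win, 0.5 for a
    jigo and 0 for a loss. `k_factor` is an illustrative assumption."""
    return rating + k_factor * (score - expected_score(rating, opponent_rating))


rating = 2400.0
for opponent, score in [(2400.0, 1), (2450.0, 1), (2350.0, 0)]:
    rating = elo_update(rating, opponent, score)
    print(round(rating, 1))
```

Each update depends only on the current ratings and the latest result, which is why the size of the change can be known before the game is played.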
126.96.36.199: Re: Rank vs. Rating
(2006-11-19 14:57) [#2428]
If you understand how it operates, then tell the non-mathematicians among us how to verify the quality of a particular player's rating at a particular time.
Every rating system can be made to include handicaps.
It is immaterial if some players' behaviour is rare. If the system does not model also them well, then the system is a bad model. Each player has the same right to be modelled well. Otherwise the system favours players that happen to behave as the system wants.
As to the "19 games", a player's strength changes over time; his new games show his new strength - not his (rather) old games.
: ELO vs. KGS
(2006-11-18 00:22) [#2420]
Robert, as I understand you, you're basically asking for something like ELO (rather than KGS with 0 halflife). I'd regard the rating system used by KGS as the more advanced one, even though it comes at the cost of less intuitive predictability. Both ELO and the KGS system adapt to the probabilities of the actual results. Unlike ELO though, KGS also takes their varying significance into account.
With ELO, a win against opponent C, whose rating is quite uncertain (say, C barely got a solid rank), has the same influence on your rating as a win against opponent B, who has been playing consistently at that level for years (and therefore has a sharply defined rating).
KGS weights such results accordingly: the game with B tells more about your strength than the one against C, so it will affect your rating more. A few weeks and a bunch of games later, C may e.g. turn out to be stronger than what was believed at the time you played each other. KGS ratings are steadily recalculated as new data arrive, hence in such a case, your win against C will get additional weight. The same holds for downward corrections. ELO points, by contrast, couldn't make use of this information at all. They are so nicely understandable because they never adjust - even when they're far off the actual data.
In my view, the main purpose of a rating system is to allow most well balanced games, whilst the "status symbol" aspect of ranks is secondary. Of course, that is a matter of taste. I'd agree though, considering your recent win ratio, the currently used KGS parameters feel rather inert.
Anyway, if you're interested, take a glance at the KGSRatingMath page cited by yoyoma above, where you also can find a section on "the math behind, made easy".
188.8.131.52: Re: ELO vs. KGS
(2006-11-19 14:49) [#2427]
ELO is a system without halflife, but I do not ask for ELO specifically, although ELO is already much better on the intuition / predictability side. Derivatives of, e.g., ELO are a possibility though. You mention shortcomings of ELO, but it is not necessary to keep them in an ELO derivative. E.g., the EGF rating system is an ELO derivative and more sophisticated than pure ELO. For shortcomings, see my old articles on rec.games.go. However, it is much easier to improve ELO derivatives than maximum likelihood derivatives because the latter do not even meet one of the basic requirements, intuition, which all rating systems should fulfil.
In which sense do you consider the KGS rating system "more advanced"? Maybe more advanced than pure ELO, but how do you compare it to ELO derivatives? Clearly the KGS rating system is not advanced enough to allow an intuitive understanding of a particular player's rating at a particular time, or an easy way for non-mathematicians to verify its quality.
184.108.40.206: Re: ELO vs. KGS
(2006-11-20 12:27) [#2432]
So, returning to my definitions, you want a rating system where a ranking system is implemented, based solely on your lack of intuitive understanding for the cause-effect relation under the current system.
I assumed a rather profound mathematical background on your part, so I conclude that you are not arguing for yourself but for a perceived public that is in the projected intellectual situation.
I think that "intuitive" is not a good criterion, because it depends on the intellectual level of the subject. For any mathematician or natural scientist, the "maximum likelihood" concept is "intuitively" clear. For the rest, all systems that are in any form mathematically "advanced" will remain a mystery. For these, many aspects of modern life are already a black box.
Having the ranking system of KGS be a black box for the not so scientifically inclined part of the users is not a problem in itself. It is sufficient to let them know that they are constantly compared to the current level of their former opponents, so that they can see why not every game they play has the same effect and that their rank will change a bit even if they don't play.
The ranking system of KGS is quite unique in that it actually tries to depict what the ranks should be. It would be a great loss to abolish this just because someone doesn't understand it.
I think it would be quite interesting to make a KGS-system group of games only between bots. The bots' strength does not change, so theoretically, their rank should stabilize at quite a distinct value after a suitable number of games.
: Re: ELO vs. KGS
(2006-11-20 14:59) [#2433]
Just to clarify: 220.127.116.11 isn't me. I'd appreciate it if the author of the above reply indicated some identity beyond the mere IP.
Concerning the relevance of intuitive understanding, I tend to disagree. I do see every obfuscation as a loss, but in my view, this one is outweighed by its gains.
18.104.22.168: Re: ELO vs. KGS
(2006-11-20 15:21) [#2434]
A ranking system is not necessary; it is just a convenient abbreviation for being in a certain range of ratings. If there are no ranks, then instead of "it takes time to achieve 5d", one would have to say: "it takes time to achieve the rating 2400 (or whatever)".
Not "based solely on" something. A rating system should be based on several aims.
For your reference, I studied some mathematics at university.
I agree that a rating system's criteria should not depend on varying intellectual levels of observers. In this sense, "intuitive" is indeed not a good criterion. However, this simply means that, for the purpose of integration into a rating system, "intuitive" aka "easily predictable" must be defined. For a start, it should contain aspects like "more wins than losses of newly played games means a rating increment", "the rating shift for winning/losing a game is public before, during, and after each game", etc.
Let me repeat: I have not applied "intuitive" to the concept "maximum likelihood" but to each particular player's rating (change) at a particular time. For the public, "maximum likelihood" is immaterial while the ratings themselves matter. Likewise, assumed probabilities are immaterial for them while they would want to know a game's possible rating effect.
A black box system is not good just because "there are many in the world". A system is good if it allows everybody to judge its quality when it comes to its application to each particular player's rating (change) at a particular time. Besides, there should be a theoretical, general study of the quality of rating changes for an arbitrary particular player.
It is absolutely insufficient to let the public know that they are constantly compared to the current level of their former opponents. The public has a right to be given the possibility of rather easily judging the quality of each particular player's rating (change) at a particular time. The public does not need to understand the programming internals of a rating system - but the public ought not to depend on pure trust in the rating system designers / managers regarding the quality of the ratings.
In what way is the ranking system of KGS "quite unique in that it actually tries to depict what the ranks should be"? Many rating systems that have a derived ranking scale do this.
22.214.171.124: Judging the quality of rating or ranking systems
(2006-11-20 21:35) [#2435]
Warning: lengthy text. You may want to skip to '---' for a new proposal.
In order to judge the quality of rating or ranking systems, we first need some sort of definition of this "quality".
What is the quality of a rating or ranking system?
Basically, it is the difference between the "true" strength of a player and the strength the system finds for that player. Of course, we have to add all those differences for every player in the system. Mathematically speaking, a good formula for this might be the sum of all squared errors, divided by the number of players in the system.
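Written out, the quality measure suggested here is a mean squared error. The sketch below is only an illustration of that formula; in practice the "true" strengths are unknown, so the example values are invented and the function name is made up.

```python
def mean_squared_error(true_strengths, estimated_strengths):
    """Sum of squared differences between true and estimated strength,
    divided by the number of players."""
    if len(true_strengths) != len(estimated_strengths):
        raise ValueError("both lists must cover the same players")
    if not true_strengths:
        raise ValueError("at least one player is required")
    squared = [(t - e) ** 2 for t, e in zip(true_strengths, estimated_strengths)]
    return sum(squared) / len(squared)


# Invented example: three players, strengths in stones on one scale.
true_strengths = [4.0, 5.0, 2.5]
estimated = [4.2, 4.9, 2.5]
print(mean_squared_error(true_strengths, estimated))
```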
Let's look at two different systems: the EGF rating ( http://gemma.ujf.cas.cz/~cieply/GO/gor.html), and the KGS ranking ( http://www.gokgs.com/help/math.html). I want to use the EGF rating in its simplest form first, omitting any rating-dependency of the variables ('con' and 'a') and 'e'.
The simplest scenario would be an unchanging player base, i.e. a player base where no new players are added, all players play the same amount of games over a given period, none drops out, and all players do not change in strength (like bots).
When the EGF rating system is used on this scenario, it is quite conceivable that the players' ratings will approach their true strength, oscillating around it after equilibrium is reached. The amplitude of this oscillation depends on the average rating change for a single game, the oscillation being smaller the smaller the change is. In the region of 2100 EGF rating, this change is about 12 points. The average error during the oscillation can be deduced from this and is somewhere between 10 and 20 points (my estimate). 100 EGF rating points correspond to 1 stone strength difference, so the average statistical error is around 0.1 to 0.2 stones of strength.
When the KGS ranking system is used on this scenario, then the players' ranks will approach their true strength, oscillating around it after equilibrium is reached. The amplitude of this oscillation depends on the relation of the average time between games 'tg' to the games' weights' halflife 'hl'. If tg is much smaller than hl, then the oscillation will be very small. Take the rank graph of a KGS user who plays 1 rated game per day to get a picture of this.
We now change the scenario in such a way that exactly one player suddenly becomes exactly one stone stronger.
Then the EGF rating of this player will start to rise until it reaches an average value a bit less than 100 points above his former average rating. The other players' rating will at the same time decrease by a tiny bit. The time until the new equilibrium is reached also depends on the average rating change per game, being longer the smaller the change is (approximately proportional to the difference of new and old strength divided by the average rating change).
The KGS ranking of this player will start to rise until it reaches an average value a bit less than 1 rank above his former rank. The other players' ranks will at the same time decrease a tiny bit. The time until the new equilibrium is reached also depends on the relation of the average time between games 'tg' to the games' weights' halflife 'hl'. If hl is much longer than tg, then the change will take a longer time.
We see that both systems have a factor influencing both the average oscillation error and the time it takes to reflect changes. This factor has to be fine tuned to get the best results out of a given realistic scenario.
The KGS ranking system is directly based on what it tries to depict. Strength means the ability to win games. Strength differences (handicaps included) result in certain winning percentages. By measuring the percentage, the strength differences are calculated, thus giving a complete feedback. This value is always the best guess the system can make, the maximum probability being at exactly this value.
In contrast, the EGF rating is based on arbitrary numbers. Differences in strength result in numbers being subtracted or added and the statistical effect of a sufficiently large number of games is supposed to give a good value on average. The most probable representation of the true strength differs from the current rating by the standard deviation resulting from the rating oscillation.
The problem with the KGS ranking system is that the time to reflect a change also depends on the average time between games, or put differently, on the number of games you have played. You can't play 10 games per day first, then become stronger and expect the rank to reflect your true strength immediately while you only play 2 games per day. You have "nailed" your previous strength into the statistics. The EGF rating change does not depend on the number of games you played before.
I think that currently KGS is working as expected, but I can not dismiss Robert's complaint. I think a solution might be to reduce the games' weights 'W' not only by time but also by the number of games 'g' played afterwards.
The current formula would be something like W = W0 * (45/(45+age)). My proposed formula would be something like W = W0 * (55/(55+age)) * (10/(10+g)).
The numbers require tweaking, of course.
The KGS math page only states that the weight depends on how old the game is. I think the above is not implemented, but I could be wrong.
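The two weight formulas from this post can be compared directly. The sketch below simply transcribes them as written above ("something like"), so neither should be read as the actual KGS implementation; the function names and the example figures are illustrative.

```python
def current_weight(age_days, w0=1.0):
    """Weight by age only, as guessed in the post: W0 * 45 / (45 + age)."""
    return w0 * 45.0 / (45.0 + age_days)


def proposed_weight(age_days, games_since, w0=1.0):
    """Proposed weight: W0 * 55 / (55 + age) * 10 / (10 + g), where g is
    the number of games the player has played since this game."""
    return w0 * (55.0 / (55.0 + age_days)) * (10.0 / (10.0 + games_since))


# A 30-day-old game for a player who has since played 2 or 10 games per day.
age = 30
print("age only:", round(current_weight(age), 3))
for games_per_day in (2, 10):
    games_since = games_per_day * age
    print(games_per_day, "games/day:", round(proposed_weight(age, games_since), 3))
```

Under the proposal an old game fades faster for a frequent player than for an infrequent one, which is the effect asked for above.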
: Re: Judging the quality of rating or ranking systems
(2006-11-25 02:42) [#2443]
Time has the huge advantage of synchronicity. Unless you introduced asymmetric weights (to which I'd strongly object), a weight decay based on the number of games played since could have surprising side effects.
Whenever there happens to be an accidental correlation amongst your opponents between "defeated/lost to" and "frequently/rarely playing", your rating might change significantly, for no apparent reason.
In my view, a moderately shortened halflife time would fix the flaws complained about here quite well.
(2006-11-22 13:24) [#2439]
Sorry, I always forget to login. 81.173.x.x is me.