RobertJasiek: The rating system's global stability means that 1) the ratings of all players considered together do not drift and 2) the distance between every two ranks remains constant. I hope that (2) is ensured already by definition. In principle (1) is or can be affected by a) players leaving the population, b) players entering the population, c) asymmetrical rating changes of every single game. I do not know for sure whether (a) and (b) occur in this rating system, but I think that they do. (c) can have several causes. Currently we are not sure whether it is affected by the confidence weight. (c) would be affected if we changed to "dependency on game number".

Currently there appears to be a rating drift anyway, and this is another problem to be solved, regardless of whether we use (something like) "dependency on game number". So, as far as I understand the rating system now, using it would not create an extra problem of principle, but only possibly increase the urgency of solving the global rating drift problem.

It is fairly easy to counter a global rating drift, as follows: Each game means a preliminary rating change of c1 for player 1 and c2 for player 2. Altogether the preliminary rating change per game is c := c1+c2. For the global change, sum up all games' preliminary changes c, giving the sum C. Next, this C is redistributed over the whole player population. There are several ways to redistribute. One way is to redistribute only to the winners of games if C<0 (because then a positive amount, the negative of C, has to be redistributed) and only to the losers if C>0 (because then a negative amount has to be redistributed), ensuring for each player a rating increment if he wins or a rating decrement if he loses, considered per game.
Such a modifying shift can be redistributed either as an absolute fraction of C for each affected game (affected because it was either won or lost) or as a relative fraction of C: the greater c1 or c2 is (whichever of the two is modified at all), the greater the modifying fraction of C applied to it becomes, e.g., linearly. Doesn't this solve any rating drift problem?
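To make the redistribution idea concrete, here is a minimal Python sketch of the "relative fraction" variant described above. The data layout and the function name are illustrative, not anything KGS actually uses; the only assumption is that each game carries preliminary changes c1 (winner, positive) and c2 (loser, negative).

```python
def redistribute_drift(games):
    """Sketch of the proposed global drift correction.
    games: list of dicts with keys 'winner', 'loser', 'c_winner', 'c_loser',
    where c_winner > 0 and c_loser < 0 are the preliminary rating changes.
    Returns corrective shifts per player that cancel the global drift C."""
    # Total preliminary change over all games; a nonzero C means global drift.
    C = sum(g["c_winner"] + g["c_loser"] for g in games)
    shifts = {}
    if C < 0:
        # Ratings drifted down overall: hand -C (a positive amount) back to
        # the winners, proportionally to their preliminary gains.
        total = sum(g["c_winner"] for g in games)
        for g in games:
            share = (-C) * g["c_winner"] / total if total else 0.0
            shifts[g["winner"]] = shifts.get(g["winner"], 0.0) + share
    elif C > 0:
        # Ratings drifted up overall: take C away from the losers,
        # proportionally to their preliminary losses.
        total = sum(g["c_loser"] for g in games)  # negative
        for g in games:
            share = (-C) * g["c_loser"] / total if total else 0.0
            shifts[g["loser"]] = shifts.get(g["loser"], 0.0) + share
    return shifts
```

By construction the shifts sum to -C, so applying them makes the total rating change over the population exactly zero.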
Robert, regarding rating drift, I think there are a few things this can refer to.
My guess is that when you said "there appears to be a rating drift anyway" you were observing the local drift and the settling from the change from KGS2 to KGS3. I don't think we need anything like you propose for global rating drift correction.
The issue that it takes too many games to change your rating if you played often in the past is a separate concern which I will talk about in a separate post.
- yoyoma
quote:
- Local drift. Since KGS constantly reevaluates your rating even when you don't play games, if your previous opponents collectively improve and increase their rating, your rating will go up as well.
Is this really true? Please explain why the current progress of opponents you played in the PAST affects your rating now at all. It does seem that this is the source of global drift. It is against common sense as well: the opponent I played keeps winning, and that means I improve? Nonsense.
Explanation for my factor: the proposed factor is the simple way to let the weight decay exponentially.
Currently, as far as I know, the following weight factors apply:
My proposal is to include another asymmetric factor:
In order to keep the games' weights from decreasing too fast, the time halflife and the game-number halflife would have to be set higher than if each were applied without the other. A first suggestion could be 90 days and 60 games, but this has to be tweaked. Another possibility is to keep the old time halflife but to take the square root of both factors:
W = W0 * sqrt( H / ( H + age ) ) * conf * sqrt( X / ( X + num ) )
('W' weight, 'W0' default weight, 'H' time halflife, 'age' the game's age, 'conf' opponent's confidence factor, 'X' game-number halflife, 'num' the number of games this player has played since)
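The combined weight above is straightforward to compute. The following sketch uses the tentative values floated in this thread (90 days, 60 games); these are suggestions to be tweaked, not KGS parameters, and the function name is illustrative.

```python
import math

def game_weight(age_days, num_since, conf, w0=1.0, h=90.0, x=60.0):
    """Proposed combined weight W = W0 * sqrt(H/(H+age)) * conf * sqrt(X/(X+num)).
    Taking square roots of both factors means neither decays as fast
    as it would if applied alone."""
    return w0 * math.sqrt(h / (h + age_days)) * conf * math.sqrt(x / (x + num_since))
```

Note that with the square root, the time factor alone reaches one half not at age = H but at age = 3H, which is the point of combining the two factors this way.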
As for Robert's categories of new methods:
I think either 'a' or 'e' would address the problem in a helpful manner. 'e' seems a bit blunt to me but is the simplest way. 'a' could be more elegant but is also more resource-consuming.
RobertJasiek: I have a few questions about your comments: Can you explain your remarks about (b) and (c) in more detail? I do not consider (d) a joke. If the old data can be used (because they go in the same direction as the new data), then they improve the significance. If the old data cannot be used, then the proposal is like (b). Can you elaborate on why and how (e) is blunt and how it compares to (a)? (f) is not merely (c) in other words, because (f) might also use other functions, e.g., any e^(-ax) function for some suitable a>0; other exponential, monotonically decreasing functions are possible as well. OK, (f) and (c) are closely related, or we could simply consider a broader set of possible functions in (c).
Harleqin, about the current effect of rank confidence on ratings, I disagree. As far as I can see, it does *not* constitute asymmetric weighting. Each game result is weighed according to both players' rank confidences, which means, symmetrically.
In the calculation of a player's rating, the factor which stands for that player's own rank confidence can simply be omitted, because it is constant over all results and therefore cancels out.
If your games were always weighted by your own rank confidence, then you could never get away from 0 confidence.
I think that even if this is countered by the lowest possible confidence being greater than 0, the system would show strange behaviour: it has no confidence in your rank, therefore it doesn't believe your results (?!).
I am sure that someone has a real insight into the true workings of KGS and can tell us.
This notwithstanding, I still don't see the looming instability from asymmetric weighting. Can someone either bring up empirical or simulated data, an ab initio calculation, or structured reasoning showing this?
Right, there is a (removable) discontinuity at a rank confidence of zero.
I'm surely not the one you're asking, since I haven't seen the code explicitly myself. For the same reason, I don't know how the algorithm practically gets away from a player's initial zero confidence. Possibilities include
Because at a given point of time, the own weight factor is constant throughout one's record, even the second way doesn't affect the weighting symmetry in the main rating algorithm.
About the additional rules applying to nearly empty records (e. g. the 10? limit) I don't know enough to state anything, though.
A few clarifications:
weight=H/(H+age) is not exponential decay and has no half life. KGS does not use anything like this. Exponential decay means something like weight=e^(k*age), where k is negative.
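The distinction wms draws can be seen numerically. The snippet below compares the rational form H/(H+age) with true exponential decay e^(k*age), choosing k so that the exponential halves every H days (the relation k = -(ln 2)/H mentioned later in this thread):

```python
import math

H = 90.0                  # illustrative halflife in days
k = -math.log(2) / H      # makes e^(k*age) halve every H days

# The two curves agree at age 0 and age H, then diverge: the rational
# form has no fixed halflife and decays far more slowly in the tail.
for age in (0, 90, 180, 360):
    rational = H / (H + age)
    exponential = math.exp(k * age)
    print(age, round(rational, 3), round(exponential, 3))
```

At age 180 the exponential is already down to 0.25 while H/(H+age) is still 1/3, and at age 360 the gap is 0.0625 versus 0.2; halving the age interval halves the exponential weight but not the rational one, which is exactly why the latter "has no half life".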
The instability I mentioned is related to groups pushing themselves around to ridiculous ranks. You can get a situation where for some reason a group of people who play each other have their wins weighing more than their losses, so as a group they keep climbing in rank with each iteration. Even when this climb is bounded, it eventually becomes very noticeable when you get a large enough group of players. I've fiddled with asymmetric weights in the past, and stuff like this kept happening. It doesn't happen when I use symmetric weights (unless I break something else, that is). I don't plan on spending more time fiddling with asymmetric weights again.
RobertJasiek: This kind of instability would not be treated by correcting rating drift only globally, is important, and it does not seem like it could be treated by easy means. So now I understand why currently you object to asymmetric weights. - Thus can we discuss whether suggestions like (a), i.e. "dependency on game number", can be made working together with the symmetric weights requirement? A possible approach is: For a game, each player contributes 1/2 to the game's rating change. Each player's contribution can be calculated (also) due to his historical context (e.g., his dependency on his own number of his played games). Then the sum of the rating change for the game is formed, using both players' contributions. Then the rating change is symmetrical: plus half the sum for the winner, minus half the sum for the loser.
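Robert's symmetric-split idea can be sketched for an Elo-style update (as he notes, this assumes per-game rating changes, which a maximum likelihood system does not literally have). The function name and inputs are illustrative:

```python
def symmetric_change(contrib_winner, contrib_loser):
    """Each player contributes half of the game's total rating change;
    the contribution may depend on that player's own history (e.g. his
    number of played games). The applied change is then the same
    magnitude for both sides, so the game itself stays zero-sum.
    Returns (winner delta, loser delta)."""
    total = contrib_winner + contrib_loser
    return +total / 2, -total / 2
```

Because the winner gains exactly what the loser drops, a population of such updates cannot drift, whatever history-dependent factors go into the contributions.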
Robert, there is no such thing as a 'rating change per game'. All ranks are iteratively recalculated every few seconds by the aforementioned maximum likelihood mechanism, even if no new game results are added.
RobertJasiek: Ok. I am not used to thinking in terms of maximum likelihood yet. Maybe my suggestion works for ELO derivates. How to translate it for maximum likelihood systems?
I think it's possible to add number of games played as a factor in the weight and keep it symmetrical. Currently any game's weight is a function of (age, playerA's rating confidence, and playerB's rating confidence). So you can make it a function of (age, playerA's rating confidence, playerB's rating confidence, # games played by playerA since, and # games played by playerB since).
A drawback now is that if you are seeking to improve your rating quickly, you would want to avoid playing people who play often, because that game would be weighted less and less as they play more. As an extreme example, suppose you beat someone and then they play 1000 games that day. Your victory will have almost zero impact on your rating.
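yoyoma's extreme case is easy to quantify with the game-number factor X/(X+num) proposed earlier in the thread (taking the tentative X = 60 games; the numbers are illustrative):

```python
# Weight contribution of the opponent's game-number factor X/(X+num)
# as the opponent plays more games after your victory.
X = 60.0
for num in (0, 10, 100, 1000):
    print(num, round(X / (X + num), 4))
```

After the opponent's 1000 games the factor is 60/1060, roughly 0.057, so the game retains under six percent of its original game-number weight; with the square-root variant the loss is milder (about 0.24) but still substantial.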
- yoyoma
RobertJasiek: Concerning the application of weights, your description is roughly what I have imagined. - Concerning the drawback, it thus seems to be a good idea to combine (a) "dependency on game number" with (e) "game number cutoff". For each player, the same game number cutoff should be used. Thereby the drawback does not occur. (Players with still fewer played games are the "?" players, which defines them cleanly and makes it clear to everybody that a game against those players does not have such a great weight as a game against a regular player. A "?"-player's game number could even be shown by the server.)
RobertJasiek: I have to correct myself a bit: Using (e) together with (a) does not eliminate the drawback, but restricts it. You are right that an opponent's high activity after the played game reduces its weight. However, at the same time a different opponent's low activity after a played game keeps its weight high. Every player should play against different opponents, so naturally the effect of his opponents' activities averages out. This is not bad. Besides, recall that a player's own (a) weight contributes half while his opponents contribute only the other half, which further restricts the drawback even if, on average, you really had bad luck with your opponents in that most of them showed above-normal activity after you played against them.

More fundamentally, reevaluation of games is the very purpose of maximum likelihood systems, so one can also simply call it a good feature that continuing activity of one's opponents gives further cause for reevaluation. In other words, if your opponents are active, you should feel motivated to be more active, too, so that the system can evaluate more fresh data for you. One may dislike maximum likelihood systems fundamentally (as I do), but if they are being used, we have to live with their fundamental behaviour of frequently reevaluating everything.
If I understand you correctly, by cutoff you mean to abandon games beyond a certain age completely. Those games' results generally are still valuable data, since they have an - albeit weak - correlation with current strengths.
Their optimal weight would be inversely proportional to their variance, compared to current games. Assuming that the ratings as of two months ago told as much about the actual ratings one month ago as the latter now tell about current ratings, we naturally get to exponential weight decay of the underlying game results.
Dropping games beyond a certain age is required for limiting the computational efforts. However, artificially worsening the data set by needlessly ignoring significant information would be against the grain of the whole ML concept, which aims at determining the best estimate of current strengths available.
Regardless, I'd agree that the KGS ratings could have done better recognizing your last months' progress. Even with plenty data available, ML estimates can be disappointingly bad due to very unusual behavior or improper choice of parameters. In my guess, the offset was caused by a bit of both.
It's not difficult at all to counter these troubles and make the system more responsive (just shorten the halflife time sufficiently), but that comes at the cost of less accuracy elsewhere: either participants who play seldom will get a [x?], or, if the threshold is lowered accordingly, "solid" ranks will be less certain.
In my view, there is a chance that an exponential decay component depending on the order of games in a player's backwards history could make sense (with symmetrical weighting). One could argue that e. g. the hundredth last game in player A's record not only provides some information about A's current strength, but also implies a slight hint that A may have improved more than other players who have played just three games since then.
Although the unintended side effects mentioned earlier certainly would occur, they might be outweighed by the gains, given that the "halflife count" is not too small. In the rating calculation of player A, a game x with opponent B then should be weighed according to
w(x) = confB * 2^-(time(x)/t_half + (nA(x) + nB(x))/n_half)
where
time(x) is the time passed since x
nA(x) is the number of games played by A since x
nB(x) is the number of games played by B since x
t_half is the time after which a game assumes half weight (all other factors being equal)
n_half is the number of games after which a game assumes half weight (all other factors being equal)
confB is the weight factor based on B's rank confidence (based on the second derivative of the previously determined probability density of B's rank distribution at its maximum)
(Again, confA is constant over A's games and omitted.)
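Harleqin's proposed weight is simple to implement directly from the formula. The t_half and n_half defaults below are placeholders (the text deliberately leaves n_half open):

```python
def w(time_x, nA_x, nB_x, confB, t_half=90.0, n_half=60.0):
    """w(x) = confB * 2^-(time(x)/t_half + (nA(x) + nB(x))/n_half).
    Exponential decay in both the game's age and the combined number
    of games both players have played since; confA is omitted because
    it is constant over A's record."""
    return confB * 2.0 ** -(time_x / t_half + (nA_x + nB_x) / n_half)
```

Because the two exponents add, a game loses half its weight whenever either t_half days pass or the two players together accumulate n_half further games, whichever combination comes first.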
I have no clue what value of n_half would be appropriate, though. Currently, it is effectively infinite. Perhaps that is best already?
wms, sorry, I was being stupid. By the way, your k=-(ln 2)/H.
hgmichna: I haven't seen any good proposals on countering the long-term rank drift yet, so let me make one.
The idea is that ranks should be realigned relative to known strong anchor players. It is like letting their rank "trickle down" through the layers of weaker players.
This could be done in every game, but I think there is a much simpler way. Whole ranges of ranks could be automatically realigned monthly (or perhaps weekly for weaker players).
The method is actually very simple. Take a range of ranks, like 3d to 1d, and calculate the winning-to-losing ratio of all games those players played against stronger players. If the ratio is 1, all is well.
If more games against stronger players were won than lost, raise the ranks of all players in the range by an appropriate amount. Conversely, if more games were lost, lower the ranks. Then add or subtract the same amount from the ranks of all players below the range as well.
Then take the next range below, (1k to 3k), and repeat until all players are readjusted.
The "rank nudge" amount should be smaller than the theoretically required one to achieve balance, relying on the fact that there will be the next adjustment soon enough. The factor could be adjusted after observing the system in action. The more often this adjustment is done, the smaller the factor can be. If a very small factor were used, the adjustment could even be done daily, but rank drift is probably not that fast.
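The steps above can be sketched in code. This is only an illustration of hgmichna's proposal, not KGS behaviour; the data layout, the damping value, and the use of a single rating number per player are all assumptions.

```python
def nudge_ranges(ratings, games, ranges, damping=0.5):
    """Periodic realignment: for each rating band (processed strongest
    first), compare wins vs losses against stronger players and nudge
    the band plus everyone below it by a damped amount.
    ratings: {player: rating}, games: [(winner, loser)],
    ranges: [(low, high)] bands on the rating scale."""
    for low, high in ranges:
        in_band = {p for p, r in ratings.items() if low <= r <= high}
        wins = losses = 0
        for winner, loser in games:
            if winner in in_band and ratings[loser] > high:
                wins += 1
            elif loser in in_band and ratings[winner] > high:
                losses += 1
        total = wins + losses
        if total == 0:
            continue
        # Positive when the band beats stronger players more than it loses.
        imbalance = (wins - losses) / total
        # Deliberately smaller than the theoretically required correction;
        # the next periodic adjustment picks up the remainder.
        shift = damping * imbalance
        for p, r in ratings.items():
            if r <= high:  # shift the band itself and everyone below it
                ratings[p] = r + shift
    return ratings
```

Shifting everyone at or below the band, rather than only the band itself, is what produces the "trickle down" effect: the relative distances among weaker players are preserved while the whole block is realigned against the stronger anchors.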
It is not necessary to make any adjustments within each dan or kyu range, because the existing ranking system takes care of that very well.
I don't know what the optimal ranges are, but there is no need to strike any mathematical optimum. The system is obviously stable anyway, as long as the few parameters are not chosen way outside common sense.
Probably a mathematician could come up with a more elegant algorithm to achieve the same thing, and I would appreciate that very much, but at least the idea is now on the table.
As far as I know, the algorithm currently used at KGS (ML with anchors) does this already - in a very smooth manner.
Pure ML merely gives relative ratings. They are recalculated many times a day. Based on a set of anchor players, those relative ratings then are translated to an absolute scale.
Thereby, global rank drift is limited to the drift of the average anchor.
Sounds very good, and sorry I didn't know. I think the documentation of this part is not publicly available, or is it?
I think it is an interesting question whether the entire relative rank system is good enough, so it can be seen as a complete block of correct relative ratings from the strongest down to the weakest players.
Is there really no rank drift between strong and weak players? Does KGS really not need any trickle-down effect to realign lower ratings? Consider that there are very many more weak player games than strong player games. Unless there is some very clever weighting, the weaker players would dictate the stronger players' ratings, the tail would wag the dog.
But perhaps it still works. Perhaps it doesn't matter if the tail wags the dog, as long as the dog is led on a short leash. I don't know enough about the mathematics involved, so I cannot tell how to judge the internal stability of the relative ranking system. In any case, if the relative ranks are sufficiently stable and correct, then readjusting the entire block makes a lot of sense and is simpler than my proposal.
A remaining question is whether there are enough anchor player games to warrant several adjustments per day. I also don't see any need to adjust the ranking system so often. It cannot possibly drift that fast.
One final question: If the system is working well, isn't this discussion superfluous?
I suggest you look at the KGS documentation already mentioned, links to which you can find at KGSRatingMath.
You will find that your "trickling down" is a good, if rough, description of the method by which the KGS rank system works: all players' ranks are continuously and iteratively compared, based on the game results, to their opponents' ranks and adjusted accordingly. This iteration is conducted as fast as the server can handle it; usually I experience rating shifts (when crossing a rank border) within seconds of finishing a game.