RobertJasiek: The rating system's global stability means that 1) the ratings of all players considered together do not drift and 2) the distance between every two ranks remains constant. I hope that (2) is ensured already by definition. In principle (1) is or can be affected by a) players leaving the population, b) players entering the population, c) asymmetrical rating changes of every single game. I do not know for sure whether (a) and (b) occur in this rating system, but I think that they do. (c) can have several causes. Currently we are not sure whether it is affected by the confidence weight. (c) would be affected if we changed to "dependency on game number".

Currently there appears to be a rating drift anyway, and this is another problem to be solved, regardless of whether we use (something like) "dependency on game number". So, as far as I understand the rating system now, using it would not create an extra problem of principle, but only possibly increase the urgency of solving the global rating drift problem.

It is fairly easy to counter a global rating drift, as follows: Each game means a preliminary rating change of c1 for player 1 and c2 for player 2. Altogether the preliminary rating change per game is c := c1+c2. For the global change, sum up all games' preliminary changes c, giving the sum C. Next, this C is redistributed over the whole player population. There are several ways to redistribute. One way is to redistribute only to the winners of games if C<0 (because then a positive amount, the negative of C, has to be redistributed) and only to the losers if C>0 (because then a negative amount has to be redistributed), ensuring for each player a rating increment if he wins or a rating decrement if he loses, considered per game.
Such a modifying shift can be redistributed either as an absolute fraction of C for each affected game (affected because it was either won or lost) or as a relative fraction of C: the greater c1 or c2 is (whichever of the two is modified at all), the greater the modifying fraction of C applied to it becomes, e.g., linearly. Doesn't this solve any rating drift problem?
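To make the redistribution idea concrete, here is a minimal Python sketch of the "relative fraction" variant described above. The data layout and the function name are illustrative, not anything KGS actually uses; the only assumption is that each game carries preliminary changes c1 (winner, positive) and c2 (loser, negative).

```python
def redistribute_drift(games):
    """Sketch of the proposed global drift correction.
    games: list of dicts with keys 'winner', 'loser', 'c_winner', 'c_loser',
    where c_winner > 0 and c_loser < 0 are the preliminary rating changes.
    Returns corrective shifts per player that cancel the global drift C."""
    # Total preliminary change over all games; a nonzero C means global drift.
    C = sum(g["c_winner"] + g["c_loser"] for g in games)
    shifts = {}
    if C < 0:
        # Ratings drifted down overall: hand -C (a positive amount) back to
        # the winners, proportionally to their preliminary gains.
        total = sum(g["c_winner"] for g in games)
        for g in games:
            share = (-C) * g["c_winner"] / total if total else 0.0
            shifts[g["winner"]] = shifts.get(g["winner"], 0.0) + share
    elif C > 0:
        # Ratings drifted up overall: take C away from the losers,
        # proportionally to their preliminary losses.
        total = sum(g["c_loser"] for g in games)  # negative
        for g in games:
            share = (-C) * g["c_loser"] / total if total else 0.0
            shifts[g["loser"]] = shifts.get(g["loser"], 0.0) + share
    return shifts
```

By construction the shifts sum to -C, so applying them makes the total rating change over the population exactly zero.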
Robert, regarding rating drift, I think there are a few things this can refer to.
My guess is that when you said "there appears to be a rating drift anyway" you were observing the local drift and the settling from the change from KGS2 to KGS3. I don't think we need anything like you propose for global rating drift correction.
The issue that it takes too many games to change your rating if you played often in the past is a separate concern which I will talk about in a separate post.
- yoyoma
quote:
- Local drift. Since KGS constantly reevaluates your rating even when you don't play games, if your previous opponents collectively improve and increase their rating, your rating will go up as well.
Is this really true? Please explain why the current progress of opponents you played in the PAST affects your rating now at all. It does seem that this is the source of global drift. It is against common sense as well: the opponent I played keeps winning, and that means I improve? Nonsense.
Explanation for my factor: the proposed factor is the simple way to let the weight decay exponentially.
Currently, as far as I know, the following weight factors apply:
My proposal is to include another asymmetric factor:
In order to keep the games' weights from decreasing too fast, the time halflife and the game-number halflife would have to be set higher than if each were applied without the other. A first suggestion could be 90 days and 60 games, but this has to be tweaked. Another possibility is to keep the old time halflife but to take the square root of both factors:
W = W0 * sqrt( H / ( H + age ) ) * conf * sqrt( X / ( X + num ) )
('W' weight, 'W0' default weight, 'H' time halflife, 'age' the game's age, 'conf' opponent's confidence factor, 'X' game-number halflife, 'num' the number of games this player has played since)
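The combined weight above is straightforward to compute. The following sketch uses the tentative values floated in this thread (90 days, 60 games); these are suggestions to be tweaked, not KGS parameters, and the function name is illustrative.

```python
import math

def game_weight(age_days, num_since, conf, w0=1.0, h=90.0, x=60.0):
    """Proposed combined weight W = W0 * sqrt(H/(H+age)) * conf * sqrt(X/(X+num)).
    Taking square roots of both factors means neither decays as fast
    as it would if applied alone."""
    return w0 * math.sqrt(h / (h + age_days)) * conf * math.sqrt(x / (x + num_since))
```

Note that with the square root, the time factor alone reaches one half not at age = H but at age = 3H, which is the point of combining the two factors this way.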
As for Robert's categories of new methods:
I think either 'a' or 'e' would address the problem in a helpful manner. 'e' seems a bit blunt to me but is the simplest way. 'a' could be more elegant but is also more resource-consuming.
RobertJasiek: I have a few questions about your comments: Can you explain your remarks about (b) and (c) in more detail? I do not consider (d) a joke. If the old data can be used (because they go in the same direction as the new data), then they improve the significance. If the old data cannot be used, then the proposal is like (b). Can you elaborate on why and how (e) is blunt and how it compares to (a)? (f) is not merely (c) in other words, because (f) might also use other functions, e.g., any e^(-ax) function for some suitable a>0; other exponential, monotonically decreasing functions are possible as well. OK, (f) and (c) are closely related, or we could simply consider a broader set of possible functions in (c).
Harleqin, about the current effect of rank confidence on ratings, I disagree. As far as I can see, it does *not* constitute asymmetric weighting. Each game result is weighed according to both players' rank confidences, which means, symmetrically.
In the calculation of a player's rating, the factor which stands for that player's own rank confidence can simply be omitted, because it is constant over all results and therefore cancels out.
If your games were always weighted by your own rank confidence, then you could never get away from 0 confidence.
I think that even if this is countered by the lowest possible confidence being greater than 0, the system would show strange behaviour: it has no confidence in your rank, therefore it doesn't believe your results (?!).
I am sure that someone has a real insight into the true workings of KGS and can tell us.
This notwithstanding, I still don't see the looming instability from asymmetric weighting. Can someone either bring up empirical or simulated data, an ab initio calculation, or structured reasoning showing this?
Right, there is a (removable) discontinuity at a rank confidence of zero.
I'm surely not the one you're asking, since I haven't seen the code explicitly myself. For the same reason, I don't know how the algorithm practically gets away from a player's initial zero confidence. Possibilities include
Because at a given point of time, the own weight factor is constant throughout one's record, even the second way doesn't affect the weighting symmetry in the main rating algorithm.
About the additional rules applying to nearly empty records (e. g. the 10? limit) I don't know enough to state anything, though.
A few clarifications:
weight=H/(H+age) is not exponential decay and has no half life. KGS does not use anything like this. Exponential decay means something like weight=e^(k*age), where k is negative.
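The distinction wms draws can be seen numerically. The snippet below compares the rational form H/(H+age) with true exponential decay e^(k*age), choosing k so that the exponential halves every H days (the relation k = -(ln 2)/H mentioned later in this thread):

```python
import math

H = 90.0                  # illustrative halflife in days
k = -math.log(2) / H      # makes e^(k*age) halve every H days

# The two curves agree at age 0 and age H, then diverge: the rational
# form has no fixed halflife and decays far more slowly in the tail.
for age in (0, 90, 180, 360):
    rational = H / (H + age)
    exponential = math.exp(k * age)
    print(age, round(rational, 3), round(exponential, 3))
```

At age 180 the exponential is already down to 0.25 while H/(H+age) is still 1/3, and at age 360 the gap is 0.0625 versus 0.2; halving the age interval halves the exponential weight but not the rational one, which is exactly why the latter "has no half life".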
The instability I mentioned is related to groups pushing themselves around to ridiculous ranks. You can get a situation where for some reason a group of people who play each other have their wins weighing more than their losses, so as a group they keep climbing in rank with each iteration. Even when this climb is bounded, it eventually becomes very noticeable when you get a large enough group of players. I've fiddled with asymmetric weights in the past, and stuff like this kept happening. It doesn't happen when I use symmetric weights (unless I break something else, that is). I don't plan on spending more time fiddling with asymmetric weights again.
RobertJasiek: This kind of instability would not be treated by correcting rating drift only globally, is important, and it does not seem like it could be treated by easy means. So now I understand why currently you object to asymmetric weights. - Thus can we discuss whether suggestions like (a), i.e. "dependency on game number", can be made working together with the symmetric weights requirement? A possible approach is: For a game, each player contributes 1/2 to the game's rating change. Each player's contribution can be calculated (also) due to his historical context (e.g., his dependency on his own number of his played games). Then the sum of the rating change for the game is formed, using both players' contributions. Then the rating change is symmetrical: plus half the sum for the winner, minus half the sum for the loser.
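Robert's symmetric-split idea can be sketched for an Elo-style update (as he notes, this assumes per-game rating changes, which a maximum likelihood system does not literally have). The function name and inputs are illustrative:

```python
def symmetric_change(contrib_winner, contrib_loser):
    """Each player contributes half of the game's total rating change;
    the contribution may depend on that player's own history (e.g. his
    number of played games). The applied change is then the same
    magnitude for both sides, so the game itself stays zero-sum.
    Returns (winner delta, loser delta)."""
    total = contrib_winner + contrib_loser
    return +total / 2, -total / 2
```

Because the winner gains exactly what the loser drops, a population of such updates cannot drift, whatever history-dependent factors go into the contributions.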
Robert, there is no such thing as a 'rating change per game'. All ranks are iteratively recalculated every few seconds by the aforementioned maximum likelihood mechanism, even if no new game results are added.
RobertJasiek: Ok. I am not used to thinking in terms of maximum likelihood yet. Maybe my suggestion works for ELO derivates. How to translate it for maximum likelihood systems?
I think it's possible to add number of games played as a factor in the weight and keep it symmetrical. Currently any game's weight is a function of (age, playerA's rating confidence, and playerB's rating confidence). So you can make it a function of (age, playerA's rating confidence, playerB's rating confidence, # games played by playerA since, and # games played by playerB since).
A drawback now is that if you are seeking to improve your rating quickly, you would want to avoid playing people who play often, because that game would be weighted less and less as they play more. As an extreme example, suppose you beat someone and then they play 1000 games that day. Your victory will have almost zero impact on your rating.
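yoyoma's extreme case is easy to quantify with the game-number factor X/(X+num) proposed earlier in the thread (taking the tentative X = 60 games; the numbers are illustrative):

```python
# Weight contribution of the opponent's game-number factor X/(X+num)
# as the opponent plays more games after your victory.
X = 60.0
for num in (0, 10, 100, 1000):
    print(num, round(X / (X + num), 4))
```

After the opponent's 1000 games the factor is 60/1060, roughly 0.057, so the game retains under six percent of its original game-number weight; with the square-root variant the loss is milder (about 0.24) but still substantial.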
- yoyoma
RobertJasiek: Concerning the application of weights, your description is roughly what I have imagined. - Concerning the drawback, it thus seems to be a good idea to combine (a) "dependency on game number" with (e) "game number cutoff". For each player, the same game number cutoff should be used. Thereby the drawback does not occur. (Players with still fewer played games are the "?" players, which defines them cleanly and makes it clear to everybody that a game against those players does not have such a great weight as a game against a regular player. A "?"-player's game number could even be shown by the server.)
RobertJasiek: I have to correct myself a bit: Using (e) together with (a) does not eliminate the drawback, but restricts it. You are right that an opponent's high activity after the played game reduces its weight. However, at the same time a different opponent's low activity after a played game keeps its weight high. Every player should play against different opponents, so naturally the effect of his opponents' activities averages out. This is not bad. Besides, recall that a player's own (a) weight contributes half while his opponents contribute only the other half, which further restricts the drawback even if, on average, you really had bad luck with your opponents in that most of them showed above-normal activity after you played against them.

More fundamentally, reevaluation of games is the very purpose of maximum likelihood systems, so one can also simply call it a good feature that continuing activity of one's opponents gives further cause for reevaluation. In other words, if your opponents are active, you should feel motivated to be more active, too, so that the system can evaluate more fresh data for you. One may dislike maximum likelihood systems fundamentally (as I do), but if they are being used, we have to live with their fundamental behaviour of frequently reevaluating everything.
If I understand you correctly, by cutoff you mean to abandon games beyond a certain age completely. Those games' results generally are still valuable data, since they have an - albeit weak - correlation with current strengths.
Their optimal weight would be inversely proportional to their variance, compared to current games. Assuming that the ratings as of two months ago told as much about the actual ratings one month ago as the latter now tell about current ratings, we naturally get to exponential weight decay of the underlying game results.
Dropping games beyond a certain age is required for limiting the computational efforts. However, artificially worsening the data set by needlessly ignoring significant information would be against the grain of the whole ML concept, which aims at determining the best estimate of current strengths available.
Regardless, I'd agree that the KGS ratings could have done better recognizing your last months' progress. Even with plenty data available, ML estimates can be disappointingly bad due to very unusual behavior or improper choice of parameters. In my guess, the offset was caused by a bit of both.
It's not difficult at all to counter these troubles and make the system more responsive (just shorten the halflife time sufficiently), but that comes at the cost of less accuracy elsewhere: either participants who play seldom will get a [x?], or, if the threshold is lowered accordingly, "solid" ranks will be less certain.
In my view, there is a chance that an exponential decay component depending on the order of games in a player's backwards history could make sense (with symmetrical weighting). One could argue that e. g. the hundredth last game in player A's record not only provides some information about A's current strength, but also implies a slight hint that A may have improved more than other players who have played just three games since then.
Although the unintended side effects mentioned earlier certainly would occur, they might be outweighed by the gains, given that the "halflife count" is not too small. In the rating calculation of player A, a game x with opponent B then should be weighed according to
w(x) = confB * 2^-(time(x)/t_half + (nA(x) + nB(x))/n_half)
where
time(x) is the time passed since x
nA(x) is the number of games played by A since x
nB(x) is the number of games played by B since x
t_half is the time after which a game assumes half weight (all other factors being equal)
n_half is the number of games after which a game assumes half weight (all other factors being equal)
confB is the weight factor based on B's rank confidence (based on the second derivative of the previously determined probability density of B's rank distribution at its maximum)
(Again, confA is constant over A's games and omitted.)
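Harleqin's proposed weight is simple to implement directly from the formula. The t_half and n_half defaults below are placeholders (the text deliberately leaves n_half open):

```python
def w(time_x, nA_x, nB_x, confB, t_half=90.0, n_half=60.0):
    """w(x) = confB * 2^-(time(x)/t_half + (nA(x) + nB(x))/n_half).
    Exponential decay in both the game's age and the combined number
    of games both players have played since; confA is omitted because
    it is constant over A's record."""
    return confB * 2.0 ** -(time_x / t_half + (nA_x + nB_x) / n_half)
```

Because the two exponents add, a game loses half its weight whenever either t_half days pass or the two players together accumulate n_half further games, whichever combination comes first.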
I have no clue what value of n_half would be appropriate, though. Currently, it is effectively infinite. Perhaps that is best already?
wms, sorry, I was being stupid. By the way, your k=-(ln 2)/H.
hgmichna: I haven't seen any good proposals on countering the long-term rank drift yet, so let me make one.
The idea is that ranks should be realigned relative to known strong anchor players. It is like letting their rank "trickle down" through the layers of weaker players.
This could be done in every game, but I think there is a much simpler way. Whole ranges of ranks could be automatically realigned monthly (or perhaps weekly for weaker players).
The method is actually very simple. Take a range of ranks, like 3d to 1d, and calculate the winning-to-losing ratio of all games those players played against stronger players. If the ratio is 1, all is well.
If more games against stronger players were won than lost, raise the ranks of all players in the range by an appropriate amount. Conversely, if more games were lost, lower the ranks. Then add or subtract the same amount from the ranks of all players below the range as well.
Then take the next range below, (1k to 3k), and repeat until all players are readjusted.
The "rank nudge" amount should be smaller than the theoretically required one to achieve balance, relying on the fact that there will be the next adjustment soon enough. The factor could be adjusted after observing the system in action. The more often this adjustment is done, the smaller the factor can be. If a very small factor were used, the adjustment could even be done daily, but rank drift is probably not that fast.
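The steps above can be sketched in code. This is only an illustration of hgmichna's proposal, not KGS behaviour; the data layout, the damping value, and the use of a single rating number per player are all assumptions.

```python
def nudge_ranges(ratings, games, ranges, damping=0.5):
    """Periodic realignment: for each rating band (processed strongest
    first), compare wins vs losses against stronger players and nudge
    the band plus everyone below it by a damped amount.
    ratings: {player: rating}, games: [(winner, loser)],
    ranges: [(low, high)] bands on the rating scale."""
    for low, high in ranges:
        in_band = {p for p, r in ratings.items() if low <= r <= high}
        wins = losses = 0
        for winner, loser in games:
            if winner in in_band and ratings[loser] > high:
                wins += 1
            elif loser in in_band and ratings[winner] > high:
                losses += 1
        total = wins + losses
        if total == 0:
            continue
        # Positive when the band beats stronger players more than it loses.
        imbalance = (wins - losses) / total
        # Deliberately smaller than the theoretically required correction;
        # the next periodic adjustment picks up the remainder.
        shift = damping * imbalance
        for p, r in ratings.items():
            if r <= high:  # shift the band itself and everyone below it
                ratings[p] = r + shift
    return ratings
```

Shifting everyone at or below the band, rather than only the band itself, is what produces the "trickle down" effect: the relative distances among weaker players are preserved while the whole block is realigned against the stronger anchors.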
It is not necessary to make any adjustments within each dan or kyu range, because the existing ranking system takes care of that very well.
I don't know what the optimal ranges are, but there is no need to strike any mathematical optimum. The system is obviously stable anyway, as long as the few parameters are not chosen way outside common sense.
Probably a mathematician could come up with a more elegant algorithm to achieve the same thing, and I would appreciate that very much, but at least the idea is now on the table.
As far as I know, the algorithm currently used at KGS (ML with anchors) does this already - in a very smooth manner.
Pure ML merely gives relative ratings. They are recalculated many times a day. Based on a set of anchor players, those relative ratings then are translated to an absolute scale.
Thereby, global rank drift is limited to the drift of the average anchor.
Sounds very good, and sorry I didn't know. I think the documentation of this part is not publicly available, or is it?
I think it is an interesting question whether the entire relative rank system is good enough, so it can be seen as a complete block of correct relative ratings from the strongest down to the weakest players.
Is there really no rank drift between strong and weak players? Does KGS really not need any trickle-down effect to realign lower ratings? Consider that there are very many more weak player games than strong player games. Unless there is some very clever weighting, the weaker players would dictate the stronger players' ratings, the tail would wag the dog.
But perhaps it still works. Perhaps it doesn't matter if the tail wags the dog, as long as the dog is led on a short leash. I don't know enough about the mathematics involved, so I cannot tell how to judge the internal stability of the relative ranking system. In any case, if the relative ranks are sufficiently stable and correct, then readjusting the entire block makes a lot of sense and is simpler than my proposal.
A remaining question is whether there are enough anchor player games to warrant several adjustments per day. I also don't see any need to adjust the ranking system so often. It cannot possibly drift that fast.
One final question: If the system is working well, isn't this discussion superfluous?
I suggest you look at the KGS documentation already mentioned, links to which you can find at KGSRatingMath.
You will find that your "trickling down" is a good, if rough, description of the method by which the KGS rank system works: all players' ranks are continuously and iteratively compared, based on the game results, to their opponents' ranks and adjusted accordingly. This iteration is conducted as fast as the server can handle it; usually I experience rating shifts (when crossing a rank border) within seconds of finishing a game.