KGS Issue - Game Result Weighting
(Moved from KGS Wishlist / Game Handling)
Basic idea
blubb: Weight ratings according to the average time per move (or at least, to the total playing time of a game), and use a continuous function to do so.
wms: This simply makes no statistical sense, blubb. Either games are slow enough to predict the strength of a player in "normal" speed games, or they aren't. If they are, they should be counted. If they aren't, they shouldn't. I just don't see why I would add inaccurate information to the rating system at any weight. Making it weighted less doesn't make it any more accurate, just possibly less damaging - but leaving it out completely will be even less damaging, so that's what I intend to do.
blubb: Well, for the case that your sample is infinite, I agree. Then an element (that is, a game) either contributes useful data to the evaluation (correlation being positive) and can be included with full weight, or it doesn't contribute useful data and should not get any weight at all (zero or negative correlation). However, ratings are calculated from a finite set of results. I don't think this is the right place to go into details of theory, and I am not quite familiar with English prob&stat vocabulary either. I'll give a (somewhat artificial) example, so it should be easy to see the consequences. (Sorry if that's "too many numbers". I want to keep the conversation as low-level as possible in order to allow non-math people to follow.)
- My point is not about absolute ranks, but about the spreading of rank differences. Imagine there are exactly two distinct types of games with different certainty of outcomes, that is, with different variances (represented by P-functions with different k values, in terms of the KGS Math help page). Neither the age of games nor the opponents' rank confidence shall influence the weight. Type B shall behave according to half of the k value of type C. The variance of rank difference is Integral(( (1/k) * ln(p / (1-p)) - 0)^2) dp, where p = 0 ... 1, hence type B has a variance 4 times bigger than type C (which I call varC).
- Now let's say, a particular user's record consists of 10 games of type B and type C each. If you don't weight, you won't want to take account of type B games - they would eliminate a considerable part of accuracy. The resulting rank diff variance of all B- and C-games together would be ((10*1 + 10*4)/(10+10)^2) * varC = 1/8 * varC, while you can get a narrower variance of (10*1/10^2) * varC = 1/10 * varC by evaluating the ten C-games only.
- But taking the average outcome of four B-games for a single C-game's result, they can be included very well: just assign B-type games a weight of 1/4, and C-games a weight of 1, so you can treat the whole set as 12.5 games with a variance of varC each. The resulting rank diff variance is even better: ((10*(1*1^2) + 10*(4*(1/4)^2))/(10*1 + 10*1/4)^2) * varC = 1/12.5 * varC. That way, the B games contribute valuable information, resulting in more accuracy than the C-games can provide alone.
- (I have chosen "B" and "C" to indicate that there could be "A-games", not deserving any weight > 0.)
- For optimal weighting, the product weightingfactor*variance needs to be constant for all game types. Of course, that also applies if there are no distinctive types but a continuous range. The weighting function suggested below is an attempt to roughly approximate the bigger influence of "luck" in speedy games. (By the way, games faster than m are left out completely.)
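(Editorial sketch, not part of the original discussion: the arithmetic in blubb's B/C example can be checked with a few lines of Python. The function name `combined_variance` and the tuple layout are assumptions made for this illustration; variances are expressed in units of varC, and the formula is the usual variance of a weighted mean of independent results.)

```python
from fractions import Fraction

def combined_variance(groups):
    """Variance of a weighted mean of independent game results, in units of varC.

    groups: list of (count, relative_variance, weight) tuples (hypothetical layout).
    """
    # Numerator: sum of weight^2 * variance over all games.
    num = sum(n * var * w**2 for n, var, w in groups)
    # Denominator: square of the total weight.
    den = sum(n * w for n, var, w in groups) ** 2
    return num / den

one, quarter = Fraction(1), Fraction(1, 4)

c_only = combined_variance([(10, 1, one)])                       # ten C-games alone
unweighted = combined_variance([(10, 1, one), (10, 4, one)])     # B and C, equal weight
weighted = combined_variance([(10, 1, one), (10, 4, quarter)])   # B-games weighted 1/4

print(c_only, unweighted, weighted)  # 1/10 1/8 2/25
```

Here 2/25 equals 1/12.5, matching the figure above: with weight proportional to 1/variance, adding the noisier B-games narrows the estimate, while adding them at full weight widens it.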
idigo: We seem to be mixing up two ideas here. Certainly it makes some amount of sense that making the time settings shorter will produce a game with results that are more variable in outcome, on average, than games at standard timing. To assume that in fact the variability is approximately constant for the time settings considered is most easily thought of as an approximation within the statistical model, to be sure. But the entire statistical model is itself only an approximation in the first place. To justify making the model "more accurate" (and hence more complicated) on any given front, you'd have to demonstrate to me that the gains are significant.
Just because a variable correlates with KGS ranking at standard time does not mean that we should use it to establish KGS ranking. When you start playing at very fast time settings, different abilities are stressed compared with normal-speed Go, some of which we might not even consider to be Go-related. There are probably many players who would gain three stones if forced to play at very fast settings. There are probably many who would lose three. This observation alone is enough to justify restricting the range of time-settings which define rated games. If we were to establish a ranking tournament based on games of type C, and find that for the bulk of the population, their "C-rank" is "close enough" to their current KGS rank, then we could include type-C games in our ranking system if we so desired. Otherwise, we probably shouldn't.
blubb: Even if those C-ranks were off by three stones on average (with the same mean, i. e. no global shift), the information provided thereby would be more useful than "no information". Therefore, to gain the best estimate available, those data should be considered, albeit with comparatively low weight. Of course, if their significance was too low, that might not be worth the effort. In my view, it makes sense down to a resulting weighting factor of 1/16 or so.
Practical Concept
blubb: The ability to recognize that the opponent has just moved and where, then to move the mouse to some point and finally to click the button there; furthermore (and not least), to have a fast PC and internet connection - all this is insignificant for what I'd call an appropriate rank, and hardly correlates with players' strength. Hence, a weighting function is advisable that depends on the part of the players' total activity which, on average, is spent on those matters. I assume something like w := max(0, (T-N*m)/T) would work fine, where m stands for the time practically needed to perform a move without spending any time on thinking. N is the total number of moves played in the game and T is the total playing time, making (T-N*m) the time "left for thinking". Maybe 2 seconds per move (s/move) would be a good starting point for m. Using this m value, a game lasting for 80 minutes and consisting of 240 moves (which means it's a 20 s/move-game) would get a weight of 0.9 times the maximum that could be achieved. A 5 s/move-game would get 0.6, and a game of 2 s/move or less would get zero weight. The very minimum of m could also be estimated by experiment: let a representative set of people try to place as many stones per minute as they can, then m approximately is the average time per move.
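(Editorial sketch, not part of the original discussion: the proposed w function written out in Python, with the example values from the paragraph above. The name `game_weight` and its parameter names are assumptions made for this illustration; nothing here reflects how KGS actually computes weights.)

```python
def game_weight(total_seconds, moves, m=2.0):
    """Proposed weight w = max(0, (T - N*m) / T).

    total_seconds: T, total playing time of the game
    moves:         N, total number of moves played
    m:             seconds needed to physically make a move without thinking
                   (2.0 is the starting point suggested in the text)
    """
    if total_seconds <= 0:
        return 0.0  # no recorded playing time: treat as zero weight (an assumption)
    return max(0.0, (total_seconds - moves * m) / total_seconds)

# 240-move games at 20, 5 and 2 seconds per move.
for seconds_per_move in (20, 5, 2):
    total = 240 * seconds_per_move
    print(seconds_per_move, round(game_weight(total, 240), 2))
```

Since T = N * (seconds per move), the weight depends only on the average time per move: it is 1 - m/(s/move) above the threshold and 0 at or below it.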
Followup discussion
- Rakshasa: It seems more logical (and much simpler to implement) to require ranked games to be blitz or slower. The code should already be there for detecting ultra-blitz. Then no one can complain about ranks being unfair because of weights. (This is done in 2.5.8.)
- blubb: To weight the games 100% or 0% only is simpler, that's true. About the "more logical" issue, I cannot agree, though. In fact, each game at KGS is already weighted continuously (with the weight being anything between 0% and 100%), depending on the opponent's rank confidence, as well as on how old the game is. There are games which get 0.10 (10%) of the maximum possible weight, while other games get 0.93 (93%). I have never heard complaints about this weighting, but I have heard complaints about ranks being unfair because the weighting function doesn't depend on game speed.
- Rakshasa: Those weights are related to how long ago the game was played and the weight of the players. It's not that weight I'm talking about here; each game has a constant base weight. (It's worth a single win or loss.) If time changes the base weight, you suddenly end up with more games with way longer time than is needed. Is it fair that a game with an arbitrarily long game time gets weighted a lot more? (They might not even spend half the time.) What I think is fair about the ultra-blitz cutoff is that those games are mostly used by either time cheaters or those who want to have fun playing blitz.
- blubb: If you think carefully about the w function given above, you will recognize that it does respect these two components, but in a gradual instead of the "all or nothing" way. The "not-possible-to-think-in" time stands for the part of activity which shouldn't contribute to the rating. What KGS calls "ultra-blitz", e. g. a 3 s/move-game, mainly consists of that kind of time, hence it would get a very low weight or none at all. On the other hand, the w-difference between, say, 10 s/move (80%) and 20 s/move (90%) is rather small (10%), but there is one, because the 10 s/move game is slightly more influenced by factors not worth rating. Compared to the huge weighting differences which can occur due to different opponents' rank confidences (which KGS users usually don't know much about, either), the w-differences between reasonably timed games are hardly noticeable. I would not expect serious players to play slower without thinking deeper just in order to make the result slightly more valuable for the rank calculation algorithm.
- Rakshasa: What you seem to miss is that if a player only plays games with the same time then all of those games will have the same weight. If ultra-blitz only counts for 10% of normal games, then he'll just have to play a few more games before he gets a solid rank.
- blubb: That's exactly what weights are about. The influence of non-go-related stuff is higher in faster games, and so is the variability of outcomes. Hence more than one fast game is needed for giving the combined result the same reliability that a single slow game's result provides. -- And again, m could be adjusted to any value, e. g. 10 s, giving even a 10 s/move-game no weight at all. I just want to point out that, whatever the threshold is set to, a game that barely exceeds that minimum time shouldn't get the same weight as a long-time game.
- Krit: I disagree. Whatever the time constraints (blitz or slow -- not including ultra-blitz), the better player wins. It's the player's choice to play under whatever time limit. This is taking rating too seriously. A long time limit doesn't necessarily reflect the true strength. I'll give an analogy. Suppose there is a Life and death situation; given 5 mins, I bet a 5k will also get the same answer as a 5d. Does it give you any indication of strength now with a long time limit? If you give 20s, a 5d will most likely be able to read the solution out.
- Velobici: From reading Matthew MacFadyen's comments regarding Ingenious Life and Death Puzzles Volume 2 on the Problem Book Grades page, it appears that there is good reason to believe that no matter how much time a 5k has he will not "get the same answer as a 5d". Witness Matthew MacFadyen's statement "the last 15 problems, which took over 4 hours and I still got 10 of them wrong". It would be reasonable to assume that a professional would get them all right and not need 4 hours.
- Krit: Not entirely true, there are simpler life and death problems. In a real game, if a group might be killed, a 5k might find a way to save it if given time. So it can't be said that wins in blitz games are not due to strength, or that the winning percentages aren't reliable. It's the players' choice (both players).
- Also, problems in life and death books rarely appear in real games. There are only a handful of (basic) tesuji, corner and side patterns that will occur in a majority of games. You don't always need to find the best move to win a game, sometimes a move that works is enough.
- Reuven: It'd create another problem - What if one finds long games more tiring than blitz? (blubb: Do very slow games have a bigger variance than medium speed ones? Can't believe that.) Do you think it'd be appropriate to enter another variable? Checking the time against the average time or perhaps against the time settings performed best at? (Providing it's not blitz?) Should this also be affected by the time of the day? (Would I be pushing it, suggesting to connect electrodes trying to determine one's state of mind, mood?;)
- blubb: I suppose you're right, there are more influences to k (and to the "rank diff vs. result oftenness" distribution in general) than opponents' rank confidence and playing speed. I haven't seen any of them ever implemented, though. :)
There are a lot of interesting ideas here, but I think Ockham's razor is important. This is something both scientists and programmers should always keep in mind. In my experience, the current system is good at finding good matches.