UGS Ranking System

   

Table of contents

Ranking Systems

nachtrabe One of the things that is going to need to be settled on eventually is a ranking algorithm, so I figured we should get a page started discussing ranks, the issues involved, and different implementations. Presently this is just the broadest of overviews that I've thrown together; later I'll post more of the nitty-gritty, along with problems and potential solutions.

Ranking vs. Rating

It seems that much of this page is discussing a Rating system rather than a ranking system. Rating is the adjustable rank like ELO. Rank is typically used as a maximum value attained by a user (like the Japanese pro rank system).

Storing both values for any given user would be beneficial! The rank should be stored in any commentary / analysis feature. The rating should be generally displayed as the user's current level.

I would suggest something like:

Rank = max(mean(1 month of solid rating))

which means if a user does not play enough to get a solid rating and keep it for a month, that user's Rank = 30k. If a user has a 5k solid rating for one full month, then drops back down to 7k, that user obtains a 5k Rank.

As with anything, there is some opportunity for abuse, but a "[ext] volity" style (see /protocols?) confidence rating might also be useful.
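As a sketch (a hypothetical helper, assuming one solid-rating sample per day and a kyu-numbered scale where lower numbers are stronger), the Rank = max(mean(1 month of solid rating)) proposal could look like:

```python
# Hypothetical sketch of Rank = max(mean(1 month of solid rating)).
# Ratings are in kyu units (lower = stronger), so the best monthly mean
# is the minimum, and the default Rank is 30k.

def rank_from_history(daily_ratings, window=30, default=30.0):
    """daily_ratings: one solid-rating sample per day, oldest first."""
    if len(daily_ratings) < window:
        return default  # never held a solid rating for a month: Rank = 30k
    best = min(
        sum(daily_ratings[i:i + window]) / window
        for i in range(len(daily_ratings) - window + 1)
    )
    # Rank is the strongest month-long average ever attained.
    return min(best, default)
```

So a player who holds 5k for a full month and then drops back to 7k keeps a 5k Rank, as described above.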

nachtrabe As near as I can tell, the choice of word usage between rating and rank is completely arbitrary and any discussion on the matter is just semantics. The literal meaning of rank is "a relative position or degree of value in a graded group," so that is what I used here. I wonder about the use of keeping track of a maximum as you are proposing (since it can be inferred by looking at a rank graph), but doing so isn't terribly difficult once you have the rating/ranking algorithm in place.

If we use an MLE (the system I've discussed below) then confidence intervals take care of themselves--we can calculate the variance and use that to determine the confidence interval assuming a normal distribution. That information can be used to determine whether a rank/rating is "solid."

Malweth: Although not a criticism of your rating system, the difference is not merely semantic and, given the overall purpose of this server, a rank (as defined above) is as important as a rating (as potentially defined below). I agree that the two terms are often used interchangeably, but I believe that they should not be in the context of this server.

Because this server's drive is an all-encompassing solution, this indicates that a user needs to keep the authoritativeness that they've earned. I've had a hard time keeping a stable rating sometimes (lately) on KGS yet I still consider myself 7k.

A separate Rank vs. Rating system would take care of this without any problems. A good example of this is the IGS [ext] Rating Certificate. Also reference Rank and Rating.


Bildstein: Recently I've been considering the possibility of a go server that does not show you your rank, and nor does it show you anyone else's rank. Instead, it could simply tell you the difference in strength between the two of you, and on that basis you could decide how to play. Wouldn't that be novel? There'd be almost no incentive to try to increase your rating, because you'd probably never even notice if you did! And it would completely do away with issues like rank inflation. The more I think about it, the more I think I'd like to play on that go server.

As a side issue: does anyone know if/where this idea has been discussed previously on SL?

Malweth: About the closest thing I could find was Rating Paranoia. Although Rank Inflation may seem gone in this type of system, I don't believe that it would go away - the rankless system must still apply a rating to each player in order to determine the relative difference between two people. If one of those people is stronger or weaker than that relative difference allows, they will have Rank Skew? of some type... actual rank inflation can still occur at the server level, though it doesn't really matter anymore.

I like this type of system in essence, but I think it would be better suited as an optional system (much like the KGS [-]). This would allow all people to be ranked, but not necessarily see that rank (or the rank of others). This is something client side that could be easily introduced into IGS or KGS.

Benjamin: Interesting idea, but I think Malweth is right: "If possible, let the user decide" already is an unofficial design guideline of UGS, I guess ^^

Software Testing Ground

nachtrabe I've [ext] created a little software to test out some ideas and see how they work in practice. This is mainly to try out different algorithms and tweaks and to experiment with different optimizations on those algorithms.

The software is written in python (my Java is a little rusty and I didn't want to have to look up anything), but should be easy to port to any other language with very little difficulty. It is a little hackish, but has been reasonable ground for me to test different ideas and see how they play out. I've released it under the modern BSD license--so it can be used wholesale, in pieces, or whatnot for whatever purpose by anyone.

Run it by entering:

 python Player.py

from the command line.

Later I'll try and implement a few different ideas so that we can compare them (including at least one point-based scheme), improve the documentation, and add a little more of an interface so that we can better represent an evolving dataset and simulate what this will look like in production.

Let me know what you find and any ideas for modifications, etc.

(note, it may take a little while for the above link to work, I'm having some uploading issues).

Issues

Issues that we have to deal with for this:

  • No accurate initial first guess.
    • Rank cheaters exist (sandbaggers, etc).
    • No one is paying for membership.
    • Multiple accounts for different states (inebriated, etc) possible.

Implication: Updates to rank should be based on an aged function of opponents' current ranks, rather than depending on the ranks the players held at the time the game was played being accurate.

  • Ranks will move quickly among lower-ranked players.

Implication: The ranking system is going to lag the players' actual ranks at the low kyu levels. It also means that we have to be careful about a fast-climbing player "clobbering" slower-climbing players.

  • No standardized tournaments to decide on rank.

Implication: Games may be played at any time against anyone with any handicap and komi and not in a set with predetermined opponents. Playing games continuously instead of in discrete little bunches presents its own problems.

  • Very very large scale.

Implication: Potentially several thousand players who won't necessarily let things run on a "gentleman's agreement" in terms of how they behave.

  • People will want to play with different komi and handicap values.

Implication: See above. Not only will people need to be able to play with different rank differences, but they are going to want to (I was watching a game between a 7d and a 1k at 4 handicap stones earlier).

  • Most games are going to be close to even.

Implication: Cannot depend on a large range of scores from games.

Then there are simply some difficulties in ranking in general:

  • Handicaps are not strictly transitive.
  • The value of handicap stones is estimated at 13.5 points apiece, but this isn't really a constant.
  • People's ranks change.
    • Quite potentially when they aren't playing on a server despite having an account there.
  • Incentives created by ranking system.
    • Examples include the necessity to create new accounts on IGS, and the disincentive for playing in UCSF (see [ext] Glickman 1995).

Approaches

There are two basic approaches to the problem of ranking: EGF-style systems (where players shift by a certain amount every time they play) and maximum-likelihood methods (where we attempt to estimate a player's rank based on all of their games).

nachtrabe I do not believe that EGF-style systems are particularly well suited for our purposes: we have no clear initial guess, players change rank quickly (both up and down), there is no specific tournament format, and, as near as I can tell, the approach is not conducive to generating confidence intervals. This kind of system can also develop "pockets" of weirdly underrated players in some implementations. I am only going to elaborate on and discuss the complications involved in a KGS-style system at the moment, but I feel obligated to mention EGF-style systems and will elaborate on them a bit further later so that people can better compare the two.

Probability of a Win

One of the things that has to be decided is the probability of a victory given the rank difference between the two players. There are three equations that have seen widespread use in KGS, NNGS, DGS, etc. For all of these, R_A and R_B are assumed to already factor in things such as handicap, komi, and who moves first.

nachtrabe: I finally got a chance to analyze the [ext] data found at the EGF Official Ratings website. I experimented with three different probability equations to see which would give me the best fit. Interestingly, I found that the data seemed to break down along lines similar to the ones used in the modern KGS ranking algorithm (roughly 20k-5k, 5k-1k, 1d+). I'll experiment more to figure out better boundaries, along with some curve fitting. Most of this can wait for much later and I'm not sure we want to mimic the EGF values anyways, but then again I just enjoy playing with it :)

Old KGS Equation

This is the one currently implemented in the simulator, I'll look into adding the others later.

 P_A(B) = 1 / ( 1 + exp( k * ( R_B - R_A ) ) )

New KGS Equation

 P_A(B) = r + (1-2*r) / ( 1 + exp( k * ( R_B - R_A ) ) )

r represents the minimum probability of a win. For KGS this is 0.005 (0.5%). It should also be noted that k now varies with rank.

NNGS/IGS Equation

 P_A(B) = 0.5 * S^(R_A - R_B) = 0.5 * exp( ln( S ) * ( R_A - R_B ) )

For IGS:

 S = 9/16

For NNGS:

 S = 4/9

Also seen:

 S = 1/e
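For comparison, the three families above can be written out directly. This is a sketch with the constants taken from the text; the servers' actual k values vary by rank band and are not given here:

```python
import math

def p_old_kgs(ra, rb, k=1.0):
    """Old KGS form: plain logistic in the rank difference."""
    return 1.0 / (1.0 + math.exp(k * (rb - ra)))

def p_new_kgs(ra, rb, k=1.0, r=0.005):
    """New KGS form: same logistic, floored so no win is less likely than r."""
    return r + (1.0 - 2.0 * r) / (1.0 + math.exp(k * (rb - ra)))

def p_nngs_igs(ra, rb, s=9.0 / 16.0):
    """NNGS/IGS form: 0.5 * S**(R_A - R_B); S = 9/16 (IGS), 4/9 (NNGS), or 1/e."""
    return 0.5 * s ** (ra - rb)
```

Note that for equal effective ranks all three reduce to 0.5, as expected for an even game, and the new KGS form never drops below r.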

Maximum-Likelihood

Math!

Note: the method described here is for the univariate case; it can get much more complicated in multiple dimensions, though the basic methodology is similar.

The idea here is that the probability of a victory can be calculated as a specific probability, generally using a function of the form P_A(B) = 1 / ( 1 + exp( k * ( R_B - R_A ) ) ), where R_A and R_B take into account things such as komi, handicap stones, and who moves first. There is also P_A'(B) = 1 - P_A(B), which represents the probability of that individual losing.

The probability of a series of independent events happening is the product series of the individual probabilities. Thus, the probability of 3 coin flips turning up heads is 1/8 (1/2 * 1/2 * 1/2).

Therefore, the probability of someone getting the exact win-loss record that they have is:

 l = P_A(B_0)*P_A(B_1) * ...
 l' = P_A'(C_0) * P_A'(C_1) * ...
 L = l*l'

Where B_0, B_1, ... are the opponents the player has defeated and C_0, C_1, ... are the opponents who have defeated the player. L is the so-called likelihood of that exact set of events (assuming they are independent).

Products are ugly to work with mathematically, so we can convert the product into a sum by using the property that:

 ln(a*b) = ln(a) + ln(b)

Where ln represents the natural logarithm.

 Λ = ln(L) = ln(l) + ln(l') = ln(P_A(B_0)) + ln(P_A(B_1)) + ... + ln(P_A'(C_0)) + ln(P_A'(C_1)) + ...

Now we are looking for the maximum value of that function. For that we take the derivative of Λ with respect to R_A and solve for zero.

This method can be done iteratively with the approximates for the R_B and R_C values to solve for R_A.

Finding the variance (the square of the standard deviation) from this is fairly trivial, which can then be used to calculate confidence intervals.

Note that this system can be weighted based on the age of the game and the confidence in that player's rank, along with any other factor(s) that we choose. These weights are applied by multiplying each of the log terms by the weight we want to use (this is based on the property of logarithms that ln(a^b) = b*ln(a)).
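The whole procedure above can be sketched as follows. This is not the root-finding approach discussed further down the page: since Λ is concave in R_A for this logistic model, any simple 1-D maximizer works, and a plain ternary search is used here for brevity:

```python
import math

def p_win(ra, rb, k=1.0):
    """Logistic win probability used throughout this page."""
    return 1.0 / (1.0 + math.exp(k * (rb - ra)))

def log_likelihood(ra, wins, losses, k=1.0):
    """wins/losses: lists of (opponent_effective_rank, weight) pairs."""
    lam = 0.0
    for rb, w in wins:
        lam += w * math.log(p_win(ra, rb, k))        # weighted ln P_A(B) terms
    for rc, w in losses:
        lam += w * math.log(1.0 - p_win(ra, rc, k))  # weighted ln P_A'(C) terms
    return lam

def mle_rank(wins, losses, lo=-10.0, hi=10.0, k=1.0, tol=1e-6):
    """Maximize the concave log-likelihood by ternary search."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, wins, losses, k) < log_likelihood(m2, wins, losses, k):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0
```

For example, two unit-weight wins and one loss against rank-0 opponents put the estimate at about +0.69, where the win probability comes out to 2/3.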

Iterated Solution

For a large multidimensional system, such as this one, there are two basic ways of solving the system of equations that is generated by this method: all at once or iteratively.

All at once would be very tricky and runs into some complications I may go into later (such as the lack of a good zero-finding algorithm or a fallback algorithm that's guaranteed to work).

Iterative solutions are the best for our purposes. Here the ranks of the other players are assumed to be correct for each person as she is being updated. Then the process is repeated with the next person, assuming that the newly calculated ranks are correct.

What this means is that it takes several iterations to converge on the "right answer," but it will get there.

This also gives us a little more flexibility when it fails for one reason or another (such as when someone wins all of his games).
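The iteration itself might be sketched like this, where `solve_one` is a hypothetical stand-in for the single-player MLE step described above:

```python
# Iterated solution sketch: update one player at a time, holding the
# others' current estimates fixed, and sweep repeatedly until the
# estimates settle. solve_one(wins, losses) is a hypothetical
# single-player solver taking lists of opponents' current ranks.

def iterate_ranks(initial_ranks, games, solve_one, sweeps=20):
    """initial_ranks: {name: rank estimate}; games: list of (winner, loser)."""
    ranks = dict(initial_ranks)
    for _ in range(sweeps):
        for name in ranks:
            wins = [ranks[loser] for winner, loser in games if winner == name]
            losses = [ranks[winner] for winner, loser in games if loser == name]
            # Skip all-win or all-loss records: they have no finite MLE,
            # which is one of the failure modes mentioned above.
            if wins and losses:
                ranks[name] = solve_one(wins, losses)
    return ranks
```

Each sweep uses the newest estimates for everyone else, so the whole population converges together over several sweeps rather than in one pass.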

Weighting

One of the nice things about taking the natural logarithm of the likelihood function is that it makes weighting substantially easier. We can multiply each term by some factor depending on how heavily we want that game to be factored compared to the others.

Exponential functions make the most sense for calculating weighting factors and can be used for things such as half-life equations.

Possible weighting factors will be evaluated here.

nachtrabe note: I am not partial to any of these in particular after Age and Rank Confidence--simply throwing them on the table for consideration in the interest of equal treatment. I actually consider some of them very bad ideas.

Recentness of games

Older games should be given a lower weight than games that have been played more recently. The most logical way of going about this is probably to give a game a "half-life" such that after a certain period of time its weight is half of what it was before. So if the half-life is 30 days, then after 30 days the game will have a weight of 0.5, after 60 days 0.25, after 90 days 0.125, etc.
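This half-life weighting is essentially a one-liner; a sketch:

```python
def age_weight(age_days, half_life=30.0):
    """Weight of a game age_days old: halves every half_life days."""
    return 0.5 ** (age_days / half_life)
```

With the 30-day half-life from the example above, a fresh game has weight 1.0, a 30-day-old game 0.5, a 60-day-old game 0.25, and a 90-day-old game 0.125.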

Rank Confidence

The more confident the system is in the person's rank (i.e., the lower the variance) the higher the weight the game can be given.

Number of Games Played

There could be a half-life relating not only to how old a game is in absolute terms, but in terms of how many games have been played since that game took place.

  • Advantages
    • It makes a kind of logical sense--people don't just improve over time, but with the number of games they play as well.
    • Reduces the lag between rank improvement and it being reflected in the rating.
    • Encourages people to play a lot of games when they want their rating to improve.
  • Disadvantages
    • Will cause a form of cheating where people will play a lot of "easy win" games after losing, to reduce the weight of their lost game.

Handicap Penalty

Give higher handicap games a lower weight than even games.[1]

  • Advantages
    • Encourages people to play games against lower-rated players.
    • Acknowledges that handicaps are not quite transitive and that 1 handicap stone is not quite the same as one rank.
  • Disadvantages
    • May negatively affect the rating system that is based on the assumption that a handicap stone is roughly one rank difference.[1]
    • Even games will have a stronger weight tied to them.

Rank Difference Penalty

Assign a lower weight for games where the rank difference (even given handicap stones) is "too large."[1]

Same advantages and disadvantages as for the Handicap Penalty.

Time Control Penalty

Include faster time control games, which receive less weight than a classical game.

tderz: Please check out this paper: [ext] Rating Formula – Better than Elo? Jeff Sonas discusses and criticizes the FIDE Elo system and makes good suggestions for improvement, such as a different K-factor (24 proposed instead of 10; it is 32 in Go). The part relevant to a Go server is suggestion #3: include faster time control games, which receive less weight than a classical game; see the chapter "Rapid and Blitz" and the last table, titled "Suggested coefficients to use for any time control". Please note that the K there is not the constant K from Elo, but an (arbitrary) weighting factor for the different Elo ratings (Standard, Modern, Blitz, Rapid) that could form someone's combined Elo.

Win Types

Assign different weights for different kinds of wins/losses, e.g., a lower weight for a game that is decided on time, or a lower weight if the game is decided within 6 points.

  • Advantages
    • Discourages blitz games where a player tries to win on time rather than on the board.

tderz: Why does it discourage? The player who wins this way (pushing someone over time) gains rating points; if that was the incentive to play and/or win, then it is an encouragement. The problem with the statement "blitz games where a player tries to win (1) on time rather than (2) on the board" is that no player knows in advance whether he'd win on the board (2) or would have to try to win on time (1), if that is even possible. I rather think that the causality chain is different here:

  • first you try to play and win (by playing well and within the time limits), the player does not care much whether s/he'd win by (2) or (1).
  • once the player realizes that goal (2) is not realistically possible or probable anymore (in the short remaining time),
  • s/he should then aim for goal (1) and pose the other player in difficult positions (for the short or given time). This is standard practice also in even games. Play riskier in lost positions. It works even when one is oneself slightly behind in time and, of course better, if ahead.

nachtrabe With a system like this in place there is no incentive to start a fast game with the intent of running your opponent out of time. Hence, it discourages games where the intent is to drive your opponent out rather than win on the board. I seem to recall one player on KGS who would set up games with only 10 seconds per move and then just move randomly until his opponent ran out of time. That is what it discourages--setting up the game with the intention of running your opponent out. Of course, trying to win on time after having lost on the board is a separate concern.

tderz: Regarding "one player on KGS ... just move randomly": these people, like a certain hahn-xxxx (x=number), are short-term nuisances and are simply put on almost everybody's I-do-not-want-to-play-with-you list. Hence, accounts like this do not get far in the rankings and do not form a long-lasting problem IMO. On the other hand, I must admit that I have played some people who prepared the board with a close-to-zero-seconds time setting, and somewhere in the midst of the average 240 moves of a Go game I thought too long and faltered. Then I forgot to put them on my no-play list (which I do not check actively anyway before accepting a game).
Please notice that the problem you address is put nicely into proportion once you weight short-time games much less (I suggested this previously). If a game with 1 minute of basic time is measured against a game with 1.5 hours of basic time, it should contribute only 1.1% to the overall rating. If you want to include byo-yomi, then there are at least two choices: calculate with the time

  • (1)which was available or
  • (2)which was used.

In chess they use (1): (i) because, as far as I know, byo-yomi is not used there, and (ii) because blitz and rapid games (even when used for match decisions) are not counted for FIDE Elo anyway. You yourself also introduced the argument that (2) would amount to an "ex post facto" rating calculation (all calculations happen after the fact, but not normally based on the time actually used) and would give rise to several small but significant possible manipulations. With method (2), nobody could predict beforehand how much a given game result (win, loss, jigo) would contribute to their rating points. Method (2) adds complexity while not providing clarity. However, I see two extremes which might make both (1) and (2) feasible:

  • i) If the basic time setting is e.g. 90 min, then any byo-yomi below 1 minute (the longest byo-yomi period I know of) is roughly 1% of the basic time. Generalizing, I suggest that games with, say, byo-yomi time (BYT) < 2% of basic time (BAT) (to include 60/1 games) should use only method (1), the available time, for weighting the rating contribution.
  • ii) any other time setting including byoyomi BYT > 2% BAT should use method (2). An example is BAT=1 min, BYT=10s; BYT/BAT=17%.
  • ii-a) If method (2) is used, both players' times should be added and divided by two (I guess this has been suggested already; for completeness I add it here). After all, you try to think in your opponent's time as well.
  • Reduces rank changes based on 2 point losses.

tderz: I guess "smaller than 2 points" is meant. This proposal will sound absurd to high class professional players (I guess).

  • The top player, being able to win with a probability close to 99% (in his/her own estimation) by a sure 0.5 points, OR only most probably (> 50%) by many points (10, 20, 30 ... 80, 90 points?), would then be inclined (forced) to go for the lower winning percentage in order to get his/her rating points? (Actually s/he would have to find the optimum of probability × the rating penalty for a win of fewer than 2 points.)
  • Secondly, this proposal simply shows a preference between different [ext] playing styles: an Ishida Yoshio in his prime would be penalized, while fighters such as Jiang Zhujiu would profit (if they improve and increase their winning ratio, they would gain more rating points from it).

It might compress the ratings into a smaller scale and penalize better players in general, because they are more capable of correctly judging how to win safely by a modest margin (for which they would then get fewer rating points).

nachtrabe There's a difference between a pro player and a 20k winning by a small margin.

  • Disadvantages
    • A loss is a loss.
    • Encourages someone who is losing on the board to delay and lose on time instead.

tderz: correct analysis

Anchoring

So now that we have a "neat and cool" method of determining someone's rank, one of the problems that comes up immediately is that of "anchoring"--how to attach it so that it doesn't drift all over the place (and, consequently, keep them corresponding to a real-world system such as the EGF or AGA rankings). The MLE will keep the distances roughly constant (so that two people who are ten ranks apart will remain ten ranks apart) but it will not keep the values of their ranks locked down.

The solution here is to apply some form of stabilization function that "locks down" the ranks. Alice and Bob may stay 10 ranks apart, but there is a huge difference between them staying 10 ranks apart with one rated at 20k and the other at 30k and one rated at 200k and the other at 210k (or one at 9d and the other at 1k).

There are a few different methods of "rank-anchoring." For the examples, the test set of Alice (30k), Bob (20k), Claire (15k), and Darla (9d) will be used.

Note: All of this written by nachtrabe, conclusions are his own; please comment if you see gaps or flaws in his logic, or think of advantages/disadvantages that he didn't.

Note 2: No decision or voting is needed on any of these just yet. This is just to put the ideas into people's minds and start discussion over what might or might not work.

Minimum Rank, Maximum Rank

System: The ranks are not allowed to go below a certain value or above a certain value. If any rank does, then the ranking algorithm adjusts all of the ranks so that they fall back into the proper range with the outlier lying on the boundary.

Example:

Assume that our minimum rank is 33k and that ranks are drifting downward (downward drift is what I'm seeing in simulation, though I think it could drift either way in a production system).

The system causes Alice to drop to 33k, leaving Bob at 23k, Claire at 18k, and Darla at 6d. When the system would otherwise drift down by another rank (so that Alice would be 34k, Bob 24k, etc.), it adjusts everyone back so that Alice sits at 33k again, and so on.
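A sketch of this adjustment, on a single numeric scale where larger numbers are weaker (30 = 30k), so the 33k minimum rank is a numeric ceiling:

```python
def anchor_to_floor(ranks, floor=33.0):
    """Shift everyone so the weakest player sits no weaker than the floor."""
    weakest = max(ranks.values())
    if weakest > floor:
        shift = weakest - floor  # everyone moves stronger by the same amount
        return {name: r - shift for name, r in ranks.items()}
    return dict(ranks)
```

Because the shift is uniform, rank differences between players are preserved; only the absolute values move.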

Advantages:

  • Doesn't require any complex calculations.
  • Can be done on the fly without keeping a list of every player's individual rank.
  • Robust to the underlying distribution.
    • Only the range of the ranks has to be known, nothing else.
  • Keeps the ranks neatly bounded.

Disadvantages:

  • Requires that the minimum and maximum ranks used be relatively close to the minimum and maximum ranks of the actual players.

nachtrabe: Note that in testing so far the maximum rank requirement has proven only a theoretical concern; the minimum rank has been dominating. This will probably not hold in production, or when the test pool exceeds the 30-10k range, but I figured I should mention it.

  • Very sensitive to single individuals. One outlier could shift everyone's scores.
    • If the minimum is set at 33k and a 40k-equivalent player comes along, that player will be adjusted to 33k and everyone's ranks will adjust up by 7 ranks. A 1k will now be a 7d.

Conclusions: I like this system for testing (when I have absolute control over the test set), but think we need to find another method for when we go live.

Median Rank

System: The median (middle) rank of the entire player population (or those with confident enough rankings) is shifted to some preset value.

Example:

The median rank of our four players is the average of the middle two: 17.5. Let's assume that we set the median at 15k. Now everyone's rank improves by 2.5 (Alice is 27.5k, etc), but will remain stable.

If a shodan player joins, Claire will be on the median. Her 12.5k rank is 2.5 too high, so everyone's ranks get adjusted back down to Alice at 30k, etc.

This shifting is obvious in a small test set, but will be barely noticeable in a large system with a lot of players.
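A sketch of the median anchor on the same numeric kyu scale used above (larger = weaker; dan ranks would be negative numbers):

```python
import statistics

def anchor_to_median(ranks, target=15.0):
    """Shift everyone uniformly so the population median hits the target."""
    shift = statistics.median(ranks.values()) - target
    return {name: r - shift for name, r in ranks.items()}
```

As in the example, four players at 30, 20, and 15 kyu plus one dan player have a median of 17.5; anchoring the median to 15k improves everyone by 2.5.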

Advantages:

  • The range of the ranks doesn't have to be known a priori.
  • Robust to the underlying distribution.
    • Allows the range of the ranks to be just about anything.

Disadvantages:

  • Cannot be calculated "on the fly" in the same manner as minimum and maximum bounds can be; it requires setting the ranks aside after each round of calculations and figuring out what the adjustment factor needs to be for the next round.
  • Likely to drift naturally as the kinds of people joining change, so it will require tweaking to keep everything in line.
    • These adjustments are going to cause small "jumps" in people's ranks.
  • Requires an idea of what the median of the ranks is going to be.

Conclusions: Good system for actual production assuming that an administrator is willing to update it every so often to ensure that the ranks stay "locked on" to a real world system (how we will know this for sure is beyond me). These discrete jumps in rank when that number is shifted also leave something to be desired.

Side Note: The median rating of the EGF players is 1477 (about 6k) and Tukey's Trimean (another nonparametric estimator of central tendency) is about 1455. I'd anticipate for an online server it is going to be much lower--more lower ranked players and the EGF ratings don't even take into account 21-30k players.

Known Rank Player

Take a player (a bot would be ideal, I'd suggest gnugo, but the bot choice is largely irrelevant) of known rank (use, say, KGS's score for the same bot). Adjust so that the player's rank doesn't drift but stays constant at a particular value, then adjust everyone else's rank by the same amount.
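The adjustment itself is the simplest of the three anchoring methods; a sketch, where `bot_true_rank` is the bot's externally known rank (e.g., its rank on KGS):

```python
def anchor_to_bot(ranks, bot_name, bot_true_rank):
    """Shift everyone so the anchor bot sits at its externally known rank."""
    shift = ranks[bot_name] - bot_true_rank
    return {name: r - shift for name, r in ranks.items()}
```

The "fine-tuning" mentioned below is just a matter of offsetting `bot_true_rank` by however far from the reference server you want the scale to sit.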

Advantages:

  • Allows for gradual shifts--no jumps unless the bot is changed.
  • Requires zero knowledge about the underlying distribution and remains robust to it.
  • Self-maintaining.
  • Can be "fine-tuned" to be a certain distance from a given server, e.g., if you want it to be 2 stones stronger than KGS, take the rank on KGS and add 2 to where you set it.

Disadvantages:

  • Requires a lot of people to honestly play that bot (a few spoilers aren't a problem, so long as the majority of games that it plays are honest).
  • Needs the bot to remain up fairly constantly (in order to play a lot of games) with the same settings.
  • Would probably take maintaining the same bot with the same settings on another server, such as KGS or NNGS.
  • If people figure out which bot is being used, it could cause problems, particularly because some people get very good against bots.
  • Requires a relatively large group of people.

Conclusions: Another good method for production systems, but the complications involved need to be weighed.

Evan: My mathematical instinct suggests that it would require quite a bit of conscious effort to really abuse this system. Over long periods (100+ games), the amount of "cheating" against the bot would be pretty much constant. Small fluctuations wouldn't really matter, as they aren't enough to meaningfully affect the rankings. I really don't think there would be a need to hide the fact that the bot was being used as an anchor. In fact I would probably call it "anchorbot" :)

nachtrabe Unless someone creates a bot intentionally designed to tweak with the ranking algorithms and plugs it into an interface so that it can continually play the anchorbot. :) Over the long term, such would probably fail miserably (ranks are not independent), but over the short term it could dramatically affect ratings. I agree that I don't think it is feasible for one person to do it, but I am still concerned about someone building an automated system to engage in such.

One idea to strengthen this is to create several such anchorbots; that would help stabilize things and increase the difficulty of an attack. Another option would be to limit the number of times an anchorbot will play a single individual/IP address in a given timeframe, or some other constraint along those lines.

Root Finding

[Graph of the derivative of the likelihood function]

Finding a zero is more difficult than one would think it should be. Here I will discuss four different methods of determining zeroes in the MLE that I've worked with, along with observations on each.

For those test players I've looked at so far, with the test set data, the function itself looks a lot like the [ext] complementary error function--a somewhat linear region in the middle, then two regions that are severely nonlinear as it transitions into a linear region with a slope (the second derivative of the likelihood function) very close to zero.

Test data set is a group of 100 virtual players who have played matches against each other. Further testing needed to check on scaling.

Bisection Method

By framing the zero so that the function is negative on one side and positive on the other, we can keep splitting the interval in two and finding which half the zero lies in. Lather, rinse, repeat.

Pros:

  • Absolutely certain. So long as the assumptions are met (positive on one side and negative on the other, etc) it will always converge on the root.

Cons:

  • Slow. Unbelievably, painfully slow.
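A minimal sketch of the method; any bracket over which the function changes sign works:

```python
def bisect_root(f, lo, hi, tol=1e-9):
    """Find a root of f in [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    flo = f(lo)
    assert flo * f(hi) < 0, "root must be bracketed"
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if flo * f(mid) <= 0:
            hi = mid               # sign change in the lower half
        else:
            lo, flo = mid, f(mid)  # sign change in the upper half
    return (lo + hi) / 2.0
```

Each step halves the bracket, so it gains only one bit of precision per function evaluation--hence the slowness noted above.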

Secant Method

Approximates a line and uses that to estimate the location of the zero. Uses two initial guesses and calculates the slope, which it then uses to find the next guess.

Basically the same idea as Newton-Raphson, without needing the explicit derivative function.

Pros:

  • Fast, has superlinear convergence on the root (assuming the initial guess is sufficiently close). In the test group it runs in about 1/3rd the time of the bisection method.

Cons:

  • Requires a guess that is close to the root to start out.
  • If the function has a slope that is close to zero it can trip up and have difficulty locating the actual root--settling long before it actually gets there.
  • Tends to misbehave with badly nonlinear functions or in regions of the function that are extremely nonlinear. It will take too long to get to the root and/or require guesses that are entirely too close to the root for comfort.
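A sketch of the method; the near-zero-slope failure mode noted above is handled here by simply stopping:

```python
def secant_root(f, x0, x1, tol=1e-9, max_iter=100):
    """Root via the secant method, starting from two initial guesses."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if abs(f1 - f0) < 1e-15:
            break  # nearly flat secant: the method stalls, as noted above
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)  # where the secant line hits zero
        x0, f0 = x1, f1
        x1, f1 = x2, f(x2)
        if abs(x1 - x0) < tol:
            break
    return x1
```

Unlike bisection, nothing guarantees the iterates stay near the root, which is why good starting guesses matter.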

Framing + Secant

Use the bisection method to get relatively close to the root, then use the Secant method from there.

Pros:

  • Fast, though not as fast as straight secant. Takes about 50% of the time of straight bisection on the test set.
  • Certain, so long as the bisection frame gets close enough to the root for the secant method to work.

Cons:

  • No clear indicator if the frame isn't quite close enough to the root.
  • Still depends on the function being approximately linear.

Ridders's Method

A bracketed approach, similar to the false position method or the bisection method, which factors out a unique exponential function to make the residual function linear. It then closes in on the root in a manner similar to the bisection method--framing the root and isolating the region that it is in.

Pros:

  • Fast, on the test set it has equivalent speed to Framing + Secant (it also has superlinear convergence).
  • Tends to behave better with nonpolynomial functions than the secant method, does not assume local linearity.

Cons:

  • Not as fast as Secant.
  • Implementation is more difficult than any of the other three methods.

[1] tderz: The rating calculations are distorted by these means. Other incentives for playing more teaching/handicap games can be invented. I wrote something about this on the discussion page.

nachtrabe: Would you mind elaborating a bit on how it would distort it?

You seem to be mathematicians, hence I should be more careful with my words. Still, I compare it to raising import/export taxes on products (in economics): it affects all associated products in all countries. It might be that, while reading the entry, I thought that only one player (the stronger one) would get a lower weighting; I assumed that this would distort things. Now I realize that was not what was meant. On the other hand, in addition to my previous dislike, I do not understand why this lower weighting should lead to more handicap games.

Benjamin: Did you have a look at [ext] http://www.pem.nu/mlrate/ ? This seems to be the program used by NNGS.

nachtrabe: Thanks for pointing that out, I hadn't seen it before :-) I just went over it briefly, but am not going to be able to use it, at least for the moment, because of the GPL. My ideological issues aside, I am uncertain what license UGS will finally fall under and I want to make sure that when the rank simulator is complete it will be compatible. (I plan on developing this into a full simulator that can be evaluated over time).

That all having been said: I am actually somewhat comforted by my run-through of their code. They use exactly the same approach I am describing here :-) The major differences are that they use the Bisection Method where I am advocating Ridders's Method, and they use 0.5*S^Rdiff (equivalent to 0.5*exp(ln(S)*Rdiff)) where I use 1 / (1 + exp( a * Rdiff ) ), but that's about it.

Anonymous How about just adopting an existing rating system, such as Glicko rating for example?

nachtrabe: a) This is pretty close to how the AGA does it (theirs is adapted for real-world play; they use a Bayesian model with prior probabilities, which is very good for a real-life system but is less critical in our case, where it is expected that many people will play a lot of rated games). b) This is the standard (and therefore existing) rating system for online go servers: it is identical to how IGS, NNGS, and KGS do it, and I believe it is how DGS does it.


This is a copy of the living page "UGS Ranking System" at Sensei's Library.
(OC) 2005 the Authors, published under the OpenContent License V1.0.