Database Search

Databases of Go games seem to become useful once they hold about 10000 high-level (pro) games. (Recent databases contain five times as many professional games.) Fortunately, a number of such tools now exist, and players everywhere can begin to benefit from them.

There is something of a knack to posing good questions to a database equipped with a good search engine, such as Kombilo. In general, one needs to choose sub-boards of appropriate size with some given stones. For example, for a joseki, you need to define a corner that is large enough, so that nearby stones already played don't have an impact. To find standard life and death positions, one often needs to use wild-card stones intelligently.

Heuristics

Winning percentages are useful, but do not overestimate them: they are only one feature of databases. There is much more to learn from database searches.

You may want to restrict a search to a specific period. There is much to be learned about historical changes in professional play, and you may even be able to track down the first occurrence of a move. That a sequence is no longer played may be as important as new sequences arising, and you will not notice it unless you search with differing time frames.

Wherever you come upon a junction with several branches, take a look at the whole board in the games in your database. You may learn something about the choice between the branches that you usually cannot find in a joseki dictionary.

Try varying the range of your search: even a stone six lines away may influence the sequences you are researching.

There are blocked configurations. They never arise in professional play because the way to them contains outright bad moves. Your database may not tell you why a move is bad; you have to figure that out yourself. It can, however, lead you to the branching point where your play left common knowledge.

Do not believe the reason for a change is necessarily given in your database. A new idea may have been tried on a Go server or in some obscure local tournament first, spreading from there, leaving your database with only the change to notice, but without a clue about what made it happen.

If you recognize a change in playing practice, try to figure out why the new version may be better than the old one. Does it make a later endgame move playable in sente, does it leave some aji that did not exist in the old version, or is it something totally different? (Do not stop at the end of the joseki; look at the follow-up.)

Discussion

Stefan: There's something I don't understand about these database searches, and now seems like the right time to ask the experts. I guess the simplest way to phrase my problem is whether it's statistically sound to draw conclusions like this or not. At first sight the number of games looks big enough to make conclusions relevant, but have we checked whether we needn't adjust for a number of potentially disturbing factors?

Some factors that I can think of:

- A large portion of games involving a certain pattern tend to be generated by a small number of players, as Dave points out above. They don't call it Kobayashi style for nothing. :-) Dave compares the percentage with player X's win/loss rate with the pattern under analysis, but how about X's overall win/loss?

- The result of some of the games depending on some clear other reason: one of the players' position collapsing, an obvious pinpointable losing move, etc. Wouldn't you filter these out first? In a population of 200 games, only 4 games with such a calamity could swing the winning percentages from 51/49 to 49/51. Admittedly the percentages of database searches on these pages have often been more outspoken than that, but you see my point.

There are several math buffs hanging around here, so somebody must be able to reassure me.

Charles: Well, if I don't like the statistics I see, I do a slightly different search, until I get something better.

Seriously, if it's statistical inference trying to prove something, the sheer number of games you are sampling becomes the dominant factor. Anything based on 500 games is much more serious than anything based on 50. And 50 games in one pattern is a minimum, really. Below that threshold you tend to believe what the strongest players play. It will be good enough anyway for strong amateurs.

The other factor that matters is getting a representative sample of games, across time and across the various schools of pros (Four Houses and Outside the Ki-ins). So a search through the Nihon Ki-in database (35000 games mainly 1965-1995) will be biased in ways that smaller databases with more varied geography avoid.

Anyway, the point is to find stuff that's interesting. I'm often looking for the stuff they don't play (blocked configurations), rather than the main lines; but the latter are now much more accessible.

Dieter: Earlier, I refrained from making a comparable comment. I'm supposed to be a maths buff but my days with the statistics sword go back some time. My question: given a pattern that does not give any particular disadvantage (supposing we know that), what's the probability that it lies within 5 percent above or below the average winning percentage (supposedly 50-50)? If that probability is about 95%, I think we can't disprove one of the patterns under study to give an equal result.

dnerra: Depends on the number of games, of course. If you take n random games, then 95% of the time the winning percentage for White will lie in the interval 50 ± 100/sqrt(n) percent. I.e. for 100 games it's 40-60, for 1000 games still 47-53.
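
As a rough illustration of this rule of thumb, the short Python sketch below evaluates the interval for a few sample sizes. The function name `white_win_interval` is made up for this example, and the rule itself assumes that every game is an independent 50-50 coin flip, which real game collections only approximate.

```python
import math

def white_win_interval(n):
    """Approximate 95% range (in percent) for White's winning percentage
    over n games, assuming each game is an independent 50-50 event.
    This is the 50 +- 100/sqrt(n) rule of thumb from the discussion."""
    if n <= 0:
        raise ValueError("n must be a positive number of games")
    margin = 100.0 / math.sqrt(n)
    return 50.0 - margin, 50.0 + margin

for n in (50, 100, 500, 1000):
    low, high = white_win_interval(n)
    print(f"{n} games: {low:.0f}-{high:.0f}%")
```

Because the margin shrinks only with the square root of n, quadrupling the number of games merely halves the width of the interval.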

ThaddeusOlczyk: I would like to remind people that in any given sample it is not clear that the distribution is 50% Black wins and 50% White wins. White, for example, is typically stronger, but Black has an advantage. So the sample may be skewed, and one should take this into account. When using Kombilo I always pick a point and put in a wildcard (so that I match anything) to see what the distribution is.
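
One way to take a skewed baseline into account is to compare a pattern's result with the database-wide winning rate instead of with 50%. The Python sketch below does this with a simple two-standard-deviation band. The function names and all the numbers in the example are hypothetical, and the method is an ordinary normal approximation, not anything Kombilo itself provides.

```python
import math

def baseline_range(baseline_rate, n, z=2.0):
    """Range in which White's win rate over n games should fall roughly
    95% of the time if the pattern changes nothing and every game is won
    by White with probability baseline_rate (normal approximation)."""
    sd = math.sqrt(baseline_rate * (1.0 - baseline_rate) / n)
    return baseline_rate - z * sd, baseline_rate + z * sd

def looks_unusual(white_wins, n, baseline_rate):
    """True if the observed White win rate lies outside that range."""
    low, high = baseline_range(baseline_rate, n)
    return not (low <= white_wins / n <= high)

# Hypothetical numbers: the wildcard search shows White winning 47% of
# all games, and the pattern occurs in 200 games with 110 White wins.
print(looks_unusual(110, 200, 0.47))  # measured against the real baseline
print(looks_unusual(110, 200, 0.50))  # measured against a naive 50-50
```

With these made-up figures the same 55% result falls outside the band around a 47% baseline but inside the band around 50%, which is why it matters to check the baseline first.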