Center Search Algorithms
The problem with center search is that the pattern can occur anywhere on the board, so brute force is out of the question.
Brute force
Search for the pattern across the board in every position.
Time order:
#games * #moves * (board_size - pattern_size + 1)^2
E.g. on a 19x19 board a 7x7 pattern fits at 13*13=169 positions. Multiply by 20000 games * 200 moves = 4 million board positions and you'd have to do about 676 million pattern comparisons. Multiply by 16 to search for all variations and it should be clear that brute force is not the answer. Even if you only compare against end positions, you'd still be looking at over 3 million pattern comparisons (20000 games * 169 positions) for the first pass. Not really an option, is it?
dnerra: Why not? You can conveniently test a whole line of the end position with a single bitmask comparison (something like ((end_position_line & mask1) == mask2)), and most positions should already get rejected by the first line. Doing 3 million such tests seems feasible, no?
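dnerra's rejection test can be sketched as follows; the packing (2 bits per point) and all names are illustrative assumptions, not any program's actual layout:

```python
# Sketch of the line-by-line rejection test.  Each board line is packed
# 2 bits per point (00 empty, 01 black, 10 white).  mask1 selects the
# points covered by the pattern line; mask2 holds the pattern's stones
# at those points.

def line_matches(end_position_line: int, mask1: int, mask2: int) -> bool:
    """True if this board line is compatible with the pattern line."""
    return (end_position_line & mask1) == mask2

def position_matches(board_lines, masks):
    """Check one candidate offset; most candidates are already
    rejected by the very first line."""
    return all(line_matches(line, m1, m2)
               for line, (m1, m2) in zip(board_lines, masks))
```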
kosh: There are approximately 3^361 possible patterns on a 19x19 Go board. That is about 10^172. If you can search 3 million positions per second, you need about 10^166 seconds, which is about 10^158 years, to search through all of them. Even if your computer were 1,000,000,000,000 times faster, it would still take 10^146 years! (Calculation: 3^361/3000000/3600/24/365 ≈ 1.8 * 10^158.)
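The magnitudes above can be checked with exact integer arithmetic:

```python
# Checking kosh's figures: all 3^361 colorings of a 19x19 board,
# tested at 3 million positions per second.
patterns = 3 ** 361
seconds = patterns // 3_000_000
years = seconds // (3600 * 24 * 365)

# patterns has 173 digits (~1.7 * 10^172),
# years has 159 digits (~1.8 * 10^158).
```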
Well Known Algorithms
In string search (which is roughly a 1-D analog of the pattern search described here), there are well known techniques for speeding up the outer loop of the search in case of a partial (but not complete) match. Suppose your search is structured as:
- Loop over all offsets of the pattern on the board, doing:
  - Loop over all positions in the pattern, doing:
    - attempt a match at that position.
... then the idea is that a failure to match in the inner loop shouldn't just break to the outer loop and increment the outer offset by one. Instead, the place at which the failure happens can give you the information to increment the offset by more than one.
This is standard thirty-year-old stuff (in the 1-D case). Do a web search for "Boyer-Moore string search" and "Knuth-Morris-Pratt string search", or try this tutorial: http://www-igm.univ-mlv.fr/~lecroq/string/
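For reference, a minimal sketch of the Knuth-Morris-Pratt idea in the 1-D case; the failure table is exactly what lets a mismatch shift the pattern by more than one:

```python
# Knuth-Morris-Pratt: the text pointer never moves backwards; on a
# mismatch the failure table tells us how far the pattern may shift.

def kmp_failure(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text: str, pattern: str) -> list[int]:
    """All offsets where pattern occurs in text, in O(len(text)) time."""
    fail, k, hits = kmp_failure(pattern), 0, []
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits
```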
Now in the 2-D case it gets a little more complicated, and I recommend you understand the 1-D case first, as it provides the necessary background for all the later work. But again there are well-known algorithms. You might start your reading here: http://citeseer.ist.psu.edu/context/36400/0
I think I implemented something similar to a 2-D string search (some time ago) but never did anything with it. In a whole-board search, the algorithm is able to recognize any kind of pattern (corner/side/center) in one 'scan' with a tree-like structure. I think it's very fast but I didn't do any comparisons. One drawback is db size. It doesn't work with hashes, as I don't believe they're scalable. The learning algorithm constructs a pattern tree. A tree node is a stone or edge, and stone positions (of a certain rectangle) are relative to each other (e.g. color-change, x+2, y+1). The pattern search algorithm then just uses this tree (but has to do some trivial additional checks). Anyway, if anyone is interested, let me know and I'll send you the C++ source code. It's fully regression tested :-)!
I added a page which contains the algo description. You can find it via my homepage.
Hashing
Arno: just had this idea - feedback?
We make a hash index, which works like this: every pattern is assigned a number. For simplification let's assume each position takes 2 bits and we search only for 4x4 patterns. Label the 4x4 positions row by row:

a b c d
e f g h
i j k l
m n o p

and index the pattern with a in the most significant 2 bits and p in the least significant 2 bits (reasons for this ordering below).
E.g. assuming black = 01, white = 10, empty = 00, and black stones on a, c, i and white stones on b, e, m, the pattern-id is 0110 0100 1000 0000 0100 0000 1000 0000 or 0x64804080. This pattern-id is the key to our index.
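The encoding can be sketched in a few lines (the grid-string input format is an illustrative assumption):

```python
# Arno's 4x4 pattern-id: 2 bits per point, position 'a' (top left) in
# the most significant bits, 'p' (bottom right) in the least significant.
CODE = {'.': 0b00, 'B': 0b01, 'W': 0b10}   # empty / black / white

def pattern_id(cells: str) -> int:
    """cells: 16 characters, positions a..p row by row."""
    assert len(cells) == 16
    pid = 0
    for c in cells:
        pid = (pid << 2) | CODE[c]
    return pid

# Black on a, c, i and white on b, e, m (the example from the text):
pid = pattern_id("BWB."
                 "W..."
                 "B..."
                 "W...")
# pid == 0x64804080
```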
For each move in a game we calculate all pattern-ids that occur anywhere on the board, though we only store the first occurrence of a pattern in a game.
Let's assume on average each move creates 16 new patterns (e.g. the first move creates the 16 patterns where the black stone is at each of the positions a-p), and so on for each move (note: capturing stones could create more patterns).
The result is an index which tells us whether a given pattern, e.g. 0x64804080, occurs anywhere in a game. Upper bound for the number of pattern-ids: 16 patterns/move * 200 moves/game * 20000 games = 64mio pattern-ids (I'm sure the actual number is lower).
For each entry we store the game-id (2 bytes). So the index would need 64mio * 2 bytes = 128MB on disc. Add some management info as well and we get close to 150MB. If we store the move-number (1 byte) and some flags (1 byte) as well, each entry takes 4 bytes, which would grow the table to 300MB.
So now we have a 300MB index with the pattern as key. We could retrieve all matching games in O(m), where m is the number of matches - for center search!!
mdm: Unfortunately this is not quite correct. If you use the pattern-id as a key directly you don't have a hash but just a fancy array. Thus you need 2^32 (32 = bits of the pattern-id) or at least 3^16 (initially empty) entries; otherwise you cannot address the entries directly. This also assumes that collision handling is done via linked lists (otherwise you would need a secondary hash function and could not complete your search in linear time). But for those you need pointers: in most modern programming languages that means an extra 4 bytes per entry, even for empty entries. Assuming that an empty hash entry is a null pointer (4 bytes) and a filled entry a linked list (surely more than 4 bytes) gives a hash size of at least 4*3^16 = 164.2MB, not counting any data contained in the hash.
I'm not saying this idea is completely bad, but be aware that the index-size might be well over 1GByte.
Other pattern sizes
Smaller sizes: here our numbering scheme pays off: assume a 3x3 pattern: all keys of the form 0xppppxxxx (p .. pattern bits, x .. anything) yield match results. As the index is sorted by key, the results form one contiguous block in our index.
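Assuming the sub-pattern really does occupy the most significant bits of the key, fetching that block is just two binary searches; a sketch with illustrative bit widths:

```python
# All keys whose top bits match a fixed sub-pattern form one contiguous
# slice of the sorted index; find its bounds with two binary searches.
from bisect import bisect_left

def prefix_block(sorted_keys, prefix, prefix_bits, key_bits=32):
    """Return the slice of sorted_keys whose top prefix_bits equal prefix."""
    shift = key_bits - prefix_bits
    lo = bisect_left(sorted_keys, prefix << shift)
    hi = bisect_left(sorted_keys, (prefix + 1) << shift)
    return sorted_keys[lo:hi]
```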
Larger sizes: tricky. Basically we would have to merge index results. E.g. a 8x7 pattern can be covered by four 4x4 patterns. All four patterns must occur in the game. As each pattern-id has a list of matching games which is sorted by game-id, we just have to look at each result list of those four 4x4 sub-patterns and make an intersection of the lists (as they are sorted, this operation is cheap). That should cut down the results. Granted, we still have to look at each result to see whether the pattern actually occurred (after all, the four 4x4 patterns could occur anywhere on the board). But it still should be fast enough for most searches.
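The merge of the sorted result lists can be sketched like this (list contents are illustrative):

```python
# Intersecting the sorted game-id lists of the four 4x4 sub-patterns:
# linear-time merge of already-sorted lists.
from functools import reduce

def intersect_sorted(a, b):
    """Intersection of two ascending lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def candidate_games(result_lists):
    """Games containing every sub-pattern somewhere on the board --
    candidates only, still to be verified against the actual position."""
    return reduce(intersect_sorted, result_lists)
```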
I'm sure this solution could still be improved. Especially in terms of index size. --Arno
UlrichGoertz: This sounds interesting. I'll add a few remarks which came to my mind when I read through the text; maybe I can say more after thinking about it some more.
First, I am not sure if I understand the calculation of the size of the hash table. Some stupid remarks which Arno convincingly corrects below, deleted.
Another point is that checking whether a candidate (with the right hash code) really is a hit becomes more complicated than checking a candidate which has been found by end position comparison, since the hash code gives no information on where on the board the search pattern actually can be expected.
Finally, it will take much longer to extract all the necessary data from game records in SGF files, but that of course seems a natural price one has to pay for a faster search function.
Arno: about the size of the "first list" (hashcode -> pointer): there are fewer than 64mio possible patterns: in a 4x4 there are 3^16 ~ 43mio possible patterns. Plus, instead of doing 16 searches for all variants (mirror, rotate, color reverse), we assign all patterns which are variants of each other the same id! I'm not sure if the resulting number of patterns really is 43mio/16, but I think it's safe to say that there will be no more than 5mio unique patterns. Benefit: a smaller table, and by searching for one pattern we automatically search for all variants! So, 5mio * 4 bytes = 20MB for the first table (as we store all ids in the table whether or not they have any entries associated with them, we don't need to store the pattern-id itself in the table, but can use the pattern-id as an offset into the table).
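Assigning one id per variant class can be sketched as follows: generate all 16 variants (4 rotations, each with/without mirroring, each with/without color reversal) and take the smallest pattern-id as the canonical one. The grid representation is an illustrative assumption:

```python
# Canonical id for a 4x4 pattern: the smallest pattern-id among its
# 16 variants.  Grid format (an assumption): 4 strings of 4 characters,
# '.' empty, 'B' black, 'W' white.
CODE = {'.': 0, 'B': 1, 'W': 2}
SWAP = str.maketrans('BW', 'WB')

def pid(grid):
    """Pattern-id as in the text: 2 bits per point, top left most significant."""
    n = 0
    for row in grid:
        for c in row:
            n = (n << 2) | CODE[c]
    return n

def variants(grid):
    """Yield all 16 variants of the grid."""
    g = grid
    for _ in range(4):
        # rotate 90 degrees clockwise
        g = [''.join(row[i] for row in reversed(g)) for i in range(4)]
        for h in (g, [row[::-1] for row in g]):       # with/without mirror
            yield h
            yield [row.translate(SWAP) for row in h]  # with color reversal

def canonical_id(grid):
    return min(pid(v) for v in variants(grid))
```

Searching for the canonical id of a pattern then automatically finds all its variants, as suggested above.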
On average each pattern will have ~13 entries (64mio/5mio). Of course there are some patterns (e.g. the empty pattern, patterns with one stone) which will have 20000 entries, as they occur in every game.
About your second point, verifying candidates: I don't think it becomes harder. We use the methods you already use in Kombilo (i.e. end-position comparison etc.) - remember: we are just checking candidates. Furthermore, if we store the move number of the first match (which results in a larger table), we already have a good starting point. For larger patterns, the highest move number among the sub-patterns is our first try, and we work our way upwards from there. Actually, if the end position stores the move number for each intersection (an idea by Arend [1] you mentioned in SearchAlgorithms), maybe it could be improved further.
And yes, indexing the games would take longer. But even if indexing a single game takes up to 4 seconds, indexing 20000 games is done in less than a day - a price I'm willing to pay :o)
Furthermore, those indexes could be shared for standard game collections - so not everyone has to recalculate them!
I agree that my idea is very rough and needs further thinking. In the end, I reckon someone just has to try it out (i.e. code it) in order to see whether it's feasible or not.
On a side note: end-position diagrams have "jokers" in them - points where during the game both a black and a white stone occurred. I haven't looked into Kombilo to see whether those jokers cause much harm, but one could avoid them: a joker can only occur when a player plays on a point where a capture has taken place. If you look at pro games, there are not many captures at all - maybe 6-8 throughout the game. I.e. if you store 6-8 end positions per game (one at each point of the game where a capture occurs), the need for jokers in the end position disappears. With some tricks, ko captures would not need to be dealt with every time, but only once for each side. I don't know if the trade-off "no jokers" <> 6 times as many end positions is a good one, though.
UlrichGoertz: I agree with your remarks on the number of patterns actually occurring. It's true that your method could be combined with end position comparison. That means that it wouldn't be an improvement for extremely frequent patterns, but those are of course not very relevant, anyway.
So this really sounds good. I agree that it should be tried out ... I won't have time to do it in the next one or two months (want to finish Kombilo 0.5 first), but if someone wants to do it using Kombilo as a basis, please go ahead (and send me a patch afterwards :) ).
As for the jokers: Since they are quite rare, as you say, I think they don't cause much harm. In any case, I'll think about your suggestion.
Arno: yes, I think in Kombilo's case jokers don't matter much. But while I was thinking about other index methods, jokers were an annoyance sometimes. That's when I came up with this idea. I thought I'd share it; maybe someone else can use it :o)
dnerra: Thinking about it a bit more: it is probably not much help for smaller sizes. For a 3x3 pattern, you'll have to check 3^7 = 2187 keys (one for each filling of the seven remaining points of the 4x4), each of them with on average 13 matches -- so you probably end up scanning the whole database. But hey, who other than someone stress-testing search algorithms will want to search for a 3x3 pattern anyway!
As for Kombilo, I am actually not sure whether most users would see it as an improvement to spend 300MB of hard disk for a slightly faster search, given that it's typically reasonably fast already. That is quite different from the situation on a server, of course.
Arno: I did a first preliminary examination:
4x4 contains 43046721 (= 3^16) patterns. Taking rotation, mirror, and color reversal into account, 4x4 has only 2700373 distinct patterns, of which 88800 are illegal positions (i.e. they contain stones without liberties), leaving 2611573 possible 4x4 patterns.
As the index number of patterns covers the whole range (43mio), I think the first table should store index number + offset into second table (makes 8 bytes / entry) - thus the first table is about 20MB.
I also indexed 2323 games with an average of over 200 moves/game. On average they created just over 2000 unique patterns/game (unique means unique per game, not unique for all 2323 games) or 4693464 pattern-entries in total.
That suggests that for 20000 games the second table would contain ca. 41mio entries, not 64mio as speculated above.
Some detailed analyzing:
- 891918 distinct patterns found in 2323 games (out of the 2.6mio possible patterns)
- 832428 patterns are found in 10 games or less (i.e. 59490 patterns are found in more than 10 games)
- 55148 patterns are found in 11-100 games
- 2316 patterns are found in 101-1000 games
- 224 patterns are found in 1000-2000 games
- 61 patterns are found in more than 2000 games
The following table shows number of stones in pattern, how many patterns were found with that many stones, how many games were found on average per pattern:
 0 :      1 : 2323
 1 :      3 : 2323
 2 :     42 : 2017
 3 :    296 :  985
 4 :   1896 :  229
 5 :   8528 :   58
 6 :  28079 :   18
 7 :  62519 :    8
 8 : 101444 :    5
 9 : 132218 :  3.4
10 : 147664 :  2.8
11 : 143522 :  2.5
12 : 120308 :  2.4
13 :  83244 :  2.5
14 :  44238 :  2.8
15 :  15334 :  3.5
16 :   2582 :    5
Basically this says that any pattern containing only one or two stones is not worth indexing (there are two patterns with two stones which have only 393 and 924 associated games; the other 40 have more than 1000 games each).
Useless facts: there's a 16-stone pattern which occurs in more than 50 games; the first unique patterns appear with 4 stones.
Basically, these results suggest adding an index for e.g. 7x7 patterns. However, in order to limit the table size, limit it to patterns with 5 stones or fewer.
In combination with a 7x7 index, one could cut down the table size of the 4x4 patterns by omitting any pattern with fewer than 4 stones, and by replacing the 10-16 stone patterns with 10+ patterns (i.e. 10-stone patterns that match all the 10-16 stone patterns that can arise from the base 10 stones). That would save close to 1,500,000 entries (~30%) in the table.
Kosh: Arno, I see a few possible problems and want to share them with you. I know it's a "center search algorithm", but these questions could be relevant for further investigation:
- How about ladders affecting patterns on the other side of the board? (not edge related)
- What can you do with patterns near the edge?
- How about mirroring and rotating patterns near the edge? Although a white stone is on the same spot on both boards, a different rotated mapping would cause the stone to be mapped to a different place.