Arno: I have been thinking hard about how game positions can be searched fast&effective. Kombilo has some nice ideas, but GoBase (Jan's site) seems to be top of the crop. Unfortunately, Jan does not share his ideas. Anyone is welcome to improve on the ideas presented below. The pages are written in form of "thinking aloud". Any idea welcome ...
- up to 20000 games
- on average 200 moves per game
- search for positions in under 1 sec
- don't use more than 20MB RAM
- but can use (almost) any amount on harddisc (let's say <1GB)
I.e. algorithms have to be fast and use few resources. Search time is crucial if you use it for web sites (many searches can happen within a short time)
We can distinguish between:
- fullboard search
- corner or side search
- free search (center search)
I think 1 & 2 are easily solved. But 3 is the real killer.
Note that mirror positions (rotate and flip) and color change make up 16 variations of each search pattern. But let's just assume we do 16 separate searches (i.e. one search for every variation).
UlrichGoertz: Here is a brief description what the current version of Kombilo does: http://www.g0ertz.de/kombilo/tutorial.html#searchalgorithm . It basically uses the methods which Arno described, i.e. hashing for full board and corner positions, and end position comparison; then it plays through the game to confirm the occurrence of the search pattern (and to find out the move number, when it occurs).
The end position comparison is done in a brute force way right now; this could probably be improved by using a two-dimensional pattern matching algorithm (e.g. some variant of Knuth-Morris-Pratt). I haven't got around to implementing this yet, and I have no idea how much of an improvement it would actually be.
Another modification which is useful at least for certain search patterns is to store for each point on the board the move numbers when a black/white stone is placed there. This gives (for each game and) for each point and each color a union of intervals, and instead of playing through the game the occurrence of a pattern can be confirmed by computing the intersection of the corresponding unions of intervals of the relevant points and colors. (I hope this description is understandable, if not I will be happy to elaborate.) This is advantageous for small patterns with relatively many stones; on the other hand, for large patterns with few stones, it is usually faster to play through the game. I have played around a bit with this idea, and Arend Bayer renewed my interest in it when he sent me a few suggestions how this can be further improved. So maybe some day I will change Kombilo's search algorithm along these lines.
By the way, I am not sure how much faster Jan Steen's GoBase really is. I think that Kombilo is a bit slower than what the numbers on the performance of GoBase in Jan Steen's report at the Myong-ji conference (see http://gobase.org/studying/articles/myong-ji/) indicate, but not very much. In some cases Kombilo even seems to be faster (e.g. searching for the empty board); of course the GUI operations which Kombilo has to do should not be included in the search time for these comparison purposes - they are included in the search time which Kombilo shows in the progress bar, though. Unfortunately, Jan's report doesn't give any numbers for a center search.
Apart from the interesting question of finding a good search algorithm, from a practical point of view one can of course limit the time used by stopping the search after a certain number of hits has been reached. Especially if you want to offer a pattern search engine on a web site, this is a reasonable way to proceed.
Jan van Rongen: I ran some tests in the past (do not remember whether I sent them to Ulrich): Kombilo on 23,000 games comes pretty close to 1 sec for full board search if one optimizes the directory structure (windows XP, other windows were horrible). All other searches on the very same database are generally under 3 seconds. I have to make a pattern with many wildcards to push the free search over 5 seconds. So for me the algorithms are good enough, especially when the next search on the subset is performed in microseconds. How do you think we found all those Hikaru no Go games? Of course with Kombilo!
Arno: I doubt that Kombilo is as fast for center searches as GoBase. Moreover, if you have a look at Jan's text you will see that apparently the search speed is independant of the database size (Kombilo's search isn't). Moreover, GoBase creates a whole variation tree (not only for the next move, but up to 30 moves deep).
I agree that doing a fullboard search is easy, when using hashes (as discussed). Even corner search is feasable and fast. But, center search seems rather awful. After all, you have to check for the pattern at every position of the board. You can easily do the math yourself.
I'd be interested in good ideas on pattern searches. Maybe once we have solved the basic problems, we can think up advanced algorithms using wildcards etc.
p.s. please don't misunderstand me: I like Kombilo, but I think there's room for improvement.
UlrichGoertz: I agree that GoBase seems to be generally faster, and I certainly agree that there is room for improvement in Kombilo's search algorithm. But I would still like to know how much faster GoBase actually is, resp. in which situations it is faster. Could you point me to some evidence why GoBase is faster for center searches? You are right that Kombilo doesn't build a "search tree" beyond the very next move played in the pattern. But Jan's text contains some numbers where the search tree feature is switched off in GoBase, too.
I'm not sure if I understand your remark on independence of the database size. Kombilo's search time grows linearly with respect to the database size. Doesn't that seem to be the same for GoBase? I.e. in both programs, the number of hits per second should be essentially independent of the database size. Please correct me if I got that wrong.
Arno: Ulrich, I think you are wrong. Check Jan's paper: hits are successful matches. So he says that whether the database contains 10 games or 30000 games his algorithm will always need only 0.26sec to find 2 matching games. OTOH Kombilo has to look at each game in order to decide whether there is a match or not. Thus Kombilo's time is O(n), but Jan's is O(1) with regard to n being number of games in db (or O(m) (m ... match)). Now, this is only possible if Jan uses some kind of hash or some other nifty data structure. What puzzles me most is that this search speed is apparently independet whether the pattern is fixed in one corner or floating around the board. Now, how's that? I don't have an idea how this can be accomplished. Somewhere in the paper Jan says that the site only takes up 250MB of memory, so he can't have too large indexes either. Currently I have no clue how this could be accomplished.
UlrichGoertz: In his paper, as far as I can see, Jan dedicates one graph and two sentences to the comparison of a medium size database (14407 games) and a large one (33828). The graph shows that the number of hits per second is more or less constant, if there are many hits. I am unable to read anything from that graph for searches with less than 500 hits. Then he says: "So, although the database doubled in size it didn't seem to affect the speed at which searches are performed." A few lines above, he says that the number of hits per second could be defined as the "search speed". We seem to draw different conclusions from this, which are both consistent with what Jan says. You think that even for search patterns with only a couple of hits the number of hits per second is constant, i.e. the search time does not depend on the database size. I think that from Jan's graph we can only conclude that the number of hits per second is constant for search patterns with many hits, i.e. in cases where we expect roughly twice as many hits in the large database. I find it a bit unlikely that by "didn't seem to affect the speed" he means that "it is clear from the structure of the algorithm that the search time depends only on the number of hits, even if there are only 1 or 2 hits".
By the way, in a database with only 10 games, 0.26 sec. for 2 hits would be pretty slow, I think :)
Furthermore, I still cannot find any information on GoBase performance for center searches in Jan's paper. Of course this doesn't prove that what you say is wrong, but I just don't find it self-evident. For corner and full-board patterns, the use of hash tables actually gets Kombilo's search time close to O(m), where m is the number of hits, since it only has to look at games with the right hash code. So the question of constant search time really only is problematic for center searches. Possibly GoBase can do center searches in O(m), m=number of hits, too - but I am unable to find an indication of that in Jan's article.
Be that as it may, of course a data structure like the one you described would be the best solution, so this is reason enough to try to find one. It's good if many people think about it, and I for one will continue to do so.
Furthermore, I agree that probably the GoBase search engine is quite different from Kombilo, and it is certainly built on a powerful algorithm. One thing I find very surprising, by the way, is that the GoBase search for the empty board takes very long (50 sec. on a PIII-500, 14407 games, search tree turned off). Kombilo is definitely much faster than that (also without hashing). Not that the empty board would be a particularly relevant search pattern, but this shows that the algorithms must be fundamentally different.
Please don't misunderstand me, either :) As you may guess, I spent some thought on these things, too, and I am very interested in any suggestions that might result in an improvement of Kombilo's algorithm. So I will be glad to follow any discussion which develops here, and will join in if I can contribute something. I'd sure like to see someone come up with new ideas.
Arno: ok, I try one last time, then we can concentrate on the problem at hand: on page 15 in Jan's paper (section I.6) the first table indicates how long it takes to search for specific patterns. The patterns are shown in the previous pages Figure 1-5. Now, I argue that Figure 1 and 3 (both searching half of the board) will take exactly the same time in Kombilo (not using hashes), namely O(n + m) (have to look at n positions, output m matching positions). As n>>m this is ~O(n). From the speed given in GoBase I deduce that Jan's search is only O(m). I.e. search for pattern in Figure 1 produces 2 match results (=hits in Jan's terms), Figure 3 produces 44 results (=hits). From this and other searches he deduces: GoBase is independet of the database size (no O(n) component), but dependent on how many results (=hits) are found (O(m)). I.e. even if you have 100000 games and search for a pattern that produces only 2 matching games (=2 hits) the search speed would be ca. 0.26 sec.
Jan's graph (comparing medium and large db) says, GoBase is independant of the database size, it produces 400 hits/second. I.e. any pattern matching less than 400 positions is done within a second, no matter how large the database size.
Furthermore, as you can see GoBase's search speed doesn't differentiate for fullboard, corner or side search. With side search you would aready have to do translations (i.e. more pattern comparisons/position).
So I argue that GoBase's search is fundamentally different and uses some kind of clever data structure. What puzzles me most is that he only needs 1KB/game. So it doesn't seem to be a straight forward hash index either.
UlrichGoertz: I agree that GoBase's search is fundamentally different und uses some kind of clever data structure. From what I know, the algorithm seems to be better than Kombilo's in many (most?) cases. On the other hand, I am not convinced by your arguments. All the tables in Jan's paper only compare different search patterns for the same database (I prefer to look at the table where the tree building code is switched off). It is true that in these tables the search time is O(number of hits). But since they are all for the same database, how could one see an O(number of games) component? As for the graph where two different databases are compared, it is true that the number of hits per second appears to be the same, if there are many hits - note that the line for the large database is much longer, since for most of the patterns there are twice as many hits. Are you really able to interpret this graph for 2 hits? And, seriously, if the search is really O(number of hits), without any dependence on the number of games, why is it so slow? It should certainly be possible to find 2 hits in a 10 games or 100 games database in less then .26 seconds.
So I think that there is not enough theoretical evidence to conclude that GoBase's search is O(number of hits). Nevertheless, comparing Kombilo and GoBase, it seems that GoBase is doing much better if there are only few hits. After all, the .26 seconds for 2 hits are on a Pentium I-233, and Kombilo won't even get close to that (without hashing). This does indeed suggest that GoBase doesn't have to look at each individual game to find these hits, but that there is some kind of clever index. (For searches with lots of hits, Kombilo's search time is closer to GoBase's, and Kombilo seems to be faster if there are extremely many hits.) It would be really interesting to see the GoBase search time for a few center search patterns.
I just made a small test with the GoBase pattern search engine which is available on Jan's web site. Of course, this is not at all reliable for measuring search time, and can at most give some indication. Anyway, here is what I did:
The point one has to observe is that the search stops after 100 hits - this usually makes the search time very short, of course. So I tried to search for a pattern with less than 100 hits in the whole database (Pattern 1, with actually no hits at all), and compared this with a pattern with many hits (Pattern 2).
I only tried each pattern twice, because I didn't want to put too much load on Jan's server, but the first pattern took significantly longer in both cases. As I said, this is not really reliable, but it may be an indication, that the GoBase search time at least does not only depend on the number of hits.
A final remark on the difference between corner/fullboard and side or center patterns: Of course, for the latter two, you need to consider translations, but that does not necessarily mean that the number of hits per second goes down. If the pattern is sufficiently frequent, in the Kombilo algorithm the confirming of candidates (by playing through the game) takes much more time than the search of candidates (by end position comparison). So for the side search pattern which Jan uses in his article, Kombilo's number of hits per second might be a bit less than the number of hits/sec. for a corner pattern, but definitely not by a factor of 11 (which is the number of translations which have to be considered). If you search for an extremely frequent pattern, with several hits in each single game (like the 2x2 pattern with a cross-cut), the number of hits per second gets quite high.
dnerra: I am sure Jan van der Steeen means linear growth with respect to the database size (the number of hits will grow linearly with the database, you have to at least one operation per hit, at which point the mathematician is satisfied to have proven that nothing better than O(n) exists ;-)). It is a pity that he doesn't share his ideas. The only hint is that he stores about 1Kb of information per game, which is a little less than kombilo (including its hash tables).
I am also undecided whether GoBase's center search is really faster than kombilo. Unless I overlooked something, all the examples patterns given in his paper are corner or edge positions. This makes any search algorithm faster (because you have to test only 1/19th as many positions), but you could probably also use some hashing for edge positions.
Looking at the code in Kombilo, it seems to me that it does not do the pattern search by a bitmask comparison (as I tried to describe in CenterSearchAlgorithms). Maybe I find time to try this out (and send you a patch, Ulrich).
UlrichGoertz: Kombilo does use bitmask comparison (at least in the current version). Of course, it may be possible to improve this further; one way to do it might be a special two-dimensional pattern matching algorithm, as I said above.
dnerra: I see; I checked a rather old version (0.3).