As I worked to come up with solid defenses of my handicap system, I stumbled across a dilemma that, the more I thought about it, left me incapable of focusing on the matter at hand. Simply put, the problem with which I struggle is that rating systems appear to be rather useless. I will try to be concise, but the issue has been a bit hard to pin down in my mind, so forgive me if I ramble.
The starting point is this. Handicap systems can exist without rating systems, but rating systems cannot exist without handicap systems. If players get together and play each other regularly, they can, through trial and error, determine what fair handicaps are for them using any number of methods. I doubt that anyone would disagree that the proper handicap between two players should be based upon the differences in strength in the various aspects of their playing styles. Suppose we have two players, A and B. When Player A takes Black against Player B without giving compensation, B tends to leap upon slight weaknesses in A's opening to take a small territorial lead, and then outlasts him by avoiding fighting for the rest of the game, since she knows A is better at fighting. Normally, A would take two stones against B, but then A uses an influence-style game to force fights and always beats B. An appropriate solution would be to instead give A just enough points as compensation to force B to play more aggressively without forcing her to have to fight all-out. Thus, two players can decide through trial and error what is a fair handicap between them.
When more than two players are considered at a time, interesting relationships can be discovered between different playing styles. It is feasible that three players might have playing styles and weaknesses such that each of them loses to one while he beats another, forming a Rock-Paper-Scissors system. Each of these players should take an appropriate handicap against the player that beats him to compensate for the weakness that the other player exploits. Because each of these players beats one while losing to the other, it is reasonable to conclude that they are of equivalent playing strength, though their strengths are not equal.
Rating systems are numerical representations of handicap systems that enable players who have never played each other to make their first games as close as possible. A perfect rating system would let any two players meet with exactly the handicap that makes the game as even as possible. The handicap would need adjusting only when one of the players grows stronger, and that change would be reflected in the rating itself.
However, a perfect rating system is likely impossible. The idea of distilling a scalar value that represents a player's strength when there are so many overlapping areas that affect the game's final outcome seems more laughable the more I think about it. To measure a player's abilities properly, one would need a tuple with an axis for each of several discrete areas of play (good luck determining what those would be... reading, counting, use of influence, sabaki, and even emotional maturity might be candidates). Each of these theoretically discrete areas of play would not only need to be numerically quantified (what is 1-dan counting as opposed to 1-kyu counting?), but all the interactions among all the axes would need to be carefully considered to determine the proper handicaps to take.
Consider the Rock-Paper-Scissors situation described earlier. Three players, each of whom consistently beats one while losing to another, would be considered equals in a scalar rating system, when it is apparent that their strengths are equivalent but not equal. If it were possible to numerically quantify the players' strength in each discrete area of play, it would be possible to demonstrate why each of the players beats whom he beats and loses to whom he loses.
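To make the Rock-Paper-Scissors point concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the three skill axes, the numbers, and especially the rule that a player wins when he is stronger in most areas. Real games are nothing like this tidy, but it shows how three profiles can collapse to the same scalar while still beating each other in a circle.

```python
from itertools import permutations

# Hypothetical skill profiles: (opening, fighting, endgame), arbitrary units.
PLAYERS = {
    "A": (3, 1, 2),
    "B": (2, 3, 1),
    "C": (1, 2, 3),
}

def scalar_rating(profile):
    """Collapse a profile to a single number (here simply the sum)."""
    return sum(profile)

def beats(p, q):
    """Toy rule (an assumption): p beats q if p is stronger in more areas."""
    areas_won = sum(x > y for x, y in zip(p, q))
    areas_lost = sum(x < y for x, y in zip(p, q))
    return areas_won > areas_lost

# Every player gets the same scalar rating...
ratings = {name: scalar_rating(profile) for name, profile in PLAYERS.items()}

# ...yet the head-to-head results form a cycle.
results = {
    (a, b): beats(PLAYERS[a], PLAYERS[b])
    for a, b in permutations(PLAYERS, 2)
}

print(ratings)
for (a, b), won in results.items():
    if won:
        print(f"{a} beats {b}")
```

The scalar rating cannot tell these three apart, while the tuple explains each result: A out-opens B, B out-fights C, and C outlasts A in the endgame.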
I want to go on about this, but it has taken me a ridiculous amount of time to write this little, and my efforts to explain myself clearly are not bearing much fruit. If anybody else has any ideas about how my grumblings can be solved, or if they see issues that I have not brought up, I would be delighted for a continued dialog.
It is a well-known problem for rating systems that Go strength is not necessarily a one-dimensional variable. There is certainly a strong one-dimensional component to it, as we can expect a 3 dan to defeat a 3 kyu in an even game regardless of their respective strengths and weaknesses. So if there are other dimensions, they are not large. One of the advantages, in my opinion, of the traditional dan/kyu rating system is that the distinct rank steps are large enough to mostly smooth over the other dimensions of playing strength. If a 1 dan plays a 1 kyu, their true strength difference can theoretically be anywhere from 0.001 to 1.999 ranks, which means that their score over 100 games might be anywhere between 50-50 and 80-20 without being considered particularly unusual. Their strengths and weaknesses relative to each other are not likely to push their results outside such a range over the long run.
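As a rough illustration of that 50-50 to 80-20 range, here is a minimal Python sketch. It assumes a logistic curve for win probability, with the scale constant `K` chosen purely so that a gap of two ranks gives 80%. That choice is mine, made to match the figures above; it is not the formula of any actual rating system.

```python
import math

# Assumption: logistic curve scaled so that a 2-rank gap means an 80% win rate.
K = math.log(4) / 2

def win_probability(rank_gap):
    """Chance that the stronger player wins an even game (toy model)."""
    return 1.0 / (1.0 + math.exp(-K * rank_gap))

def expected_score(rank_gap, games=100):
    """Expected (wins, losses) for the stronger player over a series."""
    wins = win_probability(rank_gap) * games
    return wins, games - wins

# The extremes and the middle of the "1 dan vs. 1 kyu" range.
for gap in (0.001, 1.0, 1.999):
    wins, losses = expected_score(gap)
    print(f"gap {gap:5.3f} ranks: about {wins:.0f}-{losses:.0f} over 100 games")
```

Under this assumed curve, two players whose labels both read "one rank apart" can have long-run scores anywhere in that band, which leaves plenty of room for stylistic effects to hide in.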
Players that have a rock-paper-scissors relationship are probably the same rank, or at most one rank apart. I don't think that a player that is two ranks weaker will score a positive result in the long run.
All this, of course, argues for using a traditional approach to handicaps, without using komi to fine-tune it. On the other hand, there is no real harm in fine-tuning. The playing strength of amateur players fluctuates so greatly from game to game that the effect of a few points of komi will likely disappear in the noise anyway.
A rating system can certainly exist without handicaps. Chess rating systems don't depend on data from handicap games. The fact that Go has a serviceable handicap system makes it in some ways harder to design a rating system, because people expect the rating system to be able to suggest a handicap that maximizes the probability of a fair game between two players of different strengths. Even worse, they expect those handicap games to be rated. I’m of the opinion that the rock-paper-scissors (or multi-dimensional skills) problem is not the biggest challenge. In fact, almost no one complains about that phenomenon; it’s mostly just fun to observe when it occurs. Big challenges include things like standardizing rating systems across regions. Another big challenge is adapting to how widely players of the same strength vary in how frequently they play and how fast they improve. (For example, how quickly do you correct an underrated player's rating?)
What tuning the komi-handicap system aims to achieve is giving each player a 50-50 chance of winning a single game. That's interesting if you are designing a rating system for computer Go (especially since computer Go tournaments are frequently played on small boards). With amateur human players, the challenge is not so much setting the proper handicap for a particular rating difference as managing the ratings to begin with. If you have no good way of knowing how accurate the players' ratings are, fine-tuning the handicap formula doesn't help much.
I'm guessing what you will find is that in most contexts, the handicap system is the least of people's worries when it comes to ratings. I've seen many players complain about sandbaggers or delays in their own ratings keeping up to date, but hardly anyone complains about the assigned handicap given a rating difference.
I don't mean to be discouraging, but if you want to study proper handicaps further, you may need to get your data from sources that have fewer rating management problems, and those might be pretty hard to find.
I think Calvin has hit the nail on the head with regard to a problem in the definition of the purpose of the handicap. One might also consider its purpose to be allowing players of different strengths to have interesting and useful games regardless of their strength. To take your example of influence-oriented fighter A and territory-oriented opening-whiz B, you could view the handicap games that play to A's strength as an opportunity for B to get out of her comfort zone and try to develop the fighting area of her game. She may, after all, someday meet a player who favors A's style but lacks A's weakness in the opening, and she'll be grateful to A for showing her how to fight it out without being able to rely on her opening for a lead. And if A is still able to get down to an even handicap, he now has the opportunity to work on his opening in a way he didn't have to in the handicap games. And he'll be grateful to B for that chance when he meets a fellow fighter whose opening hasn't been fine-tuned by regular practice against someone skilled in that area. Now, they may not have found a handicap system that gets them a perfect 50-50 win distribution, but they have had the opportunity to play different kinds of games and work on different axes of their playing strength. And as Calvin suggests, the variation in their performance, not only from chance and varying attention, but also (one always hopes) from learning and improvement, will likely outweigh the fractional imperfections of the handicap. And if they play together enough to really get a good statistical grasp on their exact fine-tuned handicap, they're probably friendly enough to be able to win or lose their share (exactly fair or not) gracefully.
Clearly, sports and games that involve multiple skills are not strictly transitive. Whether this is a matter of style or luck is arguable. We must not underestimate how a short history of results feeds back on itself psychologically, or how much it shapes our perception of the overall distribution.
Let's take three players of roughly equal strength. Their mutual winning percentages will still not be exactly equal. So the results will either be biased toward one of the three (A wins more often against both B and C), or circular. The circular case will occur less often, but as it is counterintuitive it will stand out in collective perception. Because of our thirst for causality, it will find its explanation in aspects of style, while it is most likely merely a manifestation of statistics.
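A quick way to see how often the circle shows up by chance alone is to enumerate the cases. The sketch below makes a deliberately crude assumption: among three equal players, who ends up ahead in each pairing is an independent fair coin flip, with no ties. The names and the model are illustrative only.

```python
from itertools import product

# The three pairings; True means the first-named player leads the pairing.
PAIRS = [("A", "B"), ("B", "C"), ("C", "A")]

def is_cycle(leaders):
    """All True is A>B>C>A; all False is the reverse circle A>C>B>A."""
    return all(leaders) or not any(leaders)

# Every way the three head-to-head records can fall (2 * 2 * 2 = 8).
outcomes = list(product([True, False], repeat=len(PAIRS)))
cyclic = [o for o in outcomes if is_cycle(o)]
biased = [o for o in outcomes if not is_cycle(o)]

print(f"circular: {len(cyclic)} of {len(outcomes)}")
print(f"one player ahead of both others: {len(biased)} of {len(outcomes)}")
```

Under that assumption, a circle turns up in 2 of the 8 equally likely cases without any difference in style at all, so observing one among three evenly matched players is not much evidence of anything.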
The most powerful effect of biased results among equal players is the formation of the idea of a "nemesis". By sheer chance, player A has won the first three games against player B. There is no particular reason why this should happen, except for the fact that it CAN happen. In the fourth game, the mental advantage will inevitably be with A, confirming the narrative that was woven around a series of lucky events. When I speak of a "lucky event" I don't mean an "event full of luck". The luck factor lies in the fact that player A was at his best most of the time when playing B, or that all the games were tight but happened to fall A's way.
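The arithmetic behind "it CAN happen" is short enough to write down. This Python sketch assumes two exactly equal players and independent games, both of which are idealizations; the function names and the club of ten pairings are hypothetical.

```python
def p_sweep(n_games=3, p_win=0.5):
    """Chance that one of the two players wins all of the first n games."""
    return p_win ** n_games + (1 - p_win) ** n_games

def p_any_sweep(n_pairs, n_games=3, p_win=0.5):
    """Chance that at least one of n independent pairings starts with a sweep."""
    return 1 - (1 - p_sweep(n_games, p_win)) ** n_pairs

# One pairing of equal players: a 3-0 start for either side.
print(f"single pairing: {p_sweep():.2f}")

# A small club with ten such pairings: some "nemesis" is almost guaranteed.
print(f"at least one in ten pairings: {p_any_sweep(10):.2f}")
```

So one pairing in four between equal players opens 3-0 for somebody, and in any club with a handful of regular opponents it would be more surprising if no such story existed.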
It is very tempting to believe in "non-transitive styles". I've caught myself doing this in my current passion, table tennis. The fact of the matter is that at the amateur level, differences in style are vastly outweighed by form on the day, mental strength, luck and of course evolving skill. At the professional level of a skill-based game, usually only one style prevails.