Our problem is that we have a set of classes we would like to assign to items we're given, a set of already-classified examples, and a similarity (or distance) function that can be used on any pair of items. When we are given new items with no labels, we want to be able to classify them based on similarity to our labeled examples, using this similarity function. We solve this same problem intuitively every day with varied results. For instance, we often assign different shapes to pre-defined classes (thin, thick, pointy, smooth, etc.) based on features like number of vertices, angles between edges, length of edges, etc.
What I often find myself doing when the class of an object is ambiguous is searching for a few already-classified objects that are the most similar to it. Since we can think of similarity in this case as distance, we can call these the "neighbors" of the object. If all of these neighbors belong to the same class, it's a pretty fair assumption that the new object should, too. If they don't, the most straightforward option is to go with the class to which the most neighbors belong. Essentially, I'm using the k-nearest neighbors algorithm when I do this. We can outline the procedure for finding the neighbors as:
def find_neighbors(k, training, new_item): #initialize the set of neighbors with # the first k training items neighbors = [(dist(x, new_item), x) for x in training[:k]] #sort based on distance neighbors = sorted(neighbors) max_neighbor_dist = neighbors[k-1] for item in training[k:]: d = dist(item, new_item) if d < max_neighbor_dist: #push the last neighbor out and re-sort neighbors[k-1] = (d, item) neighbors = sorted(neighbors) return neighbors
All we'd need now is to look at the classes of those neighbors returned and make our decision about the most likely class for this new item. But we might want to reconsider our definition of "most likely."
You may recall that I said that taking the majority class was straightforward, but I didn't say that it was the best. In fact, let's consider the case where we had 2 classes, RED and BLUE. Imagine our training set contains 200 items, 101 are labeled RED and 99 are labeled BLUE. We could increase k all the way to 200, our training set size. Seems like a smart idea, since we now don't even need to calculate which are the neighbors and we have more information when making our decision than we would with a lower k. However, this majority class scheme ignores a pretty big bit of information we have about the neighbors: the distance. Why should all neighbors get an equal impact on the classification of this new item? What if it is almost exactly the same as one item and extremely dissimilar from all others? What we really want to do is for that class count to be a weighted sum, based on distance from the new item. It's not totally trivial, since we have to decide on the way we'll weight them and how to normalize those weights to each other, but hopefully the idea makes sense. Now if we increase k to the size of our training set we will not always return the majority class, but rather make an informed decision by using the similarity of the new item to all of our examples.
While k-nearest neighbors is fairly simple and elegant, it's not terribly efficient. Having to calculate distances between each new item and each training item is not cheap. If you ever work with this algorithm, you'll want to consider looking at ways of avoiding all of those calculations. I'll give you a hint to look at data structures that help you partition the space to which the training examples belong.