Wednesday, September 15, 2010

Neighbors are useful

It's been a while since I worked an actual problem on here, so I thought I'd do one I've been itching to talk about for a few months now. It's the problem of classification when given training examples and a similarity function.

Our problem is that we have a set of classes we would like to assign to items we're given, a set of already-classified examples, and a similarity (or distance) function that can be used on any pair of items. When we are given new items with no labels, we want to be able to classify them based on similarity to our labeled examples, using this similarity function. We solve this same problem intuitively every day with varied results. For instance, we often assign different shapes to pre-defined classes (thin, thick, pointy, smooth, etc.) based on features like number of vertices, angles between edges, length of edges, etc.

What I often find myself doing when the class of an object is ambiguous is searching for a few already-classified objects that are the most similar to it. Since we can think of similarity in this case as distance, we can call these the "neighbors" of the object. If all of these neighbors belong to the same class, it's a pretty fair assumption that the new object should, too. If they don't, the most straightforward option is to go with the class to which the most neighbors belong. Essentially, I'm using the k-nearest neighbors algorithm when I do this. We can outline the procedure for finding the neighbors as:

def find_neighbors(k, training, new_item):
  #initialize the set of neighbors with 
  #  the first k training items
  neighbors = [(dist(x, new_item), x) for x in training[:k]]
  #sort based on distance
  neighbors = sorted(neighbors)
  max_neighbor_dist = neighbors[k-1][0]
  for item in training[k:]:
    d = dist(item, new_item)
    if d < max_neighbor_dist:
      #push the last neighbor out and re-sort
      neighbors[k-1] = (d, item)
      neighbors = sorted(neighbors)
  return neighbors

All we'd need now is to look at the classes of those neighbors returned and make our decision about the most likely class for this new item. But we might want to reconsider our definition of "most likely."

You may recall that I said that taking the majority class was straightforward, but I didn't say that it was the best. In fact, let's consider the case where we had 2 classes, RED and BLUE. Imagine our training set contains 200 items, 101 are labeled RED and 99 are labeled BLUE. We could increase k all the way to 200, our training set size. Seems like a smart idea, since we now don't even need to calculate which are the neighbors and we have more information when making our decision than we would with a lower k. However, this majority class scheme ignores a pretty big bit of information we have about the neighbors: the distance. Why should all neighbors get an equal impact on the classification of this new item? What if it is almost exactly the same as one item and extremely dissimilar from all others? What we really want to do is for that class count to be a weighted sum, based on distance from the new item. It's not totally trivial, since we have to decide on the way we'll weight them and how to normalize those weights to each other, but hopefully the idea makes sense. Now if we increase k to the size of our training set we will not always return the majority class, but rather make an informed decision by using the similarity of the new item to all of our examples.

While k-nearest neighbors is fairly simple and elegant, it's not terribly efficient. Having to calculate distances between each new item and each training item is not cheap. If you ever work with this algorithm, you'll want to consider looking at ways of avoiding all of those calculations. I'll give you a hint to look at data structures that help you partition the space to which the training examples belong.

Saturday, September 11, 2010

How Sofia ended up in Computer Science

I was a very big science geek as a child: reading, watching videos, and trying experiments that looked a lot more like voodoo than actual science. I'm shocked that no one got poisoned or blown up. Biology was the science I explored the most, because I found plants and animals fascinating. There were times where I simultaneously had several ant farms, a few dozen flowers or vegetables growing, toads I'd found outside, and baby birds I was nursing to adulthood. Even into my teens I was primarily interested in life sciences, to the extent that it was originally my major when I started undergrad.

The first experience I remember with computers was a summer science, math and computing class that I can't place a date on, but which included some turtle graphics style programming. Sometime later, I had a handed down IBM PCjr in my bedroom. I have only a vague memory of playing games on the PCjr and being very confused when I tried to work with Basic. I think my mother gave me the machine because of some spelling and math quiz games that were available for it. I couldn't have been older than 8 at the time, so I'm pretty unclear on the details now. I just remember the combination of entertainment and mystery that computers presented.

When I was in about 4th grade, my dad started a computer repair business out of his home and I got to see the inner workings of computers up close (both literally and metaphorically.) I got to use a soldering iron and help debug software and hardware issues with almost no points of reference except what we could gather experimentally. Neither of us had much formal training, so it was a very non-hierarchical collaboration. Dad had done some repair for Nintendo and Sega gaming systems before, making him the de facto hardware tech and leaving me to explore the software side. I spent most of my pre-teen and early teen years engrossed in hacking other people's software. Learning how to modify the rules of the games from the $5 bin at Walmart and trying to find just the right combinations of drivers and settings to get newer operating systems to work on too-old hardware for my family and other people like us who could barely afford to buy their own system, even when it was used and outdated.

I loved the puzzles and the challenges in working with software, but I hated feeling like it all belonged to someone else. The web was the first place I found where I could easily make something mine. HTML was straightforward enough that I was quickly able to look at other people's examples and build my own ideas from it. My website was always a hideous piece of experimental art, trying out some new formatting or even the occasional script. In high school I explored the world of C and C++ with what I considered a reasonable amount of success at the time. Out of recognition for the joy that computing brought me, I signed up for a few CS classes along with my beginning in-major Biology classes (which I had been told were serious science, rather than those silly games and web pages.) As I got into classes, I found that my enjoyment of life sciences was more informal and my love for computing was something much deeper than I'd thought. Near the end of my undergraduate work, I tried summer and part-time jobs in professional programming, but didn't find them to be as rewarding as the more scientific and academic side of things. I stayed for my Master's, during which I got my first chance to TA and absolutely loved working with students. I'm now working on my PhD and enjoy a healthy balance between research and teaching.

I feel like I should also tell the story of how I became (or, rather, discovered myself as) a woman in computer science. Despite being biologically male, I found myself always identifying with females and fitting better into female roles. While I faced some opposition from adults and other kids, I was just fine with being a "sissy" in my younger years. In high school, I found that people had much less tolerance for those who didn't fit the roles assigned to them. I felt that the only way to be safe again was to culminate an image of being angry, anti-social, and unpredictable. It didn't entirely stop the violence, but it made people a little more wary about when and how they did it. I learned quickly that I couldn't depend on adults to stop it or even not to participate in it. This led to me entering college still presenting the image of a disconnected, surly, tough outsider. College was in my home town, so all of the same triggers (and most of the same bullies) were still right there. I got involved with GLBT groups and privately explored many different aspects of my personality, but my emotions on the topic of gender had been so muddled that even I didn't know why I felt that so many things in my life didn't fit me. It wasn't until I moved away for grad school that I really got a handle on what felt most natural to me and why. The resistance of others had stuck with me so firmly that I found myself repeating the mantra "just don't be a girl" when confronting any gender-related conflict. Finally realizing the voice saying this in my head wasn't my own, I came out to myself and those around me. I've been amazed at the level of support from my partner, friends, students, and colleagues, especially since my vision of the most likely outcome involved pitchforks and torches. I've spent the last little while learning what it's like to be a woman in academia, computing, restaurants, airports, and just about everywhere else. It's amazing, and I wouldn't trade it for anything.