Wednesday, September 15, 2010

Neighbors are useful

It's been a while since I worked an actual problem on here, so I thought I'd do one I've been itching to talk about for a few months now. It's the problem of classification when given training examples and a similarity function.

Our problem: we have a set of classes we'd like to assign to items we're given, a set of already-classified examples, and a similarity (or distance) function that can be applied to any pair of items. When we receive new, unlabeled items, we want to classify them based on their similarity to our labeled examples. We solve this same problem intuitively every day, with varied results. For instance, we often assign shapes to pre-defined classes (thin, thick, pointy, smooth, etc.) based on features like the number of vertices, the angles between edges, and the lengths of edges.
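If we imagine our items encoded as equal-length numeric feature vectors (an assumption on my part; your items and similarity function may look very different), the distance function could be as plain as Euclidean distance:

```python
import math

def dist(a, b):
  #Euclidean distance between two equal-length numeric feature vectors
  return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Anything that returns smaller values for more-similar pairs will do; the algorithm below only ever compares distances.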

What I often find myself doing when the class of an object is ambiguous is searching for a few already-classified objects that are the most similar to it. Since we can think of similarity in this case as distance, we can call these the "neighbors" of the object. If all of these neighbors belong to the same class, it's a pretty fair assumption that the new object should, too. If they don't, the most straightforward option is to go with the class to which the most neighbors belong. Essentially, I'm using the k-nearest neighbors algorithm when I do this. We can outline the procedure for finding the neighbors as:

def find_neighbors(k, training, new_item):
  #initialize the set of neighbors with 
  #  the first k training items
  neighbors = [(dist(x, new_item), x) for x in training[:k]]
  #sort based on distance (sorting on the distance alone avoids
  #  comparing the items themselves when distances tie)
  neighbors.sort(key=lambda n: n[0])
  max_neighbor_dist = neighbors[-1][0]
  for item in training[k:]:
    d = dist(item, new_item)
    if d < max_neighbor_dist:
      #push the farthest neighbor out, re-sort,
      #  and update the cutoff distance
      neighbors[-1] = (d, item)
      neighbors.sort(key=lambda n: n[0])
      max_neighbor_dist = neighbors[-1][0]
  return neighbors

All we'd need now is to look at the classes of those neighbors returned and make our decision about the most likely class for this new item. But we might want to reconsider our definition of "most likely."

You may recall that I said taking the majority class was straightforward, but I didn't say it was the best. Consider the case where we have 2 classes, RED and BLUE, and a training set of 200 items: 101 labeled RED and 99 labeled BLUE. We could increase k all the way to 200, our training set size. That seems like a smart idea, since we no longer even need to calculate which items are the neighbors, and we have more information when making our decision than we would with a lower k. But now every new item gets classified RED, no matter what it looks like. This majority scheme ignores a pretty big bit of information we have about the neighbors: the distance. Why should all neighbors have an equal impact on the classification of the new item? What if it is almost exactly the same as one item and extremely dissimilar from all the others? What we really want is for each neighbor's vote to be weighted by its distance from the new item. It's not totally trivial, since we have to decide how to weight them and how to normalize those weights against each other, but hopefully the idea makes sense. Now if we increase k to the size of our training set, we no longer always return the majority class; instead we make an informed decision using the similarity of the new item to all of our examples.
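As a minimal sketch of distance-weighted voting: I'm assuming neighbors arrive as (distance, item) pairs, like find_neighbors returns, and that a label function (hypothetical here) maps an item to its class. The 1/(d + epsilon) weighting is just one of several reasonable choices:

```python
from collections import defaultdict

def weighted_vote(neighbors, label):
  #neighbors: (distance, item) pairs; label(item) is an assumed
  #  helper that returns an item's class
  scores = defaultdict(float)
  for d, item in neighbors:
    #closer neighbors contribute more; the small constant keeps an
    #  exact match (d == 0) from dividing by zero
    scores[label(item)] += 1.0 / (d + 1e-9)
  return max(scores, key=scores.get)
```

With this scheme, one near-identical RED neighbor can outvote a crowd of distant BLUE ones, which is exactly the behavior we wanted.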

While k-nearest neighbors is fairly simple and elegant, it's not terribly efficient. Calculating the distance between each new item and every training item is not cheap. If you ever work with this algorithm, you'll want to look at ways of avoiding all of those calculations. I'll give you a hint: look at data structures that help you partition the space the training examples belong to.
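Since it's hard to resist giving the hint away, here's a rough sketch of one such structure, the k-d tree, which splits the points along alternating axes so that whole regions can be pruned during a nearest-neighbor query. This is a bare-bones illustration for numeric points, not a production implementation:

```python
import math

def build_kdtree(points, depth=0):
  #recursively split the points along alternating axes
  if not points:
    return None
  axis = depth % len(points[0])
  points = sorted(points, key=lambda p: p[axis])
  mid = len(points) // 2
  return {"point": points[mid],
          "left": build_kdtree(points[:mid], depth + 1),
          "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, target, depth=0, best=None):
  #depth-first descent, pruning any subtree that can't possibly
  #  contain a point closer than the best found so far
  if node is None:
    return best
  d = math.sqrt(sum((a - b) ** 2 for a, b in zip(node["point"], target)))
  if best is None or d < best[0]:
    best = (d, node["point"])
  axis = depth % len(target)
  diff = target[axis] - node["point"][axis]
  near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
  best = nearest(near, target, depth + 1, best)
  if abs(diff) < best[0]:
    #the other side of the split could still hold something closer
    best = nearest(far, target, depth + 1, best)
  return best
```

Building the tree costs some time up front, but each query then only touches a fraction of the training set instead of every example.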

Saturday, September 11, 2010

How Sofia ended up in Computer Science

I was a very big science geek as a child: reading, watching videos, and trying experiments that looked a lot more like voodoo than actual science. I'm shocked that no one got poisoned or blown up. Biology was the science I explored the most, because I found plants and animals fascinating. There were times where I simultaneously had several ant farms, a few dozen flowers or vegetables growing, toads I'd found outside, and baby birds I was nursing to adulthood. Even into my teens I was primarily interested in life sciences, to the extent that it was originally my major when I started undergrad.

The first experience I remember with computers was a summer science, math and computing class that I can't place a date on, but which included some turtle graphics style programming. Sometime later, I had a handed down IBM PCjr in my bedroom. I have only a vague memory of playing games on the PCjr and being very confused when I tried to work with Basic. I think my mother gave me the machine because of some spelling and math quiz games that were available for it. I couldn't have been older than 8 at the time, so I'm pretty unclear on the details now. I just remember the combination of entertainment and mystery that computers presented.

When I was in about 4th grade, my dad started a computer repair business out of his home and I got to see the inner workings of computers up close (both literally and metaphorically.) I got to use a soldering iron and help debug software and hardware issues with almost no points of reference except what we could gather experimentally. Neither of us had much formal training, so it was a very non-hierarchical collaboration. Dad had done some repair for Nintendo and Sega gaming systems before, making him the de facto hardware tech and leaving me to explore the software side. I spent most of my pre-teen and early teen years engrossed in hacking other people's software: learning how to modify the rules of the games from the $5 bin at Walmart, and trying to find just the right combinations of drivers and settings to get newer operating systems to work on too-old hardware for my family and other people like us who could barely afford to buy their own system, even when it was used and outdated.

I loved the puzzles and the challenges in working with software, but I hated feeling like it all belonged to someone else. The web was the first place I found where I could easily make something mine. HTML was straightforward enough that I was quickly able to look at other people's examples and build my own ideas from it. My website was always a hideous piece of experimental art, trying out some new formatting or even the occasional script. In high school I explored the world of C and C++ with what I considered a reasonable amount of success at the time. Out of recognition for the joy that computing brought me, I signed up for a few CS classes along with my beginning in-major Biology classes (which I had been told were serious science, rather than those silly games and web pages.) As I got into classes, I found that my enjoyment of life sciences was more informal and my love for computing was something much deeper than I'd thought. Near the end of my undergraduate work, I tried summer and part-time jobs in professional programming, but didn't find them to be as rewarding as the more scientific and academic side of things. I stayed for my Master's, during which I got my first chance to TA and absolutely loved working with students. I'm now working on my PhD and enjoy a healthy balance between research and teaching.

I feel like I should also tell the story of how I became (or, rather, discovered myself as) a woman in computer science. Despite being biologically male, I found myself always identifying with females and fitting better into female roles. While I faced some opposition from adults and other kids, I was just fine with being a "sissy" in my younger years. In high school, I found that people had much less tolerance for those who didn't fit the roles assigned to them. I felt that the only way to be safe again was to cultivate an image of being angry, anti-social, and unpredictable. It didn't entirely stop the violence, but it made people a little more wary about when and how they did it. I learned quickly that I couldn't depend on adults to stop it or even not to participate in it. This led to me entering college still presenting the image of a disconnected, surly, tough outsider. College was in my home town, so all of the same triggers (and most of the same bullies) were still right there. I got involved with GLBT groups and privately explored many different aspects of my personality, but my emotions on the topic of gender had been so muddled that even I didn't know why I felt that so many things in my life didn't fit me. It wasn't until I moved away for grad school that I really got a handle on what felt most natural to me and why. The resistance of others had stuck with me so firmly that I found myself repeating the mantra "just don't be a girl" when confronting any gender-related conflict. Finally realizing the voice saying this in my head wasn't my own, I came out to myself and those around me. I've been amazed at the level of support from my partner, friends, students, and colleagues, especially since my vision of the most likely outcome involved pitchforks and torches. I've spent the last little while learning what it's like to be a woman in academia, computing, restaurants, airports, and just about everywhere else.
It's amazing, and I wouldn't trade it for anything.

Monday, November 30, 2009

Review: Always Innovating's "Touchbook"

While taking notes at conferences and lectures, I have learned two things:

  1. There are never enough outlets to keep the average laptop running all day.

  2. Graphical representations are a pain to record with a mouse.

Most devices I've found which handle these use cases are prohibitively expensive or tied to proprietary software which I can't easily work with or transition to. After much searching and waiting, I found one device that met all my requirements: the Touchbook.

Built on open source software and hardware, the Touchbook offers a touchscreen, modular keyboard/touchpad, and 10 hour battery life. What I mean by modular keyboard is that the main components are contained in the screen/tablet, which can be detached from the keyboard/bottom and used as a standalone device. The battery life is due to both the use of an ARM processor and the dual batteries in both the tablet and the keyboard. This means it has less processing power than Atom devices, but I do all my heavy computation remotely, anyway.

When I say the hardware is open, I mean it both in spirit and practice. In addition to being able to detach the screen, opening the device and accessing the internal components is as easy as the flip of a switch. There are 2 mini-USB ports and 5 full-sized ones (3 internal and 2 external.) The included wifi and bluetooth cards occupy 2 of the internal USB slots, although I yanked the bluetooth card out and just plug it in externally when I need it. In general, the hardware seems to be very good quality, especially considering the price. The only complaint I have with the hardware is that the screen outweighs the base, which gives the device a tendency to tip backward when the screen is at too extreme an angle. There is, however, a counter-balance attachment offered for free, which I haven't had a chance to get yet; it will be included with all future shipments. Because main storage is provided via an SD card reader, I'm told it is simple to switch between multiple operating systems or configurations by swapping out the SD card.

The official operating system, AIOS, is a derivative of Angstrom tweaked to work with the touch interface. Unlike the hardware, it is still in beta and will definitely benefit from further work. Overall, it's a very nice OS with the ipkg installer and most of the configuration tools one might expect. However, it took me some work to get the programs I wanted and to fix a few issues I encountered. While touch and rotation are supported, there are still a few bugs of varying severity. My least favorite involves noise when drawing or clicking, but it doesn't make the device unusable for my purposes. Support has been great, but again, I don't think it would be as straightforward for novices. In evidence of both points, the founder of the company compiled a kernel for me with support for a USB wired ethernet card for which there were no pre-compiled drivers. There has been some work to port other operating systems, such as Ubuntu and Android. In fact, the company has offered a free Touchbook to anyone who can get Android working (along with certain features.) Granted, one would need access to a device to test, but there is an emulator which can be used for most of the work.

Overall, I recommend the Touchbook for people with needs similar to mine. If you want a long-lasting, Linux-friendly, tablet computer, it's the way to go. The hardware is good quality and quite modular. The software, when finished, could be great for users of varying degrees of experience. Right now, it's only for the brave and adept, but the community is good and we'll all work together to make what we need of it. Come join us.

Thursday, April 2, 2009

Partial fix

New development on the banning. It looks like Google has recognized that it was a bit hasty to just pull tethering apps from the Android Market altogether. Here's what they had to say:

"We inadvertently unpublished your application for all mobile providers; if you like, we can restore your app so that all Android Market users outside the T-Mobile US network will have access to your application."

It doesn't really give a definite answer to my original questions, but it hints at a direction. This seems to indicate that they have a mechanism in place to filter Market listings by carrier. I'm still not too keen on Google allowing the Market to be controlled by carriers at all, but at least we might not have to keep track of half a dozen different Terms of Service just to figure out what we can and can't publish. Users may not agree with my mild optimism here, since they'll be the ones with limited access to apps.

Thanks for all the support and discussion of this issue. I like to think that community pressure was what led Google to reconsider this decision. Keep it up and maybe we can free the Market from carrier rules altogether, or at least make sure that the Android OS proper stays open and suits the community's needs.

Monday, March 30, 2009

Banned from the Market... ok.

So, I've had my first real clash with the Open Handset Alliance today. Wifi Tether for Root Users, an app I'm a contributor to, got banned from the Android Market for violating the Developer Distribution Agreement. The reasoning provided is a rather twisted web, as it turns out. According to the agreement:

"Google enters into distribution agreements with device manufacturers and Authorized Carriers to place the Market software client application for the Market on Devices. These distribution agreements may require the involuntary removal of Products in violation of the Device manufacturer’s or Authorized Carrier’s terms of service."

And then, the T-Mobile Terms of Service say the following (as of 11-18-2008):

"Your Data Plan is intended for Web browsing, messaging, and similar activities on your device and not on any other equipment. Unless explicitly permitted by your Data Plan, other uses, including for example, tethering your device to a personal computer or other hardware, are not permitted."

This raises some interesting questions about this "open" platform. Android phones are supposed to be released for other carriers in the future, right? Does this mean that apps in the Market have to adhere to the ToS for only T-Mobile, even when other carriers sign on? Will all apps have to adhere to the ToS for every carrier that supports Android phones? Why is all of this enforcement Google's job, in the first place? If T-Mobile wants to force people to pay for broadband plans in addition to their phone data, it's their job to either make that attractive to users or strongarm them into it by, say, instituting data caps. Playing cop for cell carriers doesn't really seem like the ideal way to establish credibility as a promoter of free software and a strong development community.

Aside from the issue of "authorized carriers," there are some otherwise valid uses of tethering software which users are now being denied. One of the apps banned was for tethering internet over Bluetooth. (We're working on adding it to ours, someday. See below.) With wifi tethering, the internet has to come in from the cell carrier, but Bluetooth tethering allows a user to connect their phone to a wireless router and then share it with a device that has Bluetooth but no wireless card. This use, by definition, can't violate the T-Mobile ToS, since it doesn't require their data plan at all. And that's not even to mention phones which have been set up to use other carriers who allow for tethering.

To add to the irony, one of the folks who helped develop the initial tethering scripts works for Google, I'm told. Another Google employee has forked Wifi Tether, added Bluetooth support to it, says he and his office-mates use it on their commute, and has even given us a patch we can merge in when we get the time. I know they're not any more responsible for this policy than I am, but it just makes me giggle to know that there's an underground presence inside the machine. Hopefully they (and you) can help us push for a really open Android instead of the same greedy corporate power plays we see from other mobile platforms.

Wednesday, September 10, 2008

In search of...

I'm taking an advanced class in heuristic search algorithms this semester, and I thought I'd show you all a little about some basic state space search algorithms. It's often possible to solve relatively hard problems using a simple breadth first or depth first search (especially if you're not worried about how long it will take.) We'll look at the algorithms in terms of a classic example. For those who are already familiar, we'll be looking at vacuum world, or how to plan the most efficient path for an automated vacuum.

First, some assumptions: we know the locations of the vacuum and all dirt from the start; actions can be a movement (up, down, left, or right) or sucking up dirt; and any action costs 1 step. We will represent a state as the vacuum location, dirt locations, and the sequence of actions that must be taken to get there. Our algorithms will be given a starting state, an expand function which takes a state and returns its children (those states reachable from it by taking one action), and a goal test function to see if we've gotten all the dirt. Without getting too muddled in implementation details, we'll look briefly at what the expand and goal test functions might look like.
#expansion function
def expand(s):
  #applicable and the move functions are assumed helpers:
  #  applicable(move, s) checks whether an action is legal in state s,
  #  and move(s) returns the resulting child state
  moves = [move_up, move_down, move_right, move_left, suck]
  children = []
  for move in moves:
    if applicable(move, s):
      children.append(move(s))
  return children

#goal test function
def goal_test(s):
  return len(s[DIRT]) == 0

Before doing any work on the algorithms, we should consider what a solution to the problem means. What we want is a legal sequence of actions that, when executed, leaves the world free of dirt. Simply being able to find such a solution makes an algorithm complete. We also want it to be efficient, meaning that ideally it should take the absolute fewest steps possible. We call algorithms that give the cheapest possible solution admissible.

The first algorithm we'll consider is depth first search, which follows a plan (branch) to its end before exploring any others. For each iteration of the main loop, we pull a node from the start of the open list and add its children to the start of the list.
#a simple depth first search implementation
def dfs(start, expand, goal_test):
  open_list = []
  curr = start
  while not goal_test(curr):
    open_list = expand(curr) + open_list
    if len(open_list) == 0:
      print "no solution possible!"
      return
    curr = open_list.pop(0)
  print "goal found:", curr

Let's consider what results a depth first search offers. Does it always give us a solution? Its propensity for always diving deeper means that in some cases it could keep diving forever. This is certainly true in vacuum world where we can move in circles or simply never choose to suck up dirt. We can avoid these sorts of infinite loops by maintaining a list of states we've seen (or closed list) along with the cheapest cost for getting there. If we see the same state again at an equal or greater cost, we ignore it (don't expand it or add it to the open list.) When DFS finds solutions, though, are they as cheap as possible? Again, prospects look bleak. It's quite possible that we'll take a very silly route, because our simple implementation would, for instance, go up as far as possible before trying anything else.
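As a sketch of that fix, here's a version of DFS that keeps a dictionary recording the cheapest cost at which each state has been seen, treating every action as costing 1 and assuming states are hashable. It's a generic graph-search skeleton rather than the vacuum-specific code above:

```python
def dfs_closed(start, expand, goal_test):
  #open_list holds (state, cost-so-far) pairs; best_cost is the
  #  closed list, mapping each seen state to its cheapest known cost
  open_list = [(start, 0)]
  best_cost = {}
  while open_list:
    curr, cost = open_list.pop(0)
    if goal_test(curr):
      return curr
    if curr in best_cost and best_cost[curr] <= cost:
      continue  #already seen this state at least as cheaply
    best_cost[curr] = cost
    open_list = [(c, cost + 1) for c in expand(curr)] + open_list
  return None  #no solution possible
```

On a toy graph with a cycle, plain DFS as written above could bounce between two states forever; this version revisits nothing it has already reached at least as cheaply, so it always terminates on a finite state space.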

Next, we'll look at breadth first search, which explores all states at a given depth (with the same number of actions taken so far) before looking further down. For each loop iteration, we pull a node from the start of the open list, but insert its children at the end of the list.
#a simple breadth first search implementation
def bfs(start, expand, goal_test):
  open_list = []
  curr = start
  while not goal_test(curr):
    open_list += expand(curr)
    if len(open_list) == 0:
      print "no solution possible!"
      return
    curr = open_list.pop(0)
  print "goal found:", curr

While the implementation seems almost identical, the properties of the two algorithms are very different. Breadth first search is guaranteed to give a solution, if one exists, because it doesn't get caught up focusing on one particular (possibly bad) branch. It also finds the cheapest solution for worlds in which all actions have the same cost, because it will never look at a lower node before it has expanded all nodes above it. We don't need a closed list, but we could maintain one to avoid duplicating work (like how left, right produces the same state as right, left.)

I'd be remiss if I didn't point out the memory and processing complexity of these two algorithms. In the worst case, both can take b^d time (b is the branching factor, decided here by the number of actions; d is the maximum depth that will ever be explored, infinity for DFS and the depth of the best solution for BFS.) When it comes to memory, we finally find a redeeming quality for DFS because it takes b*d space (since it only stores one branch at a time in memory) and BFS takes b^d space (because it stores an entire depth level in memory.)

Next, we'll discuss algorithms that use cost so far and/or a heuristic function to determine which nodes to expand.

Wednesday, July 9, 2008

Cool site and general update

I've come across an amazing site for those wanting to practice their problem solving. Users can post problems, and then others can post solutions in their style and language of choice. I've done a few in my spare time and I'm debating whether or not to post about them here. I'll at least give a spoiler alert if I do.

The big problems I've been putting actual work into lately have been file I/O for tinypy, MTP hooks for Songbird in Linux and OSX, and some bioinformatics stuff that involves building contigs out of very short sequences. If anyone is interested in working on either of the first two with me, I'd welcome the help. The third I'm getting paid for, so I suppose I should do it myself. I'll try to post about any interesting problems that come up involving those.

Don't be shocked if not too much shows up on here between now and the end of August. My qualifying exams are coming up, so I'm pretty buried in studying materials.