Wednesday, September 10, 2008

In search of...

I'm taking an advanced class in heuristic search algorithms this semester, and I thought I'd show you all a little about some basic state space search algorithms. It's often possible to solve relatively hard problems using a simple breadth first or depth first search (especially if you're not worried about how long it will take.) We'll look at the algorithms in terms of a classic example. For those who are already familiar, we'll be looking at vacuum world, or how to plan the most efficient path for an automated vacuum.

First, some assumptions: we know the locations of the vacuum and all dirt from the start; actions can be a movement (up, down, left, or right) or sucking up dirt; and any action costs 1 step. We will represent a state as the vacuum location, dirt locations, and the sequence of actions that must be taken to get there. Our algorithms will be given a starting state, an expand function which takes a state and returns its children (those states reachable from it by taking one action), and a goal test function to see if we've gotten all the dirt. Without getting too muddled in implementation details, we'll look briefly at what the expand and goal test functions might look like.
#expansion function
def expand(s):
    moves = [move_up, move_down, move_right, move_left, suck]

    #apply every move that is legal in state s
    children = []
    for move in moves:
        if applicable(move, s): children += [move(s)]
    return children

#goal test function
def goal_test(s):
    return len(s[DIRT]) == 0
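
Just to make those snippets concrete, here's one way the pieces might fit together. This is purely a sketch: the DIRT index comes from the goal test above, but VACUUM, PATH, the grid bounds, and the shape of the helper functions are my own choices for illustration.

#one possible state layout (just a sketch, not the only reasonable choice):
#s = (vacuum location, set of dirt locations, list of actions taken so far)
VACUUM, DIRT, PATH = 0, 1, 2
GRID_W, GRID_H = 4, 4   #assume a small fixed grid

def move_up(s):    x, y = s[VACUUM]; return ((x, y-1), s[DIRT], s[PATH] + ["up"])
def move_down(s):  x, y = s[VACUUM]; return ((x, y+1), s[DIRT], s[PATH] + ["down"])
def move_left(s):  x, y = s[VACUUM]; return ((x-1, y), s[DIRT], s[PATH] + ["left"])
def move_right(s): x, y = s[VACUUM]; return ((x+1, y), s[DIRT], s[PATH] + ["right"])
def suck(s):       return (s[VACUUM], s[DIRT] - set([s[VACUUM]]), s[PATH] + ["suck"])

def applicable(move, s):
    x, y = s[VACUUM]
    if move is move_up:    return y > 0
    if move is move_down:  return y < GRID_H - 1
    if move is move_left:  return x > 0
    if move is move_right: return x < GRID_W - 1
    return s[VACUUM] in s[DIRT]   #suck is only worthwhile on a dirty square

#an example start: vacuum at the top-left corner, two dirty squares, empty path
start = ((0, 0), set([(1, 0), (2, 2)]), [])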

Before doing any work on the algorithms, we should consider what a solution to the problem means. What we want is a legal sequence of actions that, when executed, leaves the world free of dirt. An algorithm that is guaranteed to find such a solution whenever one exists is called complete. We also want the solution itself to be cheap, ideally taking the absolute fewest steps possible; algorithms that always return the cheapest possible solution are called admissible.

The first algorithm we'll consider is depth first search, which follows a plan (branch) to its end before exploring any others. For each iteration of the main loop, we pull a node from the start of the open list and add its children to the start of the list.
#a simple depth first search implementation
def dfs(start, expand, goal_test):
    open_list = []
    curr = start
    while not goal_test(curr):
        open_list = expand(curr) + open_list
        if not len(open_list) == 0:
            curr = open_list.pop(0)
        else:
            print "no solution possible!"
            return
    print "goal found:", curr
    return

Let's consider what results a depth first search offers. Does it always give us a solution? Its propensity for always diving deeper means that in some cases it could keep diving forever. This is certainly true in vacuum world where we can move in circles or simply never choose to suck up dirt. We can avoid these sorts of infinite loops by maintaining a list of states we've seen (or closed list) along with the cheapest cost for getting there. If we see the same state again at an equal or greater cost, we ignore it (don't expand it or add it to the open list.) When DFS finds solutions, though, are they as cheap as possible? Again, prospects look bleak. It's quite possible that we'll take a very silly route, because our simple implementation would, for instance, go up as far as possible before trying anything else.
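
Here's a sketch of that fix, building on the hypothetical state layout from earlier (VACUUM, DIRT, PATH) and keying the closed list on the part of the state that actually matters, the vacuum location and the remaining dirt:

#depth first search with a closed list (a sketch, not a definitive implementation)
def dfs_closed(start, expand, goal_test):
    closed = {}                      #world state -> cheapest cost seen so far
    open_list = [start]
    while len(open_list) > 0:
        curr = open_list.pop(0)
        if goal_test(curr):
            print "goal found:", curr
            return
        key = (curr[VACUUM], frozenset(curr[DIRT]))
        cost = len(curr[PATH])       #every action costs 1 step
        if key in closed and closed[key] <= cost:
            continue                 #already saw this world state as cheap or cheaper
        closed[key] = cost
        open_list = expand(curr) + open_list
    print "no solution possible!"
    return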

Next, we'll look at breadth first search, which explores all states at a given depth (with the same number of actions taken so far) before looking further down. For each loop iteration, we pull a node from the start of the open list, but insert its children at the end of the list.
#a simple breadth first search implementation
def bfs(start, expand, goal_test):
    open_list = []
    curr = start
    while not goal_test(curr):
        open_list += expand(curr)
        if not len(open_list) == 0:
            curr = open_list.pop(0)
        else:
            print "no solution possible!"
            return
    print "goal found:", curr
    return

While the implementation seems almost identical, the properties of the two algorithms are very different. Breadth first search is guaranteed to find a solution if one exists, because it doesn't get caught up focusing on one particular (possibly bad) branch. It also finds the cheapest solution in worlds where all actions have the same cost, because it never looks at a deeper node before it has expanded every node above it. We don't need a closed list for correctness, but we could maintain one to avoid duplicating work (moving left and then right, for example, produces the same world state as moving right and then left.)

I'd be remiss if I didn't point out the time and memory complexity of these two algorithms. In the worst case, both can take on the order of b^d time (b is the branching factor, decided here by the number of actions; d is the maximum depth that will ever be explored, which is unbounded for DFS without duplicate detection and the depth of the shallowest solution for BFS.) When it comes to memory, we finally find a redeeming quality for DFS: it needs only about b*d space, since it stores just the current branch and its unexplored siblings, while BFS needs b^d space, because it has to hold an entire depth level of the tree at once.
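
To put some numbers on that: with b = 5 (the four moves plus suck) and a best solution 10 steps deep, b^d is 5^10, nearly 10 million nodes, while b*d is a mere 50.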

Next, we'll discuss algorithms that use cost so far and/or a heuristic function to determine which nodes to expand.

Wednesday, July 9, 2008

Cool site and general update

I've come across an amazing site for those wanting to practice their problem solving. Users can post problems, and then others can post solutions in their style and language of choice. I've done a few in my spare time and I'm debating whether or not to post about them here. I'll at least give a spoiler alert if I do.

The big problems I've been putting actual work into lately have been file I/O for tinypy, MTP hooks for Songbird in Linux and OSX, and some bioinformatics stuff that involves building contigs out of very short sequences. If anyone is interested in working on either of the first two with me, I'd welcome the help. The third I'm getting paid for, so I suppose I should do it myself. I'll try to post about any interesting problems that come up involving those.

Don't be shocked if not too much shows up on here between now and the end of August. My qualifying exams are coming up, so I'm pretty buried in studying materials.

Friday, April 18, 2008

k nearest neighbors speed up

I was working on implementing the k nearest neighbors algorithm in Python and very quickly discovered that my code was slow. Really slow. Unacceptably slow. For a training set of 1000 items with 150 attributes and a k of 4 it took about 30 seconds to classify 1000 testing items on a fast machine. When I tried to run it on real data I actually had a system administrator kill my process because he thought it was in an infinite loop. Obviously, my code needed some work.

After profiling I found that the most time (by far) was being spent on finding the neighbors. Reading in the data initially and deciding on the most likely classification once the neighbors were found were pretty minor, comparatively.

My original implementation was something like the following:

#seed with first k items (kept sorted nearest first)
neighbors = [(distance(x, item), i) for i, x in enumerate(training[:k])]
neighbors.sort(key=operator.itemgetter(0))

#look for closer items, replacing the current farthest neighbor when one turns up
for i, x in enumerate(training[k:]):
    dis = distance(x, item)
    if dis < neighbors[-1][0]:
        neighbors[-1] = (dis, i + k)
        neighbors.sort(key=operator.itemgetter(0))

I computed the distance manually as below:

def distance(x, y):
    d = 0
    for i in range(0, len(x)):
        d += (x[i]-y[i])**2
    #d = sqrt(d) #can be cut for speed
    return d

One of the tricks you'll hear again and again when optimising Python is to let C code (compiled extensions or built-in libraries) do as much work as you can. I set up NumPy and stored each training and testing item as a NumPy array, and out came this:

def distance(x, y):
    d = x-y
    return dot(d, d)   #NumPy's dot, so this assumes "from numpy import dot"
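
One caveat: this only pays off if the items really are NumPy arrays rather than plain Python lists. A sketch of how the data might be loaded (the file name and comma-separated format here are just made up for illustration):

from numpy import array

#turn each line of a (made-up) comma separated file into a NumPy array
training = [array([float(v) for v in line.split(",")])
            for line in open("train.csv")]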

The new version ran about twice as fast as the original, but still not fast enough to really be usable. I toyed with an alternate version that calculated the distance for each training item up front, sorted based on distance, and pulled out the first k items. It had about the same time for smallish training sets, but got worse and worse as the training data got larger. It was becoming obvious to me that I'd need to do some real problem solving.

I spent a while looking at the NumPy documentation and reconsidering the way I thought about the data. I knew there had to be a way to do the calculation in one (hopefully fast) set of operations, rather than all this silly looping and sorting/comparing. Then it occurred to me that what I really had was not a list, but a matrix of training data where each row was an individual item. All I really wanted to do was subtract the testing item from each row, square every entry in the resulting matrix, and sum the rows. The result would be an array of distances between the item and all rows in the matrix.

I did just that. The two major perks to this were that it could be done mostly (if not entirely) in compiled NumPy code and that NumPy had a quick argmin function that would give me the index of the nearest neighbor's row in the training matrix (which would correspond to the index in the list of classifications.) What came out of that was the solution below:

from numpy import sqrt
import sys

#get array of distances between item and all rows of the training matrix
distances = sqrt(((training-item)**2).sum(axis=1))

#get closest item to start with
closest = distances.argmin()
neighbors = (closest, )

#set current closest to a large number and find the next
for i in range(1, k):
    distances[closest] = sys.maxint
    closest = distances.argmin()
    neighbors += (closest, )
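
From there, the classification step I glossed over is just a vote among those indices. A minimal sketch, assuming a parallel list called labels holding each training row's class (that list isn't shown anywhere above):

#tally the labels of the k nearest rows and pick the most common one
votes = {}
for n in neighbors:
    votes[labels[n]] = votes.get(labels[n], 0) + 1
prediction = max(votes, key=votes.get)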

It now runs at twice the speed of the version using the original algorithm with optimized distance function. For those keeping track, that's four times faster than my first attempt. It's definitely not as fast as having written it all in pretty much any compiled language, but it's snappy. I even added back in that slow square root operation because calling it once does it for each item in the final array and there's plenty of time to spare still.

Oddly enough, what I'm doing isn't really all that different from the original design. What made the difference was being open to viewing the data as something other than what I'd initially assumed it to be. It wasn't a list of training items to be manually compared to testing items one at a time, it was a matrix of points in a multidimensional space. I imagine the distances are still calculated in some sort of loop(s), but it isn't in my own slow Python code. It's in the trustworthy hands of the people behind NumPy, because their tool is the best one for the job.

As for further application, I think this is a lesson in search problems in general. Many searches can be optimized using mathematical properties of the data's domain or nifty data structures. For extra credit, check out k-d trees.
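
If you'd rather not build one of those by hand, SciPy ships a k-d tree that does the whole neighbor query in one call. A quick sketch, assuming training is the same matrix and item the same array as above:

from scipy.spatial import cKDTree

#build the tree once from the training matrix, then ask it for the 4 nearest rows
tree = cKDTree(training)
distances, neighbors = tree.query(item, k=4)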

Thursday, January 17, 2008

REALLY basic example

I got my hands on a copy of Short-Cut Math recently and I'm loving it. The author presents methods for simplifying almost any tough problem involving addition, subtraction, multiplication, or division. One example he gives is how multiplying by .25, 2.5, 25, or 250 can be done (sometimes more easily) by dividing by 4 and then multiplying by 1, 10, 100, or 1000, respectively. Obviously this will be faster or slower depending on how easy the number is to divide by 4 and how well you do with division, but it's an option available to speed things up in some cases.
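
For instance, 36 x 25 becomes 36/4 = 9, then 9 x 100 = 900, and 18 x 2.5 becomes 18/4 = 4.5, then 4.5 x 10 = 45.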

He also discusses how multiplying by a number comprised of the same digit (for example, 888) is easier because you only have to take the partial answer from the first digit and adjust by a factor of 10 rather than getting a different partial value for each digit. Likewise, numbers comprised of multiples (like 842 or 168) are easier than others because you get the answer for one digit and multiply it by the appropriate factor for each other digit or set of digits.
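
To see both at work: 23 x 888 starts with 23 x 8 = 184, giving 18400 + 1840 + 184 = 20424, while 23 x 842 builds on 23 x 2 = 46, doubling to 92 for the 4 and again to 184 for the 8, giving 18400 + 920 + 46 = 19366.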

With that background information out of the way, on to the problem I'm solving here. The other day I was trying to subtract some numbers from each other (48-16, 56-28, 132-24, for example) and felt like there was a pattern that might lead to a non-standard method for solving them. In case you didn't catch it right away, the minuend and subtrahend in each case share a common factor (8, 7, 12 being the most immediately apparent to me at the time). The aforementioned multiplication trick led me to think that there might be a trick I could use involving this property. If we examine the answer for each (32, 28, 108), it may be obvious that the answer also shares the factor. What's interesting is that the answer is the result of the following equation: (minuend/factor - subtrahend/factor)*factor. I happen to like this method because division and multiplication come much more naturally to me than subtraction.
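
Taking 132 - 24 with the factor 12 as a worked example: 132/12 = 11 and 24/12 = 2, so the answer is (11 - 2) x 12 = 9 x 12 = 108.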

Of course, in subtraction this might not always be terribly practical to do. It's very nice when one immediately notices that the subtrahend in 56-28 is itself a factor of the minuend, but it's not always that simple. The time it takes to identify the common factor, divide by it, subtract, and multiply the result by the factor may be longer than what it would take to do the subtraction with the original numbers (especially if you have other short-cuts at hand), but can it be generalized? For addition, it seems equally applicable and equally problematic. We still have to divide at least twice and multiply at the end. In multiplication, a similar method can be used with similarly questionable improvement. It would be of the form ((multiplicand/factor)*(multiplier/factor))*factor^2, which certainly does not appear simpler than some of the other short-cuts available. It can, of course, be even more problematic if no factor is shared and we wasted time trying to find one.
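
As a quick check on that form, 16 x 24 with the common factor 8 becomes (2 x 3) x 64 = 6 x 64 = 384, which matches the direct product.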

Now, can we find a group of cases for which this method is especially well-suited? What if we are dealing with a long list of numbers that is known beforehand to share a common factor? One such example might be if you were trying to calculate the amount of memory available in a cluster of computers with different amounts of RAM. We might know ahead of time that their RAM comes in increments of 256MB. Even if you're not geeky enough to be accustomed to thinking quickly in increments of 256, this problem is very well suited to our new method. You can add up the total in smaller numbers and multiply at the end by 256 or just divide by 4 and get the total in GB.
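
For example (with made-up machines), 512MB + 768MB + 1280MB in increments of 256 is just 2 + 3 + 5 = 10, so the cluster has 10 x 256 = 2560MB, or 10/4 = 2.5GB.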

So, we see that this method might be overly complex for small cases or ones in which the factors are previously unknown or foreign to us, but is fairly well-suited to problems with a long list of numbers which all share a factor that is known in advance and perhaps familiar to work with.