New development on the banning. It looks like Google has recognized that it was a bit hasty to just pull tethering apps from the Android Market altogether. Here's what they had to say:
"We inadvertently unpublished your application for all mobile providers; if you like, we can restore your app so that all Android Market users outside the T-Mobile US network will have access to your application."
It doesn't really give a definite answer to my original questions, but it hints at a direction. This seems to indicate that they have a mechanism in place to filter Market listings by carrier. I'm still not too keen on Google allowing the Market to be controlled by carriers at all, but at least we might not have to keep track of half a dozen different Terms of Service just to figure out what we can and can't publish. Users may not agree with my mild optimism here, since they'll be the ones with limited access to apps.
Thanks for all the support and discussion of this issue. I like to think that community pressure was what led Google to reconsider this decision. Keep it up and maybe we can free the Market from carrier rules altogether, or at least make sure that the Android OS proper stays open and suits the community's needs.
Thursday, April 2, 2009
Monday, March 30, 2009
Banned from the Market... ok.
So, I've had my first real clash with the Open Handset Allience today. Wifi Tether for Root Users, an app I'm a contributor for, got banned from the Android Market for violating the Developer Distribution Agreement. The reasoning provided is a rather twisted web, as it turns out. According to the agreement:
"Google enters into distribution agreements with device manufacturers and Authorized Carriers to place the Market software client application for the Market on Devices. These distribution agreements may require the involuntary removal of Products in violation of the Device manufacturer’s or Authorized Carrier’s terms of service."
And then, the T-Mobile Terms of Service say the following (as of 11-18-2008):
"Your Data Plan is intended for Web browsing, messaging, and similar activities on your device and not on any other equipment. Unless explicitly permitted by your Data Plan, other uses, including for example, tethering your device to a personal computer or other hardware, are not permitted."
This raises some interesting questions about this "open" platform. Android phones are supposed to be released for other carriers in the future, right? Does this mean that apps in the Market have to adhere to the ToS for only T-Mobile, even when other carriers sign on? Will all apps have to adhere to the ToS for every carrier that supports Android phones? Why is all of this enforcement Google's job, in the first place? If T-Mobile wants to force people to pay for broadband plans in addition to their phone data, it's their job to either make that attractive to users or strongarm them into it by, say, instituting data caps. Playing cop for cell carriers doesn't really seem like the ideal way to establish credibility as a promoter of free software and a strong development community.
Aside from the issue of "authorized carriers," there are some otherwise valid uses of tethering software which users are now being denied. One of the apps banned was for tethering internet over Bluetooth. (We're working on adding it to ours, someday. See below.) With wifi tethering, the internet has to come in from the cell carrier, but Bluetooth tethering allows a user to connect their phone to a wireless router and then share it with a device that has Bluetooth but no wireless card. This use, by definition, can't violate the T-Mobile ToS, since it doesn't require their data plan at all. And that's not even to mention phones which have been set up to use other carriers who allow for tethering.
To add to the irony, one of the folks who helped develop the initial tethering scripts works for Google, I'm told. Another Google employee has forked Wifi Tether, added Bluetooth support to it, says he and his office-mates use it on their commute, and has even given us a patch we can merge in when we get the time. I know they're not any more responsible for this policy than I am, but it just makes me giggle to know that there's an underground presence inside the machine. Hopefully they (and you) can help us push for a really open Android instead of the same greedy corporate power plays we see from other mobile platforms.
"Google enters into distribution agreements with device manufacturers and Authorized Carriers to place the Market software client application for the Market on Devices. These distribution agreements may require the involuntary removal of Products in violation of the Device manufacturer’s or Authorized Carrier’s terms of service."
And then, the T-Mobile Terms of Service say the following (as of 11-18-2008):
"Your Data Plan is intended for Web browsing, messaging, and similar activities on your device and not on any other equipment. Unless explicitly permitted by your Data Plan, other uses, including for example, tethering your device to a personal computer or other hardware, are not permitted."
This raises some interesting questions about this "open" platform. Android phones are supposed to be released for other carriers in the future, right? Does this mean that apps in the Market have to adhere to the ToS for only T-Mobile, even when other carriers sign on? Will all apps have to adhere to the ToS for every carrier that supports Android phones? Why is all of this enforcement Google's job, in the first place? If T-Mobile wants to force people to pay for broadband plans in addition to their phone data, it's their job to either make that attractive to users or strongarm them into it by, say, instituting data caps. Playing cop for cell carriers doesn't really seem like the ideal way to establish credibility as a promoter of free software and a strong development community.
Aside from the issue of "authorized carriers," there are some otherwise valid uses of tethering software which users are now being denied. One of the apps banned was for tethering internet over Bluetooth. (We're working on adding it to ours, someday. See below.) With wifi tethering, the internet has to come in from the cell carrier, but Bluetooth tethering allows a user to connect their phone to a wireless router and then share it with a device that has Bluetooth but no wireless card. This use, by definition, can't violate the T-Mobile ToS, since it doesn't require their data plan at all. And that's not even to mention phones which have been set up to use other carriers who allow for tethering.
To add to the irony, one of the folks who helped develop the initial tethering scripts works for Google, I'm told. Another Google employee has forked Wifi Tether, added Bluetooth support to it, says he and his office-mates use it on their commute, and has even given us a patch we can merge in when we get the time. I know they're not any more responsible for this policy than I am, but it just makes me giggle to know that there's an underground presence inside the machine. Hopefully they (and you) can help us push for a really open Android instead of the same greedy corporate power plays we see from other mobile platforms.
Wednesday, September 10, 2008
In search of...
I'm taking an advanced class in heuristic search algorithms this semester, and I thought I'd show you all a little about some basic state space search algorithms. It's often possible to solve relatively hard problems using a simple breadth first or depth first search (especially if you're not worried about how long it will take.) We'll look at the algorithms in terms of a classic example. For those who are already familiar, we'll be looking at vacuum world, or how to plan the most efficient path for an automated vacuum.
First, some assumptions: we know the locations of the vacuum and all dirt from the start; actions can be a movement (up, down, left, or right) or sucking up dirt; and any action costs 1 step. We will represent a state as the vacuum location, dirt locations, and the sequence of actions that must be taken to get there. Our algorithms will be given a starting state, an expand function which takes a state and returns its children, and a goal test function. Without getting too muddled in implementation details, we'll look briefly at what the expand and goal test functions might look like.
Before doing any work on the algorithms, we should consider what a solution to the problem means. What we want is a legal sequence of actions that, when executed, leaves the world free of dirt. Simply being able to find such a solution makes an algorithm complete. We also want it to be efficient, meaning that ideally it should take the absolute fewest steps possible. We call algorithms that give the cheapest possible solution admissible.
The first algorithm we'll consider is depth first search, which follows a branch to its end before exploring any others. For each iteration of the main loop, we pull a node from the start of the open list and add its children to the start of the list.
Let's consider what results a depth first search offers. Does it always give us a solution? Its propensity for always diving deeper means that in some cases it could keep diving forever. This is certainly true in vacuum world where we can move in circles or simply never choose to suck up dirt. We can avoid these sorts of infinite loops by maintaining a list of states we've seen (or closed list) along with the cheapest cost for getting there. If we see the same state again at an equal or greater cost, we ignore it (don't expand it or add it to the open list.) When DFS finds solutions, though, are they as cheap as possible? Again, prospects look bleak. It's quite possible that we'll take a very silly route, because our simple implementation would, for instance, go up as far as possible before trying anything else.
Next, we'll look at breadth first search, which explores all states at a given depth before looking further down. For each loop iteration, we pull a node from the start of the open list, but insert its children at the end of the list.
While the implementation seems almost identical, the properties of the two algorithms are very different. Breadth first search is guaranteed to give a solution, if one exists, because it doesn't get caught up focusing on one particular (possibly bad) branch. It also finds the cheapest solution for worlds in which all actions have the same cost, because it will never look at a lower node before it has expanded all nodes above it. We don't need a closed list, but we could maintain one to avoid duplicating work (like how left, right produces the same state as right, left.)
I'd be remiss if I didn't point out the memory and processing complexity of these two algorithms. In the worst case, both can take b^d time (b is the branching factor, decided here by the number of actions; d is the maximum depth that will ever be explored, infinity for DFS and the depth of the best solution for BFS.) When it comes to memory, we finally find a redeeming quality for DFS because it takes b*d space (since it only stores one branch at a time in memory) and BFS takes b^d space (because it stores an entire depth level in memory.)
Next, we'll discuss algorithms that use cost so far and/or a heuristic function to determine which nodes to expand.
First, some assumptions: we know the locations of the vacuum and all dirt from the start; actions can be a movement (up, down, left, or right) or sucking up dirt; and any action costs 1 step. We will represent a state as the vacuum location, dirt locations, and the sequence of actions that must be taken to get there. Our algorithms will be given a starting state, an expand function which takes a state and returns its children, and a goal test function. Without getting too muddled in implementation details, we'll look briefly at what the expand and goal test functions might look like.
#expansion function
def expand(s):
moves = [move_up, move_down, move_right, move_left, suck]
children = []
for move in moves:
if applicable(move): children += [move(s)]
return children
#goal test function
def goal_test(s):
return len(s[DIRT]) == 0
Before doing any work on the algorithms, we should consider what a solution to the problem means. What we want is a legal sequence of actions that, when executed, leaves the world free of dirt. Simply being able to find such a solution makes an algorithm complete. We also want it to be efficient, meaning that ideally it should take the absolute fewest steps possible. We call algorithms that give the cheapest possible solution admissible.
The first algorithm we'll consider is depth first search, which follows a branch to its end before exploring any others. For each iteration of the main loop, we pull a node from the start of the open list and add its children to the start of the list.
#a simple depth first search implementation
def dfs(start, expand, goal_test):
open_list = []
curr = start
while not goal_test(curr):
open_list = expand(curr) + open_list
if not len(open_list) == 0:
curr = open_list.pop(0)
else:
print "no solution possible!"
return
print "goal found:", curr
return
Let's consider what results a depth first search offers. Does it always give us a solution? Its propensity for always diving deeper means that in some cases it could keep diving forever. This is certainly true in vacuum world where we can move in circles or simply never choose to suck up dirt. We can avoid these sorts of infinite loops by maintaining a list of states we've seen (or closed list) along with the cheapest cost for getting there. If we see the same state again at an equal or greater cost, we ignore it (don't expand it or add it to the open list.) When DFS finds solutions, though, are they as cheap as possible? Again, prospects look bleak. It's quite possible that we'll take a very silly route, because our simple implementation would, for instance, go up as far as possible before trying anything else.
Next, we'll look at breadth first search, which explores all states at a given depth before looking further down. For each loop iteration, we pull a node from the start of the open list, but insert its children at the end of the list.
#a simple breadth first search implementation
def bfs(start, expand, goal_test):
open_list = []
curr = start
while not goal_test(curr):
open_list += expand(curr)
if not len(open_list) == 0:
curr = open_list.pop(0)
else:
print "no solution possible!"
return
print "goal found:", curr
return
While the implementation seems almost identical, the properties of the two algorithms are very different. Breadth first search is guaranteed to give a solution, if one exists, because it doesn't get caught up focusing on one particular (possibly bad) branch. It also finds the cheapest solution for worlds in which all actions have the same cost, because it will never look at a lower node before it has expanded all nodes above it. We don't need a closed list, but we could maintain one to avoid duplicating work (like how left, right produces the same state as right, left.)
I'd be remiss if I didn't point out the memory and processing complexity of these two algorithms. In the worst case, both can take b^d time (b is the branching factor, decided here by the number of actions; d is the maximum depth that will ever be explored, infinity for DFS and the depth of the best solution for BFS.) When it comes to memory, we finally find a redeeming quality for DFS because it takes b*d space (since it only stores one branch at a time in memory) and BFS takes b^d space (because it stores an entire depth level in memory.)
Next, we'll discuss algorithms that use cost so far and/or a heuristic function to determine which nodes to expand.
Wednesday, July 9, 2008
Cool site and general update
I've come across an amazing site for those wanting to practice their problem solving. Users can post problems, and then others can post solutions in their style and language of choice. I've done a few in my spare time and I'm debating whether or not to post about them here. I'll at least give a spoiler alert if I do.
The big problems I've been putting actual work into lately have been file I/O for tinypy, MTP hooks for Songbird in Linux and OSX, and some bioinformatics stuff that involves building contigs out of very short sequences. If anyone is interested in working on either of the first two with me, I'd welcome the help. The third I'm getting paid for, so I suppose I should do it myself. I'll try to post about any interesting problems that come up involving those.
Don't be shocked if not too much shows up on here between now and the end of August. My qualifying exams are coming up, so I'm pretty buried in studying materials.
The big problems I've been putting actual work into lately have been file I/O for tinypy, MTP hooks for Songbird in Linux and OSX, and some bioinformatics stuff that involves building contigs out of very short sequences. If anyone is interested in working on either of the first two with me, I'd welcome the help. The third I'm getting paid for, so I suppose I should do it myself. I'll try to post about any interesting problems that come up involving those.
Don't be shocked if not too much shows up on here between now and the end of August. My qualifying exams are coming up, so I'm pretty buried in studying materials.
Friday, April 18, 2008
k nearest neighbors speed up
I was working on implementing the k nearest neighbors algorithm in Python and very quickly discovered that my code was slow. Really slow. Unacceptably slow. For a training set of 1000 items with 150 attributes and a k of 4 it took about 30 seconds to classify 1000 testing items on a fast machine. When I tried to run it on real data I actually had a system administrator kill my process because he thought it was in an infinite loop. Obviously, my code needed some work.
After profiling I found that the most time (by far) was being spent on finding the neighbors. Reading in the data initially and deciding on the most likely classification once the neighbors were found were pretty minor, comparatively.
My original implementation was something like the following:
I computed the distance manually as below:
One of the tricks you'll hear again and again when optimising python is to let C code (compiled or built in libraries) do as much work as you can. I set up NumPy and stored each training and testing item as a NumPy array, and out came this:
The new version ran about twice as fast as the original, but still not fast enough to really be usable. I toyed with an alternate version that calculated the distance for each training item up front, sorted based on distance, and pulled out the first k items. It had about the same time for smallish training sets, but got worse and worse as the training data got larger. It was becoming obvious to me that I'd need to do some real problem solving.
I spent a while looking at the NumPy documentation and reconsidering the way I thought about the data. I knew there had to be a way to do the calculation in one (hopefully fast) set of operations, rather than all this silly looping and sorting/comparing. Then it occurred to me that what I really had was not a list, but a matrix of training data where each row was an individual item. All I really wanted to do was subtract the testing item from each row, square every entry in the resulting matrix, and sum the rows. The result would be an array of distances between the item and all rows in the matrix.
I did just that. The two major perks to this were that it could be done mostly (if not entirely) in compiled NumPy code and that NumPy had a quick argmax function that would give me the index of the nearest neighbor's row in the training matrix (which would correspond to the index in the list of classifications.) What came out of that was the solution below:
It now runs at twice the speed of the version using the original algorithm with optimized distance function. For those keeping track, that's four times faster than my first attempt. It's definitely not as fast as having written it all in pretty much any compiled language, but it's snappy. I even added back in that slow square root operation because calling it once does it for each item in the final array and there's plenty of time to spare still.
Oddly enough, what I'm doing isn't really all that different from the original design. What made the difference was being open to viewing the data as something other than what I'd initially assumed it to be. It wasn't a list of training items to be manually compared to testing items one at a time, it was a matrix of points in a multidimensional space. I imagine the distances are still calculated in some sort of loop(s), but it isn't in my own slow Python code. It's in the trustworthy hands of the people behind NumPy, because their tool is the best one for the job.
As for further application, I think this is a lesson in search problems in general. Many searches can be optimized using mathematical properties of the data's domain or nifty data structures. For extra credit, check out kd Trees.
After profiling I found that the most time (by far) was being spent on finding the neighbors. Reading in the data initially and deciding on the most likely classification once the neighbors were found were pretty minor, comparatively.
My original implementation was something like the following:
#seed with first k items
neighbors = [(distance(x, item), i) for i, x in enumerate(training[:k])]
neighbors.sort(key=operator.itemgetter(0))
#look for closer items
for i, x in enumerate(training[k:]):
dis = distance(x, item)
if dis < neighbors[0][0]:
neighbors[0] = (dis, i)
neighbors.sort(key=operator.itemgetter(0))
I computed the distance manually as below:
def distance(x, y):
d = 0
for i in range(0, len(x)):
d += (x[i]-y[i])**2
#d = sqrt(d) #can be cut for speed
return d
One of the tricks you'll hear again and again when optimising python is to let C code (compiled or built in libraries) do as much work as you can. I set up NumPy and stored each training and testing item as a NumPy array, and out came this:
def distance(x, y):
d = x-y
return dot(d, d)
The new version ran about twice as fast as the original, but still not fast enough to really be usable. I toyed with an alternate version that calculated the distance for each training item up front, sorted based on distance, and pulled out the first k items. It had about the same time for smallish training sets, but got worse and worse as the training data got larger. It was becoming obvious to me that I'd need to do some real problem solving.
I spent a while looking at the NumPy documentation and reconsidering the way I thought about the data. I knew there had to be a way to do the calculation in one (hopefully fast) set of operations, rather than all this silly looping and sorting/comparing. Then it occurred to me that what I really had was not a list, but a matrix of training data where each row was an individual item. All I really wanted to do was subtract the testing item from each row, square every entry in the resulting matrix, and sum the rows. The result would be an array of distances between the item and all rows in the matrix.
I did just that. The two major perks to this were that it could be done mostly (if not entirely) in compiled NumPy code and that NumPy had a quick argmax function that would give me the index of the nearest neighbor's row in the training matrix (which would correspond to the index in the list of classifications.) What came out of that was the solution below:
#get array of distances between item and all others
distances = sqrt(((training-item)**2).sum(axis=1))
#get closest item to start with
closest = distances.argmin()
neighbors = (closest, )
#set current closest to large number and find next
for i in range(1, k):
distances[closest] = sys.maxint
closest = distances.argmin()
neighbors += (closest, )
It now runs at twice the speed of the version using the original algorithm with optimized distance function. For those keeping track, that's four times faster than my first attempt. It's definitely not as fast as having written it all in pretty much any compiled language, but it's snappy. I even added back in that slow square root operation because calling it once does it for each item in the final array and there's plenty of time to spare still.
Oddly enough, what I'm doing isn't really all that different from the original design. What made the difference was being open to viewing the data as something other than what I'd initially assumed it to be. It wasn't a list of training items to be manually compared to testing items one at a time, it was a matrix of points in a multidimensional space. I imagine the distances are still calculated in some sort of loop(s), but it isn't in my own slow Python code. It's in the trustworthy hands of the people behind NumPy, because their tool is the best one for the job.
As for further application, I think this is a lesson in search problems in general. Many searches can be optimized using mathematical properties of the data's domain or nifty data structures. For extra credit, check out kd Trees.
Thursday, January 17, 2008
REALLY basic example
I got my hands on a copy of Short-Cut Math recently and I'm loving it. The author presents methods for simplifying almost any tough problem involving addition, subtraction, multiplication, or division. One example he gives is how multiplying by .25, 2.5, 25, or 250 can be done (sometimes more easily) by dividing by 4 and then multiplying by 1, 10, 100, or 1000, respectively. Obviously this will be faster or slower depending on how easy the number is to divide by 4 and how well you do with division, but it's an option available to speed things up in some cases.
He also discusses how multiplying by a number comprised of the same digit (for example, 888) is easier because you only have to take the partial answer from the first digit and adjust by a factor of 10 rather than getting a different partial value for each digit. Likewise, numbers comprised of multiples (like 842 or 168) are easier than others because you get the answer for one digit and multiply it by the appropriate factor for each other digit or set of digits.
With that background information out of the way, on to the problem I'm solving here. The other day I was trying to subtract some numbers from each other (48-16, 56-28, 132-24, for example) and felt like there was a pattern that might lead to a non-standard method for solving them. In case you didn't catch it right away, the minuend and subtrahend in each case share a common factor (8, 7, 12 being the most immediately apparent to me at the time). The aforementioned multiplication trick led me to think that there might be a trick I could use involving this property. If we examine the answer for each (32, 28, 108), it may be obvious that the answer also shares the factor. What's interesting is that the answer is the result of the following equation: (minuend/factor - subtrahend/factor)*factor. I happen to like this method because division and multiplication come much more naturally to me than subtraction.
Of course, in subtraction this might not always be terribly practical to do. It's very nice when one immediately notices that the subtrahend in 56-28 is itself a factor of the minuend, but it's not always that simple. The time it takes to identify the common factor, divide by it, subtract, and multiply the result by the factor may be longer than what it would take to do the subtraction with the original numbers (especially if you have other short-cuts at hand), but can it be generalized? For addition, it seems equally applicable and equally problematic. We still have to divide at least twice and multiply at the end. In multiplication, a similar method can be used with similarly questionable improvement. It would be of the form ((multiplicand/factor)*(multiplier/factor))*factor^2, which certainly does not appear simpler than some of the other short-cuts available. It can, of course, be even more problematic if no factor is shared and we wasted time trying to find one.
Now, can we find a group of cases for which this method is especially well-suited? What if we are dealing with a long list of numbers that is known beforehand to share a common factor? One such example might be if you were trying to calculate the amount of memory available in a cluster of computers with different amounts of RAM. We might know ahead of time that their RAM comes in increments of 256MB. Even if you're not geeky enough to be accustomed to thinking quickly in increments of 256, this problem is very well suited to our new method. You can add up the total in smaller numbers and multiply at the end by 256 or just divide by 4 and get the total in GB.
So, we see that this method might be overly complex for small cases or ones in which the factors are previously unknown or foreign to us, but is fairly well-suited to problems with a long list of numbers which all share a factor that is known in advance and perhaps familiar to work with.
He also discusses how multiplying by a number comprised of the same digit (for example, 888) is easier because you only have to take the partial answer from the first digit and adjust by a factor of 10 rather than getting a different partial value for each digit. Likewise, numbers comprised of multiples (like 842 or 168) are easier than others because you get the answer for one digit and multiply it by the appropriate factor for each other digit or set of digits.
With that background information out of the way, on to the problem I'm solving here. The other day I was trying to subtract some numbers from each other (48-16, 56-28, 132-24, for example) and felt like there was a pattern that might lead to a non-standard method for solving them. In case you didn't catch it right away, the minuend and subtrahend in each case share a common factor (8, 7, 12 being the most immediately apparent to me at the time). The aforementioned multiplication trick led me to think that there might be a trick I could use involving this property. If we examine the answer for each (32, 28, 108), it may be obvious that the answer also shares the factor. What's interesting is that the answer is the result of the following equation: (minuend/factor - subtrahend/factor)*factor. I happen to like this method because division and multiplication come much more naturally to me than subtraction.
Of course, in subtraction this might not always be terribly practical to do. It's very nice when one immediately notices that the subtrahend in 56-28 is itself a factor of the minuend, but it's not always that simple. The time it takes to identify the common factor, divide by it, subtract, and multiply the result by the factor may be longer than what it would take to do the subtraction with the original numbers (especially if you have other short-cuts at hand), but can it be generalized? For addition, it seems equally applicable and equally problematic. We still have to divide at least twice and multiply at the end. In multiplication, a similar method can be used with similarly questionable improvement. It would be of the form ((multiplicand/factor)*(multiplier/factor))*factor^2, which certainly does not appear simpler than some of the other short-cuts available. It can, of course, be even more problematic if no factor is shared and we wasted time trying to find one.
Now, can we find a group of cases for which this method is especially well-suited? What if we are dealing with a long list of numbers that is known beforehand to share a common factor? One such example might be if you were trying to calculate the amount of memory available in a cluster of computers with different amounts of RAM. We might know ahead of time that their RAM comes in increments of 256MB. Even if you're not geeky enough to be accustomed to thinking quickly in increments of 256, this problem is very well suited to our new method. You can add up the total in smaller numbers and multiply at the end by 256 or just divide by 4 and get the total in GB.
So, we see that this method might be overly complex for small cases or ones in which the factors are previously unknown or foreign to us, but is fairly well-suited to problems with a long list of numbers which all share a factor that is known in advance and perhaps familiar to work with.
Wednesday, October 31, 2007
Hello, world!
I'm a PhD student in Computer Science at the University of New Hampshire. I created this blog to encourage myself to work on my problem solving skills, particularly in Geometry, Linear Algebra, Algorithms, Formal Logic, and Data Structures. However, topics may also include things as far removed from the above as composting, time management, home energy efficiency, and permaculture. I'll be using a method much like George Polya's in abstracting the problem, noting similar problems with known solutions, creating and implementing a planned solution, and generalizing my solution. The goal is at least 1 problem per week. Please feel free to suggest problems or alternate solutions.
Subscribe to:
Posts (Atom)