
Perceptrons: Intro to Machine Learning

By llimllib in Technology, Wed Nov 12, 2003 at 01:41:24 PM EST. Tags: Science

Imagine that you have apples and bananas. Your apples and bananas are in one pile, but you want them to be in separate piles. You don't want to pay a worker to separate the fruits, so you're building a machine to do it. This machine has only two pieces of information about each fruit: its size and how yellow it is. What it needs is a function to discriminate between apples and bananas, so that it can sort them. One way to find a function to discriminate between two classes of things is to use a perceptron. A perceptron is a simple iterative algorithm for finding a discriminant function; in other words, it can find a function to separate our apples and bananas. Although perceptrons can only separate two classes of data, and have some other limitations, they are still an interesting introduction to learning techniques.

Introduction

Let's imagine that you've charted your data on apples and bananas. Fortunately, the two groups form neat clusters, so all we need to do is draw a line between them. The condition of being able to draw a line between two groups is called "linear separability".

Since we want to find a line between the two groups, we'll have to introduce some math. Mainly I'll use some simple algebra and geometry. Don't worry if you don't understand it all right away, just keep reading. It should all start to make sense eventually.

To find a line between the apples and bananas, we'll assign weights to an x and a y coordinate, add a bias b, and set the equation equal to 0. Thus, we have an equation for the line w1*x + w2*y + b = 0.

We call the vector w a weight vector (bold letters denote vectors), which in this case looks like [w1, w2]. The number of weights will be equal to the number of variables in any input vector, which is called the dimensionality of the input vector. Technically, a perceptron tries to find a hyperplane to separate points in an n-dimensional space. In this case, that hyperplane is a line.
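To make the discriminant concrete, here is a tiny sketch in Python (the weights, bias, and points here are made up for illustration, not trained values):

```python
# A hypothetical discriminant line w1*x + w2*y + b = 0.
w1, w2, b = 0.5, -1.0, 2.0

def side_of_line(x, y):
    """Positive on one side of the line, negative on the other, zero on it."""
    return w1 * x + w2 * y + b

print(side_of_line(1.0, 1.0))  # 1.5 -> positive side
print(side_of_line(0.0, 2.0))  # 0.0 -> exactly on the line
```

Classifying a point is just a matter of checking the sign of this expression, which is what the perceptron below will do once it has learned suitable weights.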

Code (technical details)

If you are math and/or code averse, but still interested, you can skip this section and the next, right to "Application".

I've coded an example perceptron in Python, using gnuplot (through the gnuplot.py bindings) to draw pretty graphs and numarray for the math. The code definitely runs on at least Windows and Linux, and it should run on OS X.

Algorithm

What we have is an array x of data points (each point itself a vector) and an array y of classifications. y has the same length as x, and an element yi in y is either 1 (banana) or -1 (apple). We'll pass these into our perceptron along with a number n, where 0 < n < 1.
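For example (with made-up measurements), the inputs might look like this, written as plain Python lists:

```python
# Each row of x is one piece of fruit: [size, yellowness] (invented values).
x = [[3.0, 2.0],   # small, not very yellow -> apple
     [3.2, 2.5],   # apple
     [7.5, 9.0],   # large, very yellow -> banana
     [8.0, 8.5]]   # banana

# y[i] is the known class of x[i]: -1 for apple, 1 for banana.
y = [-1, -1, 1, 1]

n = 0.1  # learning rate, 0 < n < 1
```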

The parameter n is called the learning rate; it controls how quickly the perceptron converges. If n is too high, the discriminant line will swing wildly. Too low, and it won't change quickly enough. n only changes the rate at which the perceptron learns, so as long as the data is linearly separable, it will still converge no matter what n you choose.

`def perceptron(x, y, n):`

Now we set our weights, our bias, a mistake counter, an iteration counter, and a mistake-free flag all to 0. "na." stands for numarray, my linear algebra library. na.zeros(N, type) creates a zero vector of length N with elements of the given type.

```
w = na.zeros(x[0].shape[0], na.Float32)  # weights
b = 0              # bias
k = 0              # error count
iteration = 0      # number of iterations
mistake_free = 0
```

Our main loop is simple. When the perceptron has made a mistake classifying one of the pieces of data, mistake_free is set to false, and we want to continue trying to find a discriminant line. In the case where the input data is not separable, we want to stop iterating our loop at some point. In this case, I've chosen 100 as the iteration limit.

```
while not mistake_free and iteration < 100:
    iteration += 1
    mistake_free = 1
```

Each iteration, we want to loop through x (our known data points) and calculate the perceptron's current classification of each one. If the perceptron correctly identifies each banana as a banana and each apple as an apple, we're done. Otherwise, we're going to have to change w and b.

```
    for i in range(len(x)):
        actual_out = na.dot(w, x[i]) + b  # <w, x[i]> + b
        if y[i] * actual_out > 0:  # correctly classified; skip to the next point
            continue
```

(Note that the test must be strictly greater than zero: with the weights and bias initialized to zero, every point initially produces an output of 0, and a test of >= 0 would count that as a correct classification and never update anything.)

If we made a mistake, we update the weights such that w = w + n * y[i] * x[i]. Remember that n is our learning rate and y[i] is the correct classification of the current point x[i]. Also, set the bias so b = b + y[i] - actual_out, where actual_out is the incorrect output of the perceptron for x[i].

The reason for the weight updates is fairly simple. If the perceptron misclassified a banana as an apple, we want to move the discriminant down a bit, so we subtract n*x[i] from w. Conversely, we add n*x[i] to w if it was too low. The bias update works similarly.

```
        w += n * y[i] * x[i]
        b += y[i] - actual_out
        k += 1
        mistake_free = 0
```

Finally, we just return our weights, our bias, and our mistake counter. While we don't strictly need to return the mistake counter, it's good to know how long it took the perceptron to classify your data correctly. It can help you choose a better learning rate or see that your data is poorly separated.

` return (w, b, k)`
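Putting the fragments above together, here is the whole function in one place. This sketch substitutes NumPy (numarray's modern successor) for the `na.` calls, but the logic is otherwise the same:

```python
import numpy as np

def perceptron(x, y, n):
    """Find a separating line for points x with labels y (1 or -1).
    n is the learning rate, 0 < n < 1. Returns (weights, bias, error count)."""
    x = np.asarray(x, dtype=np.float32)
    w = np.zeros(x.shape[1], dtype=np.float32)  # weights
    b = 0.0            # bias
    k = 0              # error count
    iteration = 0
    mistake_free = False
    while not mistake_free and iteration < 100:
        iteration += 1
        mistake_free = True
        for i in range(len(x)):
            actual_out = np.dot(w, x[i]) + b
            if y[i] * actual_out > 0:   # correctly classified
                continue
            w += n * y[i] * x[i]        # nudge the line toward the point
            b += y[i] - actual_out
            k += 1
            mistake_free = False
    return w, b, k
```

Running it on the toy apple/banana data converges in a handful of iterations; the returned error count k tells you how many updates it took.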

Application

So, now that we've got our data all gathered and our perceptron set up, it's time to apply it. To do so, we put the coordinates of the data points in x, and their corresponding classes, represented by 1 or -1, in y. The result is a plot of the two clusters with the discriminant line drawn between them. (It will look slightly different if you run the program yourself, since the points are generated randomly.)

We now have a set of weights [w1, w2] and a bias b which make it easy to determine if an unknown piece of fruit is probably a banana or probably an apple. Given its size (s) and its yellowness (t), we simply plug these two numbers back into our original equation, w1*s + w2*t + b, and if the result is positive, we have a banana. If the result is negative, we have an apple. If the result is 0, we have a point on the discriminant line, which means we're unsure about which it is.
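In code, the decision rule is just the sign of that expression. (The trained values w1, w2, and b below are hypothetical stand-ins for whatever your own run produced.)

```python
def classify(w1, w2, b, size, yellowness):
    """Classify a fruit from trained weights and bias."""
    result = w1 * size + w2 * yellowness + b
    if result > 0:
        return 'banana'
    if result < 0:
        return 'apple'
    return 'unsure'  # exactly on the discriminant line

# With hypothetical trained values:
print(classify(0.15, 0.5, -3.75, 8.0, 8.5))  # banana
print(classify(0.15, 0.5, -3.75, 3.0, 2.0))  # apple
```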

History

Perceptrons were introduced in 1957 by Frank Rosenblatt, with the intention of reproducing the way that brains work. For a time, especially in the early 1960s, they were researched quite heavily. In 1969, Minsky and Papert wrote Perceptrons, a book which highlighted the limitations of the technique. They proved that a perceptron could only solve linearly separable problems, famously showing that it could not compute the binary XOR function.

Before Perceptrons, researchers had been creating networks of perceptrons, but almost nobody had studied their limitations. The book stifled nearly all research into them during the 1970s and early 1980s. However, in the 1980s, researchers discovered that multilayer perceptrons did not have the limitations that Minsky had conjectured they did, and neural network research began again in earnest.

Conclusions

For simple two-class classification tasks, perceptrons can function as a quick and dirty method of finding a discriminant. They will run quickly, and are guaranteed to find a solution if one exists. Furthermore, the algorithm is very simple to implement and widely studied.

The real power of the perceptron, however, lies in networking them. A series of interconnected perceptrons, in modified form, is the basis of a neural network. A neural network with a single hidden layer can approximate any continuous function, and it will perform better on poorly separated and noisy data. Of course, with this power comes difficulty, as neural networks are fairly complicated and difficult to use.

Bibliography

Bishop, Christopher M. Neural Networks for Pattern Recognition, Oxford University Press, 1995.

Cristianini, Nello. Support Vector Machines, Cambridge University Press, 2000.

Minsky, M. L. and Papert, S. A. Perceptrons, MIT Press, 1969.

Perceptrons, http://en.wikipedia.org/wiki/Perceptrons.

Rosenblatt, Frank. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Cornell Aeronautical Laboratory, Psychological Review, v65, No. 6, pp. 386-408, 1958.

 Perceptrons: Intro to Machine Learning | 48 comments (19 topical, 29 editorial, 0 hidden)
 Good article but why don't you talk about the (1.48 / 27) (#3) by Tex Bigballs on Tue Nov 11, 2003 at 06:05:36 PM EST

 autobots too? Optimus Prime always kept their punk asses in check
 bananas and math don't mix (1.12 / 8) (#13) by dimaq on Wed Nov 12, 2003 at 05:12:27 AM EST

 besides the article is way too long for the amount of knowledge it tries to convey.
 k-means clustering (2.25 / 4) (#27) by Garc on Wed Nov 12, 2003 at 02:14:55 PM EST

I may be talking out of my rear end here... so I apologize up front :) I wonder if k-means clustering would be applicable to this sort of decision process. You could cluster the items, then make sure that they were clustered together correctly (linear separability... sorta). To decide on a new item's class you check nearness to existing clusters. The nice thing about this is that it could scale to more than 2 classes, such as bananas, apples, and grapes, and would still work in N dimensions. I can imagine some scenarios where this wouldn't work, although I imagine it'd be pretty decent for real world applications. Any thoughts or criticisms? regards, garc --Tomorrow is going to be wonderful because tonight I do not understand anything. -- Niels Bohr
 sure (none / 2) (#28) by llimllib on Wed Nov 12, 2003 at 03:35:07 PM EST

k-means clustering would definitely work for this sort of thing, but in a different way. Note, this comes from having just read the section in Bishop (referenced in the bibliography) and MathWorld, not from having actually worked with the technique. Maybe I'll try it. Anyway, k-means clustering definitely would work for more than one subset of the data, as long as you knew how many subsets you wanted to create. It seems to just take the mean of each subset and classify a new point as belonging to the subset with a mean vector closest to it. As MathWorld notes, this doesn't guarantee an optimal separation of subsets, or even a good one, but it is simple. In this respect, it is quite similar to a perceptron. Peace.[ Parent ]
 k-means won't work on the above set (2.40 / 5) (#29) by JoeBayesian on Wed Nov 12, 2003 at 03:48:15 PM EST

 k-means clustering would work very well IF you have nicely separated clusters such that points from different clusters are almost never closer than points from the same cluster. Note that a serious disadvantage of the k-means is that you need to know beforehand what the number of clusters is. In real world applications this is rarely the case. In general, k-means is simply an exploratory algorithm before anything more serious is applied to a dataset. [ Parent ]
 Supervised vs Unsupervised (3.00 / 4) (#37) by dmazzoni on Thu Nov 13, 2003 at 04:14:08 AM EST

K-means clustering would be appropriate if you had the chart of size vs yellowness without the labels, i.e. nobody told you which points were the bananas and which points were the apples.  This would be an _unsupervised_ learning problem, where you're trying to separate the data into multiple classes without any prior information about the classes. The problem given, and the problem that perceptrons solve, is a _supervised_ learning problem.  It assumes that you have "labels" for the points, telling you which ones are bananas and which ones are apples.  Having this information means that you can tend to get much better results.  Using K-means would be throwing away this useful information for no reason. That doesn't mean K-means couldn't be used as one piece of a larger hybrid machine learning strategy, of course. [ Parent ]
 Perceptrons...Bah! Humbug! (2.20 / 5) (#32) by OzJuggler on Wed Nov 12, 2003 at 11:42:18 PM EST

 This article may have been interesting to those who know nothing about subsymbolic AI. But those of us (such as myself) who've done just a little bit of study on the topic will surely bristle with indignation at the mere mention of the word "Perceptron". This longish technical article even finishes by accurately concluding that perceptrons are practically useless. I agree that the symbolic AI learning topics are too difficult for an introduction. Predicate calculus and ATMs are utterly unintuitive to me. But for subsymbolic learning, Neural Nets are still the one to beat (with GAs being a close second). Perceptrons should remain a mere footnote in CogSci history. You say a unit in an NN is a Perceptron in "modified" form. Let's be honest about this - it stops being a Perceptron the moment you add the S-curve summation and a learning algorithm better than Hebbian. A single unit in an NN is not a Perceptron. But there's nothing to learn from Perceptrons that you can't instead learn from a proper modern neural network. I'm just basically astonished that anyone wasted so much time writing about something as useless as perceptrons. I found an introduction based on NNs and real human neurones to be more stimulating - and useful. I mean, when you've had the best why try the rest? :-) "And I will not rest until every year families gather to spend December 25th together at Osama's homo abortion pot and commie jizzporium." - Jon Stewart's gift to Bill O'Reilly, 7 Dec 2005.
 totally (none / 0) (#34) by benxor on Thu Nov 13, 2003 at 12:56:02 AM EST

 I mean, what was he thinking? =) -- all generalisations are false - including this one[ Parent ]
 well, obviously I disagree (none / 1) (#35) by llimllib on Thu Nov 13, 2003 at 02:32:17 AM EST

I think that, by their very simplicity and the simple assumptions they make about the world, they make an ideal entrance into machine learning (I consider artificial intelligence to be something different than you do, I think). They are simple, but they also force you to learn a lot of the basic elements of statistical learning machines - vectors, n-dimensionality, hyperplanes, and other mathematical concepts that you've never thought about in this context. I know that, for me, gaining a deeper understanding of perceptrons has helped my understanding of more complicated machine learning topics such as NNs. Peace.[ Parent ]
 What? (none / 0) (#39) by tkatchev on Thu Nov 13, 2003 at 07:08:46 AM EST

You are severely outdated. Where I come from, "perceptron" means "multi-layered fully-connected neural network".    -- Signed, Lev Andropoff, cosmonaut. [ Parent ]
 I liked it (none / 0) (#48) by bluehound on Fri Jan 23, 2004 at 04:42:43 PM EST

 This article may have been interesting to those who know nothing about subsymbolic AI That'd be me... I did enjoy it :) -shane[ Parent ]
 I liked the article (none / 1) (#33) by gmol on Wed Nov 12, 2003 at 11:55:59 PM EST

 Good use of code, it was pretty short, and didn't use any pictures and conveyed an idea clearly.
 thanks (none / 0) (#36) by llimllib on Thu Nov 13, 2003 at 02:32:50 AM EST

 textified Peace.[ Parent ]
 Other supervised learning algorithms (3.00 / 9) (#38) by dmazzoni on Thu Nov 13, 2003 at 04:32:58 AM EST

 Here are some other algorithms that could learn to classify bananas and apples: K-nearest-neighbor: This is a very simple algorithm, but it works well enough that you shouldn't ignore it.  Far simpler than a perceptron.  Say that somebody tells you the size and yellowness of a new piece of fruit and you want to guess whether it's a banana or an apple.  Plot the point on the chart, as linked at the top of the article, then find the nearest point.  If that point is a banana, call the unknown fruit a banana, otherwise call it an apple.  As a generalization, take the k nearest "neighbors" to that point instead of just the nearest one, and let them vote.  So if the 3 nearest points are two apples and one banana, you'd guess that the point is an apple. Advantages: No training time required. Very simple, but often effective. Disadvantages: For complicated problems, finding the k nearest neighbors could require searching all of the points. Breaks down for difficult problems. Neural Networks - the generalization of a perceptron. Can learn much more complicated functions, including things that aren't linearly separable. Advantages: Can be very accurate. While training is slow, classifying a new point using a neural net is usually very fast - much faster than kNN at least. Disadvantages: Training can be very slow. Training is not deterministic.  You may have to try training the neural network dozens of times before it works. May not generalize to new data well - can often underfit or overfit the data. Support Vector Machines - relatively new compared to kNN and Neural Networks, SVMs are the best known example of the class of maximum margin classifiers, which are supervised learning algorithms that try to in a sense find the most general possible mathematical fit for the training data.  
Visually, they try to find two parallel lines, instead of one line, such that all of one class (bananas) is on one side, the other class (apples) is on the other side of the other line, and nothing is in the middle.  That's oversimplifying, but the idea is that it tries not to make any assumptions about parts of the space it doesn't know about.  SVMs work better than just about any other method; the main problem is that they're still new and not as popular yet, and they're still slower (more computationally challenging). Advantages: Best accuracy on new test data. Training algorithm is deterministic and guaranteed to converge to the same optimal solution every time. Disadvantages: Slow to train and slow to classify new points (relative to previous two algorithms) Not as popular and well-known, means it's harder to learn how to use them properly. For more information on Support Vector Machines, check out libsvm. For more information on Machine Learning in general, check out Machine Learning on the Wikipedia.
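The k-nearest-neighbor rule described in this comment can be sketched in a few lines of Python (the fruit data below is made up, not from the article's program):

```python
import math
from collections import Counter

def knn_classify(points, labels, query, k=3):
    """Vote among the k labeled points nearest to query."""
    dists = sorted(
        (math.dist(p, query), label) for p, label in zip(points, labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

points = [(3.0, 2.0), (3.2, 2.5), (7.5, 9.0), (8.0, 8.5), (7.0, 8.0)]
labels = ['apple', 'apple', 'banana', 'banana', 'banana']
print(knn_classify(points, labels, (7.8, 8.7)))  # banana
```

Note that, unlike the perceptron, there is no training step at all: every query searches the stored points directly, which is exactly the scaling concern raised below.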
 KNN search (none / 0) (#46) by Arkaein on Sat Nov 22, 2003 at 12:38:33 AM EST

Finding the K-Nearest neighbors in a dataset should not be all that difficult (you certainly do not need to search each datapoint!) First you need to organize the data in a searchable structure. A KD-tree works nicely and is a structure generalized to any number of dimensions. Then we need a search algorithm. This is a little bit trickier, but basically we start with the root node of the tree (which is more or less at the center of the bounding volume of our dataset in n-space, depending on actual implementation) and search the tree, building a queue of the closest subtrees. As we descend the tree we quickly eliminate most of the data as being too far in n-space from the target point to be one of the k nearest neighbors. Eventually every subtree has been eliminated or its data point examined and kept or discarded as being one of the k nearest neighbors. In practice only a tiny fraction of the number of data points need to be checked (though the numbers are worse the farther the test point is from the existing set). ---- The ultimate plays for Madden 2003 and Madden 2004<[ Parent ]
 Re: KNN search (none / 0) (#47) by dmazzoni on Sat Nov 29, 2003 at 03:48:27 AM EST

 Finding the K-Nearest neighbors in a dataset should not be all that difficult (you certainly do not need to search each datapoint!) That's totally correct - for many problems, a KD-tree can help you find the k nearest neighbors in a fraction of the time. However, there are some problems, some of the most interesting ones in fact, where the dimensionality is very high and KD-trees don't end up helping at all.  I've seen plenty of examples where a KD-tree ends up touching half of the datapoints each search - and because of the overhead of the KD-tree, that ends up being SLOWER than just doing the linear scan!  In a case like this, it's usually necessary to settle for approximate nearest neighbors instead of exact ones. [ Parent ]
 Nice and neat; but how about ... (none / 0) (#43) by tyroneking on Thu Nov 13, 2003 at 05:53:00 PM EST

 ... walking a bit further down the path and doing an article about multi-layer networks next - I see a useful series in the offing. Don't forget WISARD (great for broken biscuits I hear)...
 what is WISARD? (none / 0) (#44) by llimllib on Thu Nov 13, 2003 at 09:21:43 PM EST

 and I hope to do more articles. Peace.[ Parent ]
 What is WISARD? (none / 0) (#45) by tyroneking on Fri Nov 14, 2003 at 08:01:25 PM EST

Oh boy, am I not an expert on this, but here goes (and apologies in advance) ... When I did a short course on NNs at college we covered perceptrons, back-props, Kohonen NNs and Boltzmann etc. Can't say I really understood it all (apart from the Boltzmann idea which was really neat). We also spent some time on the WISARD, which was a hell of a lot easier to understand. Here's a link to a simple page about WISARD with a groovy java applet http://richardbowles.tripod.com/neural/wisard/wisard.htm Probably what this page misses is the fact (at least in my mind) that WISARD recognises shapes regardless of their orientation, so it's good for recognising broken biscuits in a factory that makes, err, biscuits. It is also good at spotting burglars moving around in a truck depot at night. Not sure why those two things fascinated me so much, but they are so much more interesting than predicting diabetes in a prone population or cardiac arrests in chest pain admissions. WISARD is unique in many ways; first it is weightless, which I guess is one reason why it is very quick to train; second, it uses RAM and boolean logic, so it can be constructed using existing hardware and it runs very quickly. It was one of the first NNs to be implemented commercially. WISARD was developed by a chap with a very cool name, Igor Aleksander, who works at Imperial College, London. http://www.iis.ee.ic.ac.uk/aleksander.html Nothing to do with WISARD, but if you really get into this, and you have loads of free time, you might want to see what Dr. Zuhair Bandar and Jim O'Shea are doing at http://www.doc.mmu.ac.uk/RESEARCH/Intelgrp 'cause I seem to recall them doing something about using NNs to determine if people were lying in interviews ... oooer. As usual with NNs, the theory is oh-so-dull, but the largely un-sung practical applications are mind-blowing. [ Parent ]