I broke this into segments (Part 1, Part 2, ..., Part 4) so you can take some time to reflect (or go do some homework for school, or whatever ;-) before going on to the next segment, if that works out for you. There are natural fracture planes between Part 1 (deep background) and Part 2 (computational issues), and again before Part 4 (hardware considerations), so those transitions are good places to take a break.
This is optional. There's a lot of cool stuff to learn, but you won't get left behind if you can't get to it or if you get less than all of it. Skim, then go back and dig deeper on the topics that grab you. This is a team effort, so no one person is responsible for all the code, and no one person is responsible for all the technology behind all the code -- but knowledge is power, and the more you understand, the better you can do your part.
Some of you will dive right in and search out what you can find. You will learn far more than you could ever use in this project, and you will be in a position to make design decisions for the whole group.
Some, perhaps most, of you will wait until after school lets out and you have time to dig in, and that's OK.
Some of you have other priorities, or may lack some of the technical background that makes this stuff easy reading. That's OK too; do what you can and ask questions. That's what I'm here for. I want to help you make this a rewarding experience.
If nothing happens between you and this project before July 17, that's OK too. This is a workshop, so there will be things to code, and your other team members will be able to explain what they need you to do.
If nothing at all happens before then, we -- you and I together -- will work things out on July 17, but it might be less impressive than if you take the initiative and come ready to hit the ground running.
If there are a lot of questions in directions I didn't anticipate, I'll probably come back and edit the new information into this document. Maybe I'll do that anyway, as I think of more stuff. Think of this as a work in progress -- sort of like your project.
How Do We Recognize Objects?
Visual Object Recognition, Lecture 1, Lecture 2, Lecture 3
How Does the Brain Solve Visual Object Recognition?
MIT Vision Course, #5
and links to some software efforts: Recognition: Objects, Humans, Activities
Here are a couple of tutorials to get you started learning about Neural Nets:
Deep Learning Simplified
Andrew Ng's Stanford Lectures
The problem with neural networks is that (not entirely like human learning) they take a huge amount of training. The MNIST database, derived from handwriting samples collected by the National Institute of Standards and Technology (NIST), contains 60,000 scanned images of handwritten numerals for training neural nets (the randomized sets are described and available here). The fundamental difference between machine and human learning is that neural nets (so far) are more like B.F. Skinner's conditioned-response model of human learning, which is inadequate for the full range of human activity -- consider that it cannot explain Einstein's famous "thought experiments." Although some autonomous car projects are using neural net methods, the best human drivers are also taught to think about their possible driving situations and plan for eventualities they have never experienced. There is no way for conditioned response (and neural nets as presently implemented) to do that, although "deep learning" research is trying to overcome that limitation.
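To make "training" concrete, here is a minimal sketch of the kind of training loop involved, in Python with numpy (my choice of language for illustration, not a project requirement). Random placeholder data stands in for the real 28x28-pixel images, and this is only a single-layer classifier; a real digit recognizer would need hidden layers and many more passes over the data.

    import numpy as np

    rng = np.random.default_rng(0)

    # Random placeholder data standing in for the real 28x28 NIST/MNIST images.
    X = rng.random((1000, 784))          # each row: one flattened 28x28 "image"
    y = rng.integers(0, 10, size=1000)   # digit labels 0..9

    W = np.zeros((784, 10))              # one weight per pixel per digit class
    b = np.zeros(10)

    for step in range(100):              # a real net trains for many more passes
        scores = X @ W + b
        scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
        p = np.exp(scores)
        p /= p.sum(axis=1, keepdims=True)             # softmax probabilities
        p[np.arange(len(y)), y] -= 1     # now p holds the cross-entropy gradient
        W -= 0.1 * (X.T @ p) / len(y)    # gradient-descent step
        b -= 0.1 * p.mean(axis=0)

    pred = (X @ W + b).argmax(axis=1)
    print("training accuracy:", (pred == y).mean())

Even this toy version takes thousands of examples to learn anything; that is the scale problem the next paragraph is about.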
In any case, we have neither sufficient data sets -- 60,000 images to recognize numerals, how many more to recognize pedestrians? -- nor the time to do the necessary training in one summer workshop. On the other hand, we can make a good dent in the problem by thinking about what needs to be seen. That is the approach I recommend, but as they say, "Your mileage may vary."
Objects in a visual scene might be distinguished by two -- maybe three or four -- characteristics:
1. They have a "color" distinct from their surroundings, and
2. There is an edge surrounding them separating their "color" from that of their surroundings.
3. Some objects have a typical shape -- including the shapes and placement of component parts -- by which they can be recognized.
4. Objects that move over time have a fourth quality (that motion), where the "color" and the surrounding edge (both shape and component parts) move together as a unit, separately from the rest of the scene.
In more detail...
1. I put the word "color" in quotes because this is more than just the average color of the object (like what you see when they show a photograph or video of a person on TV but blur the face into unrecognizability), although that is certainly important.
The texture of that coloration, whether stripes or dots or some other repeated pattern, is part of the "color". A honeycomb is a uniform array of hexagonal cells, which can be recognized apart from its golden color. A brick wall is the same geometry, except flattened in one dimension to form long straight rows of bricks, traditionally dark red over all. Dots and stripes usually come with one distinctive color for the dots or stripes and another color for the background, but the pattern often dominates over the colors for purposes of recognition. The fact that the pattern is generally confined to the object and not its surroundings helps to distinguish the object from its background.
The human eye apparently does something like a Fourier transform (we'll get more into that, later) on the texture to recognize repeating patterns very quickly. It also tends to cause us to see uniformity even when there is none, or to see regions of near-uniformity as separate objects. For example, trees might be very nearly the same color overall, but a different leaf size (and therefore pattern size) would help the human eye see the boundaries between those different varieties of tree.
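To see how a Fourier transform exposes a repeating pattern, here is a short sketch in Python with numpy: it builds a synthetic striped "texture" and recovers the stripe spacing from the strongest frequency. The sizes and noise level are made-up numbers, just for illustration.

    import numpy as np

    # Synthetic texture: vertical stripes repeating every 8 pixels, plus noise.
    rng = np.random.default_rng(1)
    x = np.arange(64)
    row = np.sin(2 * np.pi * x / 8)                # one full stripe every 8 pixels
    image = np.tile(row, (64, 1)) + 0.3 * rng.standard_normal((64, 64))

    # Average the rows, transform, and skip the zero-frequency (average) term.
    spectrum = np.abs(np.fft.rfft(image.mean(axis=0)))
    k = spectrum[1:].argmax() + 1                  # strongest repeating frequency
    print("stripe period ~", 64 / k, "pixels")     # prints about 8.0

Two regions with the same average color but different dominant frequencies (big leaves vs. small leaves) would show peaks in different places, which is one way a program could find the boundary between them.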
When we speak of "color" in human vision, we mostly ignore brightness. Yellow and brown are the same hue; yellow is simply brighter. The human eye distinguishes them only when both appear under the same light, against neighboring colors that are recognizably the same. This is because we will often see an object in bright sunlight where one side is yellow and the other side is brown. When that object turns, or the shadow moves, the same object is seen with differing amounts of yellow and brown, but we still see it as the same object. Perhaps the eye's internal processing automatically deprecates brightness in favor of other features.
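One simple way a program can "deprecate brightness" the same way is to divide it out, leaving only the proportions of red, green, and blue. A minimal sketch in Python with numpy (the yellow and brown values are made-up illustrations):

    import numpy as np

    def chromaticity(rgb):
        """Divide out brightness: (r, g, b) -> fractions that sum to 1."""
        rgb = np.asarray(rgb, dtype=float)
        return rgb / rgb.sum()

    yellow = (255, 255, 0)        # bright
    brown = (128, 128, 0)         # the same hue, darker

    print(chromaticity(yellow))   # [0.5 0.5 0. ]
    print(chromaticity(brown))    # [0.5 0.5 0. ] -- identical once brightness is gone

After the division, the sunlit side and the shadowed side of the same object come out with the same numbers, which is roughly the constancy the eye achieves.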
2. We see "edges" where the coloration and patterns of one part of the image cease and another begins. Sharp edges make things easier, but the human eye will see edges where none exist, like the boundaries between the colors of a rainbow. We can see objects that have sharp edges but no colors (like line drawings), and also objects that have distinct colors but no sharp edges (like a photograph out of focus).
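A program can find such edges by looking for places where brightness changes sharply from one pixel to the next. A minimal sketch in Python with numpy, using the gradient of a toy grayscale image (the threshold of 30 is an arbitrary made-up number; real images need tuning):

    import numpy as np

    def edge_strength(gray):
        """Approximate edges as the magnitude of the brightness gradient."""
        gy, gx = np.gradient(gray.astype(float))   # change down rows, across columns
        return np.hypot(gx, gy)

    # Toy image: a dark square on a light background.
    img = np.full((8, 8), 200.0)
    img[2:6, 2:6] = 50.0

    edges = edge_strength(img) > 30   # threshold is arbitrary; tune for real data
    print(edges.astype(int))          # the 1s trace the square's outline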
3. The human mind appears to be hard-wired to recognize faces: two eyes, a mouth, and sometimes a nose, in a particular arrangement within a more or less oval background ranging from near-white to dark brown, but not blue or green. The colors are less important than the features for recognition. Even the surrounding oval is unimportant.
We also learn to recognize a standard vehicle (two wheels under a rounded or boxy body in the side view), and standard house shapes, as well as trees and animals. An outline need only approximate the stereotype to be recognized as an instance of it. Notice that the house shape example here also resembles a face.
People are generally taller than they are wide, may have outstretched arms, and often (but not always) separated legs, with a rounded head on top. The shape recognition is so strong that we instantly recognize a person as lying down or upside-down if the orientation is not normal. Humans bend in particular ways, so a seated person is again quickly recognizable.
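Full shape recognition is beyond a short sketch, but even the crude "taller than wide" observation can be coded. A Python/numpy illustration (the 2x ratio and the toy blob are made-up numbers, a starting point rather than a finished detector):

    import numpy as np

    def looks_like_standing_person(mask):
        """Crude shape test: is the object's bounding box much taller than wide?"""
        rows = np.flatnonzero(mask.any(axis=1))   # rows containing object pixels
        cols = np.flatnonzero(mask.any(axis=0))   # columns containing object pixels
        if rows.size == 0:
            return False
        height = rows[-1] - rows[0] + 1
        width = cols[-1] - cols[0] + 1
        return height >= 2 * width                # people are roughly 2-4x taller than wide

    # A 12-tall by 3-wide blob passes; lying down (3 by 12), it would not.
    blob = np.zeros((20, 20), dtype=bool)
    blob[4:16, 8:11] = True
    print(looks_like_standing_person(blob))       # True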
Because we have these mental models of standard objects, we can also infer distance from the relative size of the objects in the image. Cameras have had zoom lenses for as long as most of us can remember, but human eyes do not zoom. That's why the curved mirrors on the right-hand side of most American cars carry the warning that "objects in mirror are closer than they appear." We actually make judgments about distance from the size of the object on our retinas.
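The relationship behind that judgment can be written down directly: under a simple pinhole-camera model, apparent size shrinks in proportion to distance. A Python sketch with made-up example numbers:

    def distance_from_size(real_height_m, image_height_px, focal_length_px):
        """Pinhole-camera estimate: apparent size shrinks in proportion to distance."""
        return real_height_m * focal_length_px / image_height_px

    # Made-up numbers: a 1.7 m pedestrian appearing 100 pixels tall, seen by a
    # camera whose focal length works out to 800 pixels.
    print(distance_from_size(1.7, 100, 800), "meters")   # 13.6 meters

This only works because we already know (from the mental model) how tall a pedestrian really is; the curved mirror breaks the estimate by changing the effective focal length.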
4. When a collection of features (color, pattern, edges, shape) moves together across a scene, it is immediately established as a single object. When the whole scene grows (or shrinks) or shifts from one side to the other, the motion is attributed to the viewer, not the objects. I have sat in a bus looking out the window at the next bus in the station (the same thing happens in a train station) when the next bus over starts to move; because it fills my view, my first reaction is always that "we are moving," even though I did not feel any motion. Movies can induce vertigo using that technique, while the audience sits motionless in their seats.
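A first step toward using motion this way is frame differencing: compare two video frames and mark the pixels that changed. A minimal sketch in Python with numpy (the threshold and the toy frames are made-up):

    import numpy as np

    def moving_pixels(frame1, frame2, threshold=20):
        """Mark pixels whose brightness changed between two frames."""
        return np.abs(frame2.astype(int) - frame1.astype(int)) > threshold

    rng = np.random.default_rng(2)
    a = rng.integers(0, 50, (8, 8))        # a static scene (random toy brightness)
    b = a.copy()
    b[3:6, 3:6] += 100                     # an "object" brightens as it moves in

    print(moving_pixels(a, b).astype(int)) # the 1s mark where the object moved

A real system would also check whether nearly all of the pixels changed together; as the bus story illustrates, whole-scene motion points to a moving camera (or viewer), not a moving object.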
The human eye integrates stepwise displacement into smooth motion when the steps arrive somewhere around every 20-100 ms (roughly 10 to 50 frames per second). At slower rates we can see the flicker but compensate for it; at faster rates we cannot see it at all.
Any questions or comments? This is your project.
Tom Pittman
Next time: Computational Image Analysis
Rev. 2018 June 22