By John Lahr
It’s official: machines that can see (and drive) are now among us. This week Waymo announced that its self-driving cars are on the road in Arizona, and the self-driving startup Embark has autonomous trucks hauling goods from Texas to LA.
Computer vision has been a dream since serious research began in the 1950s, and it stayed a dream until very recently, even though AI pioneer Marvin Minsky tried to bring it to fruition in 1966 by famously telling a grad student, “Connect a camera to a computer, and have it describe what it sees.” Poor guy; 50 years on, we’re just scratching the surface.
So why has computer vision been such a tough nut to crack? First, because sight is one of the most complex processes we’ve ever tried to understand. Second, because beyond the broad macro steps, we still haven’t figured out how the body actually pulls it off.
How, then, were we able to “connect a camera to a computer, and have it describe what it sees”? It comes down to solving three key parts of the process we call “sight”:
- Seeing images in the first place
- Describing the image and what’s in it
- Understanding what the objects in the image are, and the context around them
What is “seeing” really?
When someone throws a frisbee across a field, a lot has to happen before the person on the other end can catch it. Light reflected off the frisbee hits their retina, where some preliminary analysis happens before the signal gets sent along to the brain.
The first stop in the brain is the visual cortex, which analyzes the image more thoroughly. Once the visual cortex has broken it down, the rest of the cortex compares the image against everything the brain already knows. That means classifying the objects in the scene, gauging their size and dimensions, putting what is happening in context, and so on, until the brain settles on a course of action: raising an arm and catching the frisbee.
From a computer science standpoint, these three stages of sight are problems that get drastically harder the further along you go. Replicating the eye is hard, replicating the visual cortex is really hard, and replicating the brain’s contextual understanding is quite possibly the most complex task humans have ever attempted.
Recreating the eye is where we’ve been most successful. Cameras, sensors, and image processors have not only matched the human eye but exceeded it in many respects. We can see at vastly greater distances, with more clarity than was ever thought possible, and even in the dark or in wavelengths of light invisible to the human eye. With ever larger and more precisely engineered lenses, paired with image-sensor subpixels fabricated at nanometer scale, we can record thousands of images per second and see more than ever before.
However, for all that optical power, even the telescopes we use to observe other galaxies can’t tell what they’re looking at without help. It is the software behind the lens that does the heavy lifting, and that software is the harder piece to get right.
What’s in a frame?
So how do developers begin to write software that replicates the visual cortex? The first challenge is to differentiate objects and find patterns in the disorganized noise of an image. Our brains manage this with neurons that excite one another when they detect contrast along a line, or rapid motion in a particular direction.
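Software can mimic that “contrast along a line” behavior with a simple convolution: slide a small filter over the image and fire wherever brightness changes sharply. The sketch below is a minimal illustration of the idea, not any particular system’s implementation; the toy image and the hand-rolled filter loop are stand-ins for real data and real libraries.

```python
# Minimal sketch: Sobel-style filtering makes pixels "fire" where
# brightness changes sharply, much like edge-sensitive neurons.
import numpy as np

def slide_filter(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel across the image (a cross-correlation),
    summing the weighted neighborhood at each position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Sobel kernels respond to horizontal and vertical contrast respectively.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# Toy image: a dark square on a light background.
image = np.full((8, 8), 0.9)
image[2:6, 2:6] = 0.1

gx = slide_filter(image, sobel_x)
gy = slide_filter(image, sobel_y)
edges = np.hypot(gx, gy)  # strong values trace the square's outline
print(np.round(edges, 1))
```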
The next layer of networks aggregates these patterns into meta-patterns, and the process continues upward as further networks identify colors, textures, motion, and direction. As more information is layered on, a picture begins to form out of the mess of complementary descriptions.
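To make the layering idea concrete, here is a hedged follow-on sketch: the two directional edge maps from a first layer feed a trivial second layer that responds only where both overlap, which happens at corners; a meta-pattern assembled from simpler patterns. The filters and toy image are illustrative assumptions, not a real vision pipeline.

```python
# Layer 1 finds directional edges; layer 2 combines them into a
# meta-pattern (corners) by keeping only places where both respond.
import numpy as np
from scipy.signal import convolve2d

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

image = np.full((8, 8), 0.9)
image[2:6, 2:6] = 0.1  # dark square on a light background

# Layer 1: simple patterns (horizontal- and vertical-contrast maps).
gx = np.abs(convolve2d(image, sobel_x, mode="valid"))
gy = np.abs(convolve2d(image, sobel_y, mode="valid"))

# Layer 2: a meta-pattern. The product is large only where both
# edge responses overlap, i.e. near the square's four corners.
corners = gx * gy
print(np.round(corners, 2))
```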
What is the object?
Once you can find lines and distinguish objects, the next question becomes: what is the object? Early computer vision research centered on a very specific version of that problem: “How can we tell if there is a tank in the woods?” (thanks, Cold War).
Researchers started by describing to the computer what a tank should look like. A tank looks like *this* and moves like *this*, except when you view it from the side, where it looks more like *this*, or when the turret is rotated, in which case it looks like *this*, and so on.
For select objects in controlled environments, this brute-force approach worked well. The problem is that for it to work at scale, every object must be described from every possible angle, with variations for lighting, motion, and every other conceivable factor taken into account. It quickly became clear that the data required to correctly identify even a handful of objects would be impractically large.
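To see why, here is a toy sketch of template matching, one plausible reading of that brute-force approach; the 3×3 “tank” template and the blank scene are obviously hypothetical. It finds the object only when the stored view matches almost exactly, which is precisely why the template library had to explode.

```python
# Brute force in miniature: slide a stored template over the scene and
# flag every position where the pixels match within a tolerance.
import numpy as np

def match_template(image: np.ndarray, template: np.ndarray,
                   tol: float = 0.01) -> list[tuple[int, int]]:
    """Return (row, col) positions where the template fits within
    `tol` mean squared error."""
    th, tw = template.shape
    hits = []
    for y in range(image.shape[0] - th + 1):
        for x in range(image.shape[1] - tw + 1):
            window = image[y:y + th, x:x + tw]
            if np.mean((window - template) ** 2) < tol:
                hits.append((y, x))
    return hits

image = np.zeros((10, 10))
template = np.ones((3, 3))   # one stored view of the "tank"
image[4:7, 5:8] = 1.0        # plant that exact view in the scene

print(match_template(image, template))  # [(4, 5)]: found, but only
# because the scene matches the single view we stored; rotate or
# relight the object and this template stays silent.
```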