NWAPW Year 2

Technical Topics


(Back to Part 2)

Part 3: What's Next?

If you digested the previous two pages (and their links) you are now ready to make a contribution: What do you think? Where do you want to go with this?
 

Implementation Strategy

NN. Everybody who's talking about their autonomous car implementation or visual object recognition is talking "neural nets" (NN). It obviously gives you the most bang for the least amount of code, but -- I call this "The Conservation of Complexity" -- you are really trading off code complexity for user complexity, in this case training materials. Where are you going to find video clips (or even still images) already tagged with identifiable pedestrians to train your NN? I don't know of any, but maybe you can find or create them from the ImageNet database or the Udacity Annotated Driving Dataset. You need thousands (or at least hundreds) of them.

Another difficulty is the processing time to run the data through the NN in real time. Image-processing NNs typically allocate a single neuron to each pixel, and then repeat that in each internal neuron layer. A 640x480 video feed has some 300K pixels, which means you might be driving a million or more (lots more, if you go for "deep learning") neurons. How fast does your average PC process the code for one neuron? If each neuron considers only a hundred inputs, that alone is a hundred multiplications and additions, not to mention comparing and stepping. Assuming a pipelined processor (most modern CPUs are), that's one clock per operation, perhaps 100ns per neuron; a million neurons is then 0.1 seconds per frame, about ten frames per second if your processor does nothing else at all. YOLO ("You Only Look Once") is promising 45 fps (155 for its fast variant), but that's on a GPU. Perhaps you can figure out how they did it by reading the published paper; in any case the video results are impressive.
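To make that arithmetic concrete, here is a minimal sketch in Java (made-up sizes, not any particular NN package) of the inner loop of one neuron, with a driver loop pretending to be a million of them. Run it and see what your own machine does with a "frame":

    // Minimal sketch of one artificial neuron's inner loop. The sizes
    // (100 inputs, a million neurons) are illustrative assumptions.
    public class NeuronCost {
        static double neuron(double[] inputs, double[] weights, double bias) {
            double sum = bias;
            for (int i = 0; i < inputs.length; i++) {
                sum += inputs[i] * weights[i]; // one multiply + one add per input
            }
            return sum > 0 ? sum : 0;          // simple ReLU-style threshold
        }

        public static void main(String[] args) {
            double[] in = new double[100];     // ~100 inputs per neuron
            double[] w = new double[100];
            java.util.Arrays.fill(in, 0.5);
            java.util.Arrays.fill(w, 0.01);
            long t0 = System.nanoTime();
            double check = 0;
            for (int n = 0; n < 1000000; n++)  // pretend: a million neurons
                check += neuron(in, w, 0.1);
            long t1 = System.nanoTime();
            System.out.printf("%.1f ms per \"frame\" (checksum %.2f)%n",
                (t1 - t0) / 1e6, check);
        }
    }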

So one option is NN.

Existing Image Code. Another possibility that has been mentioned in my hearing is using and customizing existing image-processing code to pick out objects. I'm not familiar with what's available, so I cannot comment on what it does or how, but one certainty is that packaged software is generally designed to be feature-rich, which means it has a lot of code to do things you don't need: it will be very slow. But there is something to be said for not re-inventing the wheel. The downside is that somebody else invented that wheel, so they know how it works and you don't -- which means when the right solution is to invent a wheel "the same only different," people will ask them to do it, not you, because you don't know how.

But adapting existing code is an option you should consider.

Roll Your Own. The alternative is to write your own code designed to find credible pedestrians in the raw image data. If you concentrate on blocks of solid color, it's not too hard to handle the live video feed in a small number (dozens, not hundreds) of instructions for each pixel. Unlike somebody else's code, what you write will be custom-tailored for just the problem you are looking to solve, thereby making it efficient.
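To give you the flavor, here is a minimal sketch of the kind of cheap per-pixel test I mean: flagging pixels that continue a run of nearly-uniform color along a row. The packed-RGB pixel layout and the threshold value are illustrative assumptions, not a design:

    // Sketch of a cheap per-pixel pass: mark pixels whose color is close
    // to their left neighbor. Pixel format 0x00RRGGBB (row-major) and the
    // threshold of 32 are assumptions for illustration only.
    public class ColorRuns {
        static final int THRESH = 32;

        static boolean[] sameAsLeft(int[] pixels, int width, int height) {
            boolean[] run = new boolean[pixels.length];
            for (int y = 0; y < height; y++) {
                for (int x = 1; x < width; x++) {
                    int i = y * width + x;
                    int a = pixels[i], b = pixels[i - 1];
                    int dr = Math.abs(((a >> 16) & 0xFF) - ((b >> 16) & 0xFF));
                    int dg = Math.abs(((a >> 8) & 0xFF) - ((b >> 8) & 0xFF));
                    int db = Math.abs((a & 0xFF) - (b & 0xFF));
                    // a handful of subtracts and compares per pixel --
                    // dozens, not hundreds, of instructions
                    run[i] = dr + dg + db < THRESH;
                }
            }
            return run;
        }
    }

Grouping the flagged pixels into candidate blocks would then be a second pass over far fewer pixels, because most have already been ruled out.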

Another advantage is that you wrote the code, it does exactly what you told it to do, so if it fails -- make that when it fails -- you can find what went wrong and fix it because you understand every part of the program and the theory behind it.

Besides, it's really satisfying to be able to say at the end when it's running, "That's my/our code."

I've done a lot of embedded system code over the years, so I know this can be made to work. I know where the potholes are, so if you have trouble, I can bail you out and get you going again. Nobody can promise 100% probability of success, but I think 90% is credible in the time frame we have this summer.

So doing it all yourself is one of the options.

Does that sound like a hard sell? It shouldn't. You need to own the project, because you are going to design and write whatever code is involved, not I, so it needs to be something you want to do. If I don't have the necessary experience to support you, we will find somebody who does. That's a promise from Steve, and he knows how to make it happen.

There may be other options I didn't think of. Use the reply form below to choose one of these strategies, or to suggest something else.
 

Implementation Parameters

After you/we decide what your over-all strategy is, you might start to think about what limits you might want to set on what you are doing. How much resolution? Remember, the more pixels you have, the longer it takes to do one frame. Computers are not infinitely fast -- you probably know that from video games: the better the image quality, the lower the frame rate (unless the developer is really clever, or you spend a ton of $$ on a video board, or both ;-) You need to think about the data rate no matter what strategy you choose.
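Here's that back-of-envelope data-rate arithmetic in code form (the 2GHz clock is an assumed round number, not a measurement of your machine):

    // Back-of-envelope pixel budget: how many clock cycles you get to
    // spend per pixel. The 2 GHz clock rate is an illustrative assumption.
    public class FrameBudget {
        public static void main(String[] args) {
            long pixels = 640L * 480;               // pixels per frame
            long fps = 30;                          // frames per second
            long perSec = pixels * fps;             // ~9.2 million pixels/sec
            double clocksPerPixel = 2.0e9 / perSec; // cycles available per pixel
            System.out.println(perSec + " pixels/sec, ~"
                + Math.round(clocksPerPixel) + " clocks per pixel at 2 GHz");
        }
    }

A couple hundred clocks per pixel sounds generous until you remember everything else the machine is doing -- and doubling the resolution in each direction leaves you a quarter of it.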

Anyway, with smart phones driving the technology, high-res cameras are getting quite affordable, but a smart phone may not be the best tool for implementing your project -- PCs are generally faster and easier to debug. 320x240 and 640x480 surveillance cameras are reasonably priced and plug directly into a computer USB port. You could even use the built-in camera on most laptops, but programming it might be tricky (the APIs are not as public as those of 3rd-party cameras), and getting a stable mount while driving might be awkward.

Use the reply form below to tell us what you think the video input should be for whatever strategy you have chosen.

Or offer your thoughts on just one of these issues at a time. If you have more thoughts tomorrow or next week, that's OK, share them. More is better. The really good contributions -- or better links than mine, if you find them -- I'll go back and work into the text. The rest should show up here in the comments section within a day or so:

We have a lively discussion on NNs, so I moved it to a separate forum here. We can keep this forum for the folks who want to get their hands on real code.



 

Comments (moderated)

Jonathan wrote [2017 Jun 12]:
I'm sending this to you since the comments box seems to have a character limit below what I wanted to post. Hopefully that's not too much hassle!

In our CS class, we did our handwritten digit processing using about 1800 images, but I get the point: a lot of data is needed to train neural nets. The thing about the data we used is that we didn't collect it ourselves: it was preprocessed data that was freely available on the web. I'm curious if any such data might be available for our project -- if not online, perhaps from a company with experience in this. Maybe Polysync? Even outside of that, it might be useful to talk to them to see what their software stack is like, roughly.

Another cool thing about the data we used was that it was preprocessed for us. Part II discussed some of the processing steps we'd need to go through -- neural net or no -- before digesting the data. It seems that writing that processing ourselves would likely be time consuming and not that much fun. You linked to a video by OpenCV (a library I'm somewhat familiar with from robotics); the tools it provides are from my experience quite versatile and useful with minimal overhead (including the preprocessing and grabbing webcams with very little hassle). They've also got OpenCL/CUDA support for GPU processing. AMD and NVIDIA have both been pushing their GPUs for neural nets; I can't say I'm experienced enough to know if there'd be a performance bottleneck there though.

I think in general, using existing tools will allow us to do cooler things faster. Of course, that isn't to say we shouldn't understand how those tools work: how else would we utilize them properly?

-Jonathan E.


Tom Pittman replied [2017 June 14]:

The character limitation seems to be a problem in the client browser. From my computer Jonathan's entire text came through just fine, but OSX/Safari does indeed have a 1K limit. The usual response I encounter for this kind of problem is for the implementor to say "Get a better computer," but that option was never really viable, and in this case is not even available. When you use other people's tools, you tend to be stuck with what those tools do for you.

Email clients apparently have fewer limitations than browsers, and it all arrives in my email inbox the same. Use whatever works for you.


Tom Pittman replied [2017 June 17]:

Steve asked that you decide which of two technologies to work on, and to decide in the next four days. You might feel like you don't have enough information to choose one or the other, so I'll try to fill in some of the gaps.

The two technologies are conventional (designed) code vs neural nets (NN).

Code means thinking about the problem of identifying pedestrians, then writing a program or programs in Java to do it.

NN uses a very tiny code engine, and most of your effort will be preparing the data images for training it, and tweaking the formulas and perhaps the configuration so it learns better, but the code is essentially unchanged. You would still be "programming" but it's at a very high level and doesn't feel like coding at all.
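To show how tiny that engine is, here is a sketch of a complete forward pass -- three nested loops and a squashing function. The layer shapes and the sigmoid are illustrative assumptions; training (adjusting the weights) is a separate piece, and that's where the real work hides:

    // Sketch of the entire "engine" of a feed-forward NN: the forward
    // pass is just three nested loops. Layer sizes and the sigmoid
    // activation are illustrative assumptions.
    public class TinyNet {
        // weights[layer][neuron][input]; biases[layer][neuron]
        static double[] forward(double[] x, double[][][] weights, double[][] biases) {
            for (int L = 0; L < weights.length; L++) {
                double[] next = new double[weights[L].length];
                for (int n = 0; n < next.length; n++) {
                    double sum = biases[L][n];
                    for (int i = 0; i < x.length; i++)
                        sum += weights[L][n][i] * x[i];
                    next[n] = 1.0 / (1.0 + Math.exp(-sum)); // sigmoid
                }
                x = next; // output of this layer feeds the next
            }
            return x;     // output layer activations
        }
    }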

The first thing the Code group needs to do is decide how many pixels and what frame rate to run at, and agree on the algorithm to pick out of the images what might be pedestrians, then start coding. It's all software, and the tools are available now and known to work together. The work will be similar to other programs you've written but with more attention given to how your choice of methodology affects performance, then dividing the project into subtasks so groups of two or three can code and test their work in parallel before you put it all together into a system.

The first thing the NN group needs to do is decide whether you want to compete with the Code group, or take up Steve's suggestion to use the NN for track following (steering the car).

Jonathan seems to think -- correct me if I'm wrong -- that the pedestrian data is available. If that's true, then the option to compete and see which technology is best at identifying pedestrians may be viable. If not, the time and difficulty to create the required training image set makes that choice problematic.

Track-following image data you can pull off of a video camera while driving around the track, and you can tag it for training purposes automatically by recording the steering wheel in the same video. At least one other autonomous vehicle project trained its NN this way -- the guy admits as much.
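The idea in code, as a sketch -- grabFrame and readSteeringAngle here are hypothetical placeholders for whatever camera and sensor interfaces you actually end up with, not a real API:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of self-labeling training capture: pair each frame with the
    // steering angle at the moment it was grabbed; the angle IS the label.
    // grabFrame() and readSteeringAngle() are hypothetical stubs.
    public class SelfLabeler {
        static class Sample {
            final int[] pixels;
            final double steeringAngle;  // the label, captured automatically
            Sample(int[] p, double a) { pixels = p; steeringAngle = a; }
        }

        static List<Sample> capture(int frames) {
            List<Sample> data = new ArrayList<Sample>();
            for (int f = 0; f < frames; f++) {
                int[] img = grabFrame();             // hypothetical camera call
                double angle = readSteeringAngle();  // hypothetical sensor call
                data.add(new Sample(img, angle));
            }
            return data;
        }

        static int[] grabFrame() { return new int[640 * 480]; } // stub
        static double readSteeringAngle() { return 0.0; }       // stub
    }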

The NN group needs to decide before the end of June whether you are doing pedestrians or track-following, or you probably won't have much to show for your efforts at the end of summer but good vibes. That effort is not going to be a four-week frolic in the park; you will need all the time you can put into it, mostly doing non-coding things, but which done well can be very rewarding.

I'm here to support you, whichever group you choose. I understand the code issues along both paths, and Steve has the management experience to help point you in the right directions for the non-technical issues. Whichever way you decide, it's going to be challenging and fun.

If you still have questions, ask! Post them here (other people might have the same questions), or if you prefer, send me a private email (tell me it's private so I don't post it), and I'll answer in private. I'm here to support you.


Sofie wrote [2017 June 20]:

Are we going to be using any particular libraries for this project? Also, I want to know more about which programming concepts will be useful for this; I'd like to do a little research. I'm still a little bit unsure what is the best direction to take right now. Thank you,

Sofie Christie

Sofie wrote [one hour later]:
Some questions I have about some of the information you had:

I'd like to know more about how the human eye uses parallel processing; I found that note a little confusing. What allows us to spot patterns better than computers?

Could you explain data in the frequency domain more? And why does luminance need to be deleted from data analysis?

You relate color to sound when explaining Fourier Transformations, and I'm still a little bit confused on that relationship.

Could you explain how the design of DSPs leads to algorithms that involve a lot of multiplications?

There are still some other things that I don't understand the language of completely. Is there a good place for understanding some of the terminology you use better?

Sofie wrote [one hour later]:
I'm interested in rolling my own code, but I think it will be useful to look at other examples so they can be analyzed. This would be for identifying the pedestrians. I do want to get more familiar with the language though, so I'd like to know more about what will be important in the code. I think I'll follow your advice about surveillance cameras, if we can access one.

I am a little bit nervous about the idea of NN, but it seems to have many benefits. I'm not familiar with this idea, but would you be teaching us how to use it? What skills would be useful to have if we were to take on this project?


Tom Pittman replied [2017 June 21]:

Sofie asked some good questions; let's see if I can answer some of them.
Are we going to be using any particular libraries for this project?
The short answer: Only if you want to.

Except for complicated or esoteric stuff like math and user interface, it often takes almost as much effort to use somebody else's library as to write your own, but your mileage may vary.

Also, I want to know more about which programming concepts will be useful for this, I'd like to do a little research. I'm still a little bit unsure what is the best direction to take right now.
That depends mostly on what you and the rest of the team decide. Mostly I'm thinking the coding skills you already have -- arrays, for-loops, method calls -- will serve you well. If you decide to go with NNs, I probably can't be much help. They look simple, but if you just do what the NN proponents tell you, it doesn't work; there's some secret sauce they aren't telling us about. I would blame it on entropy, but only because I couldn't make my attempt work. YMMV.
I'd like to know more about how the human eye uses parallel processing
Me too. And so also half of the biologists out there. I think it was when I was taking psychology in college that they told us about opening up a frog and measuring the nerve impulses in the optic nerve, from the frog's eye to its brain. In the quiescent state there was a small amount of activity, perhaps just enough to let the brain know that there's a scene out there, nothing to worry about. The activity increased when something entered the scene, a lot more if a large dark object entered from above (somebody was about to step on the frog, time to leave), and a huge amount of nerve activity if a small dark object appeared directly in front (lunch!). All this was happening in the eye, before any signals got to the brain. AFAIK, nobody knows how that works.
Could you explain data in the frequency domain more?
Not really, but there is a lot of stuff on the internet. Follow the links I put out, then pick out terms you don't understand and Google them. You have to wade through a lot of irrelevant stuff, but the good stuff is there. Google puts a lot of research into making the good stuff filter up to the top of their hit list (right under the paid ads ;-)
You relate color to sound when explaining Fourier Transformations
That's frequency domain. Sound is a repetitive pressure wave, up and down, up and down. Some images are repetitive (either horizontally or vertically, or both), light and dark, light and dark, so you can use the same algorithms for analyzing them that the audio engineers use for sound. I think it's just an academic exercise, because nobody seems to find any use for it.
Could you explain how the design of DSPs leads to algorithms that involve a lot of multiplications?
I think it's the other way around. There are a lot of multiplications in sound and image processing, so the people who design digital signal processors need to do the multiplications very fast. Then because the DSPs do fast multiplications, the software people feel comfortable doing a lot of multiplication. Parkinson's Law ("expenses tend to rise to meet or exceed income") applies: the more multiplications your DSP offers you, the more you will use them.
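For example, the workhorse of audio DSP is the FIR filter, which is nothing but one multiply-accumulate (MAC) per filter tap per sample -- exactly the operation DSP chips are built to do in a single cycle. A sketch:

    // Sketch of a FIR filter: one multiply-accumulate per tap per sample.
    // With 100 taps and 44100 samples/sec, that's ~4.4 million MACs/sec
    // for audio alone -- the operation DSPs are designed around.
    public class Fir {
        static double[] filter(double[] signal, double[] taps) {
            double[] out = new double[signal.length];
            for (int n = taps.length - 1; n < signal.length; n++) {
                double acc = 0;
                for (int k = 0; k < taps.length; k++)
                    acc += taps[k] * signal[n - k]; // one MAC per tap
                out[n] = acc;
            }
            return out;
        }
    }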
Is there a good place for understanding some of the terminology you use better?
You are doing it: Ask questions. The worst that can happen is I might say "I don't know." Then you might have to go ask Google. Google knows everything, but it's sometimes hard to think up the right way to ask. Half of all Google searches fail, even when experts are asking, even when the data is there. You just gotta keep trying. Sometimes what you did find includes some search terms you didn't think of. But I try not to use a term I cannot explain, except maybe to say so.
I'm interested in rolling my own code, but I think it will be useful to look at other examples so it can be analyzed.
I agree. I will try to make sure you have appropriate examples to look at when we get there. First we -- you and the rest of the team -- need to decide what you want to be doing. We talked a little about that at the meeting in March. We can do more here, in this forum.
I am a little bit nervous about the idea of NN, ... would you be teaching us how to use it?
I'm a lot nervous about NNs. I found on the internet how to code it, and I did that, but I did not find how to make it actually learn, just a lot of hand-waving (that's mathematician-speak for "I can't explain it, but I don't want to admit it").


Steve Edelman wrote [2017 June 21]:

Sofie,

I'm going to chime in on a couple of your questions too and elaborate a bit where I think it might be helpful.

As far as libraries go, the thinking is that we can code everything necessary to achieve the objective ourselves. The advantage of that is when stuff doesn't work, since we wrote it, we can fix it. With other people's code you run the risk that when there is a problem you have to wait for them, which given the schedule could easily kill the project. So could you use libraries? Sure. Is it a good idea? That depends upon whether there are going to be problems. Since you can't know that, at least without an investment of time to find out, one of the first things the team will do once the parameters of the problem have been decided upon is divide the work into tasks and size the amount of work. Based on that, one makes a build/buy (or use-someone-else's) decision. It will be up to you guys. We're here to give advice to the extent you want it.

As far as the frequency domain goes: this is one of those concepts that can be hard to grasp, but once you do, the sky opens and it seems entirely obvious and natural. My suggestion is to watch the videos and see if they help. If they don't, not to worry, because we can cover this one of the days during lunch. Everyone gets it; it just takes a different way for it to click with different people.

One further suggestion. Stick with sound for now. Trying to understand the FT as it relates to images is almost impossible until the idea clicks, but once it does it will be trivial. So spend your time with the videos that talk about how music works. They use the terms harmonics or overtones to explain why the same note sounds different on different instruments, or how you can make a square wave out of a bunch of sine waves. See if you can make sense of that.
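If code clicks for you faster than videos do, here is a minimal sketch of that last idea: sum the odd sine harmonics (1, 3, 5, ...) at amplitudes 1/k and watch the result flatten out toward a square wave as you add more of them:

    // Minimal sketch: build a square wave from odd sine harmonics at
    // amplitudes 1/k. Raise `harmonics` and the wave gets squarer.
    public class SquareWave {
        public static void main(String[] args) {
            int harmonics = 9;                   // try 1, then 9, then 99
            for (int i = 0; i < 32; i++) {
                double t = 2 * Math.PI * i / 32; // one cycle in 32 steps
                double sum = 0;
                for (int k = 1; k <= harmonics; k += 2)
                    sum += Math.sin(k * t) / k;  // odd harmonics only
                System.out.printf("%6.3f%n", sum); // flattens toward +/- pi/4
            }
        }
    }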
 

As for the terminology, please post the specific terms you are having trouble with. I'm sure others are having the same problem. The issue from our end is that we don't know which terms are going to cause confusion, because we don't know which ones you will have encountered before, so it's a balancing act on our end that there is no way for us to get exactly right. Rather than get all wordy on you guys, Tom has kept it tight, assuming that whatever isn't clear you'd ask about.

It's possible that a whole lot of what's here isn't clear, and that's fine. Ask away and we'll explain. So start with those terms you aren't getting and we'll take it from there.


Spencer wrote [2017 June 21]:

Would it be appropriate to use an FPGA to handle some of the image processing? It would be able to perform some of the operations much more time-efficiently than a microprocessor.

I haven't worked with them before so let me know.

Spencer Hutchinson


Tom Pittman replied [2017 June 21]:

I had not thought of Field-Programmable Gate Arrays (FPGAs) in this connection, but now that you mention it, I doubt if they would confer much benefit. FPGAs are very good for time-critical applications where you are doing a lot of complicated logic but no arithmetic. Image processing tends to do a lot of multiplication and addition, not very large numbers but repeated over thousands of pixels. A vector-processing DSP (Digital Signal Processing CPU) would be more appropriate if we had too much data and too tight a time schedule to do it in a conventional computer. Unless we are overly sloppy, I don't think we are likely to exceed our CPU budget.

I did a FPGA project many years ago where I pulled the data out of memory and converted it into time-division color channels to generate an NTSC (old analog) TV signal complete with timing signals on the line and frame ends, color burst, stuff like that. It was all on or off, so there was no math to do, just decide when the signal had to go up or down.


Steve Edelman wrote [2017 June 22]:

I'm going to second Tom's answer but from a different perspective. I've done a number of projects using FPGAs, including extremely large projects with millions of gates. The main problem with them is that the debug time is long, so you really only want to use hard-coded logic like this when everything else is too slow.

Think of microprocessors as being at one end of the spectrum in terms of speed (slow) and ease of programming (easy). Hard-wired logic is all the way at the other end, very fast and extremely hard to program (or debug). There is a whole lot of stuff in the middle, including GPUs and, as Tom mentioned, DSPs. The reason these exist is that somebody was trying to take what used to require hard logic and create a solution that had the necessary performance to replace it but that was much easier to program. Hence GPUs replaced the logic that used to be used in video cards, and DSPs replaced the logic that was used for all kinds of stuff, in particular processing audio.

However, I'd like to make one more point regarding your question. You ask if it would be appropriate to handle some of the image processing in an FPGA. Let me restate that to ask: would it be appropriate to handle some of the image processing someplace other than the main CPU? The best answer I can give is "it depends."

For our purposes, which is to say a first pass at a solution, the answer is probably no. If we can achieve the desired outcome using the CPU, then there is no reason to add any additional hardware, regardless of whether it's a GPU, DSP, or FPGA, since any of those adds cost and complexity to a system that is already performing the required function.

But the reason we are going to be able to achieve this is that we are using a very expensive CPU. What if this was going in a commercial product? Might the total system cost be less if we opted for a far less powerful CPU combined with a DSP? The answer would likely be yes, since Intel charges a large premium for their performance over CPUs that don't use their instruction set. Likewise, DSPs are manufactured by the gazillions for use in all kinds of stuff from headphones to cellphones, so they are cheap not only because of economies of scale but because they are designed to do some very specific tasks using only a very small amount of silicon area. So if you are going to make a million or a billion of something, you can and should spend a lot of money up front to design something so each one you build is very cheap. But if you aren't going to build very many, then it's better to minimize the cost of design even if that means the cost of each unit is more.

Having said that, one can in fact have it both ways. You design your system for ease of implementation, but build into your architecture how you might redo it if whatever you're building is a wild success and somebody comes and asks you what it would take to cut the cost by, say, two thirds. A software engineer who has done that is going to be a real hero. So as we develop this system, making it modular and understanding where the bottlenecks are and how the tasks might be offloaded, so that the entire system would be a lot cheaper to build, is certainly a very laudable goal.
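One concrete way to leave that door open -- a sketch, not a prescription -- is to hide each processing stage behind an interface, so a stage can later be swapped for a DSP- or GPU-backed version without touching its callers:

    // Sketch: each pipeline stage behind an interface, so any stage could
    // later be reimplemented on cheaper/faster hardware without changing
    // the rest of the system. The names are illustrative only.
    public interface ImageStage {
        int[] process(int[] pixels, int width, int height);
    }

    class CpuSmooth implements ImageStage {
        public int[] process(int[] pixels, int width, int height) {
            // plain-Java version: easy to write, easy to debug
            return pixels; // (real smoothing code goes here)
        }
    }

    // Later: class DspSmooth implements ImageStage { ... } -- the callers
    // never know which one they got.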

Development time, power consumption, and cost are the three big variables in a project. You never get to minimize all three at once; it's hard to minimize even two. For this project we are only going to try to minimize one, development time, but that doesn't mean one can't be thoughtful so that it's a lot easier to come back and work on the other two later.


Jonathan wrote [2017 July 5]:

In Part 4, several specific cameras and their APIs are detailed, and it seems like these are the ones that are currently intended for use with this project. Why these cameras?

-Jonathan E.


Tom Pittman replied [2017 July 5]:

1. They had sufficient resolution (see Part 4 Image Requirements).

2. They could easily interface to a computer.

3. There was documentation and support to enable writing a Java wrapper so it would be easy to interface to.

4. They were light so they wouldn't bounce around too much. And rugged.

5. There was a wide choice of lenses available, including zoom lenses so we could choose our viewing field.

6. The manufacturer had good support, which we tested by asking them numerous questions before purchasing.

7. They were reasonably priced.

Bonus. After we purchased it we discovered that one of the open source autonomous vehicle projects chose the same camera and designed their own mount for it. See https://www.udacity.com/self-driving-car. While using the same camera is not essential, it could be helpful. If nothing else it probably validates our decision as to the specifications necessary, resolution in particular.


A lively email discussion erupted on the topic of NNs, and it was suggested I should post the relevant parts here for everybody to read, starting with our APW-2 Teaching Assistant, Merlin Carson. It has all been moved here.
 
 

Your comments:


Tom Pittman

Next time: Hardware Considerations

Rev. 2017 July 8