The New Kid and the Old Hand

A Cautionary Tale

I'm a Transportation Systems engineer, specializing in computer systems and programming for traffic monitoring and control. In other words, I write programs which are used to try to keep traffic flowing smoothly. I used to work for the Traffic Department in Chicago, but now I mostly do consulting work. I really like consulting; I get to travel to different cities, work on lots of different projects, work with lots of different people, and, best of all, they pay me lots of money. And since I'm in high demand, I get to pick and choose what I work on and when I want to work (can you say "six month vacation?").

Now, I'm not one of those "consultants" who come in for a couple of weeks, ask a bunch of silly questions, make everybody fill out stupid questionnaires, then write a report telling you what you need to do to improve your "process" or your "customer centric orientation" (which report then either sits on a shelf gathering dust or is used as an excuse for yet another reorganization). No, when I'm called in I am the process. I do the actual work. And I guarantee my work -- if you're not satisfied, I'll stay and keep at it until you are.

So a couple of years ago I took a job installing a monitoring system on a bridge on a major artery in Portland. The idea was to watch the traffic flow on the bridge, and if things ever got backed up, alarms would go off in the city's traffic control center, and they could dispatch a team to clean up the mess. It turns out that Chicago has just such a system in operation; Portland was buying all of the plans and software from Chicago, and they hired me since I'm familiar with the systems and software Chicago uses. I figured the job would be fairly simple since I had actually worked on the initial phases of the project in Chicago, and I knew the guy who had designed it, and I knew that it worked pretty well. All I had to do was make sure all the sensors got installed and hooked up properly, and adapt the software to Portland's computers. Piece of cake. No brainer. I could do it in my sleep.

So the Portland Traffic Department received all the plans and specifications from Chicago and started installing the sensors on the bridge while I worked on the software adaptation, writing tests and running simulations as I went. I got a little concerned as I dug into the code because it was radically different from what I remembered when I had worked on it in Chicago, but at least the core monitoring routines were largely intact as far as I could tell.

The whole thing proceeded fairly smoothly, and after only about 6 months of work we were ready for the first live tests. Flipped the switches, plugged in the cables, and we were on line. No problems (I do good work). It actually seemed to work; we were getting alarms when traffic backed up, and no alarms when traffic was flowing smoothly. Oh, there were a couple of glitches, a couple of false alarms, but that's only to be expected with a system of this complexity -- there are always parameters which need to be tweaked and sensors which need to be realigned or replaced.

Everybody was happy. Portland used to have a guy out there on the bridge during rush hour to monitor traffic (which has got to be one of the most boring and uncomfortable jobs in the world, especially when the weather turns bad), but now they could just sit back and relax in the comfort of the traffic center. They even started using the system for their traffic reports -- if the system said traffic was running smoothly, that's what they reported; if the system said there was a backup, that's what they reported.

Then things started going wrong. First it was the false alarms. The system would tell them there was a backup, they would dispatch a team, but when they got there traffic would be flowing just as smoothly as you please. So I kept raising the trigger thresholds. Then the system started missing traffic tie-ups. They would get a call from one of their roving monitors stuck on the bridge, and the system would insist that nothing was wrong. The worst one of these was when they got a call from the head of the Traffic Department; he had been stuck on an approach to the bridge for half an hour and wanted to know where the hell the response team was. So I lowered the thresholds back down. It got so I was changing the thresholds on a daily, then an hourly basis to get the system to react to what we knew was really happening on the bridge. Not a good situation. And we started getting false alarms in the middle of the night, when there was absolutely no traffic on the bridge. And finally the system started freezing on us; the only way to get it unstuck was to reboot the entire thing.

So I started adding some monitors to the code. Now, you have to understand that, fundamentally, monitoring traffic flow on something like a bridge is not that hard to do. You need some reliable sensors which can detect vehicles, and you need a method of collecting the data from the sensors and time stamping it; then every time you detect a vehicle entering or leaving the bridge you calculate the rate of traffic entering, the rate leaving, and the number of vehicles currently on the bridge. You use these values to tell whether or not there's a backup on the bridge. There are some special cases to deal with, plus you have to check the state of things even when there are no vehicles entering or leaving, but that's basically it. Not that difficult.
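
For readers who think in code, the basic scheme just described might be sketched roughly like this in Python. This is only an illustration of the idea, not the Chicago or Portland code: the class name, the method names, the 60 second rate window, and both thresholds are assumptions made up for the example.

```python
from collections import deque


class BridgeMonitor:
    """Count-based bridge monitor (illustrative sketch, all names assumed)."""

    def __init__(self, window_s=60.0, max_on_bridge=40, min_exit_rate=0.1):
        self.window_s = window_s            # length of the rate window, seconds (assumed)
        self.max_on_bridge = max_on_bridge  # occupancy alarm threshold (assumed)
        self.min_exit_rate = min_exit_rate  # exit-rate alarm threshold, vehicles/s (assumed)
        self.on_bridge = 0                  # vehicles currently on the bridge
        self.entries = deque()              # time stamps of recent entries
        self.exits = deque()                # time stamps of recent exits

    def _trim(self, now):
        # Drop time stamps that have fallen out of the rate window.
        for stamps in (self.entries, self.exits):
            while stamps and stamps[0] <= now - self.window_s:
                stamps.popleft()

    def vehicle_entered(self, t):
        self.entries.append(t)
        self.on_bridge += 1

    def vehicle_left(self, t):
        self.exits.append(t)
        # Clamp at zero so a missed entry detection cannot drive the count negative.
        self.on_bridge = max(0, self.on_bridge - 1)

    def rates(self, now):
        """Return (entry rate, exit rate) in vehicles per second over the window."""
        self._trim(now)
        return len(self.entries) / self.window_s, len(self.exits) / self.window_s

    def is_backed_up(self, now):
        # Can be called on a timer, so the state is checked even when no
        # vehicles are entering or leaving.
        _, rate_out = self.rates(now)
        return self.on_bridge > self.max_on_bridge and rate_out < self.min_exit_rate
```

The whole state is one integer and two short queues of time stamps. There are no per-vehicle records to create, match up, or clean up later.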

Every time the system detected a backup I had it dump what it thought the current state was -- traffic flow in and out, number of vehicles on the bridge, and so forth. I also set it up so I could dump out this information on demand, and during rush hour (when we had the most trouble with the system, naturally) I had someone out there on the bridge monitoring the situation. Whenever the monitor reported traffic problems I would dump the state. Then finally, for one 48 hour period I had it dump the state every five minutes.

We're talking a lot of information here. When I started wading through it all, some very strange things showed up. For example, the 48 hour monitor showed some high traffic rates between 3 and 4 a.m., when we should have expected almost no traffic at all. And then there were those negative vehicle counts -- there were times, particularly during rush hour, when the system would think there were a negative number of vehicles on the bridge. It just didn't make sense. So I redid the 48 hour experiment, but this time I had them place traffic counters on the bridge (you know, those boxes they chain to the telephone poles with the hoses taped across the road) so I could have something to correlate the dumps with. When I got the results from the traffic counters they made even less sense. I had expected that the counters and our system would report different numbers, with maybe a 5 or 10 percent difference, but these weren't even close. We were showing a 50%, 100%, sometimes even a 200% mismatch between the two. And those 3 a.m. traffic bumps -- they didn't exist. Total figments of our wonderful traffic monitoring system's imagination.

Well, I knew I needed help at this point. There had been two names all over the code we had received from Chicago -- the Old Hand, the original designer and the guy I had worked with before, and the New Kid, who was the one currently in charge of the project. The Old Hand had written most of the core monitoring routines, while the New Kid had written most of the newer stuff (the stuff that I had not recognized), plus he had modified the core routines to hook into his new stuff. I had not bothered much trying to understand either the new or the old stuff, since I assumed it basically worked, but clearly it was time to dig into it.

So I gave the New Kid a call. I explained who I was, what I was doing, and described the problems we were seeing (I was kind of hoping that he had seen and fixed them already). Of course at first he was rather defensive about the whole thing, and of course he denied that they had any such problems in Chicago. But I kept explaining that it was probably something I had done wrong in adapting his system to a different bridge and a different computer system, and that I was just trying to understand what I had done wrong. He did give me a couple of suggestions, and he did tell me that the 3 a.m. traffic was probably the result of the system's cleanup task which ran during low traffic periods (and hence, as I suspected, not really there), plus he gave me a very rough outline of how the system worked. He also described the system not as a traffic monitoring system but as a "vehicle monitoring system," which seemed rather strange to me.

So, based on what the New Kid had told me I started digging into the code, trying to find where the problems were. I concentrated on the core monitoring routines first, since I figured if they didn't work nothing else would. And I started noticing some strange things. The core routines dealt with vehicles, but the interfaces to the newer code all seemed to deal with "axles" and "wheels". There were complicated routines which seemed to translate vehicles into wheels and axles and tried to identify which wheels and which axles went with which vehicles. And each vehicle was assigned to a "lane", even though this information didn't seem to be used for anything.

For that matter, the original core routines seemed to have had no concept of vehicles beyond simple counts. I could see where the routines had been changed to accommodate wheels and axles and to keep track of individual vehicles, but I couldn't quite understand how it all fit together.

I added some more monitors to the code, especially at the vehicle/wheels/axles interfaces, and ran the tests again. What I saw made even less sense than my original data. I saw vehicles without wheels, wheels without axles, axles without vehicles, vehicles "stranded" on the bridge for hours, then it would all get cleaned up by the cleanup task. Sometimes, the cleanup task would run during high traffic periods, and then there would be negative vehicle counts and the system would go crazy. It was a complete mess.

It was time for desperate measures. So I called up the Old Hand and arranged to meet him in Chicago; over a couple of beers he told me the full story of this wonderful traffic monitoring system.

As I had remembered, when the system was first designed and installed it had worked wonderfully. Traffic tie-ups were detected almost flawlessly, there were no false alarms, they were even thinking of tying it into an automatic system to control traffic signals throughout the city.

But after the system had been working so well for a couple of years, the Chicago Traffic Department decided that they wanted more information from their monitoring system. They wanted to be able to monitor the traffic load on the bridge so they could predict when they would need to do maintenance. And they were thinking of installing a similar system on a four lane bridge (the bridge the system was designed for was two lanes).

So they hired the New Kid to do this work. He had good credentials, with training in Transportation Systems and Computer Science; he had studied "smart highways" and was up on all the latest object oriented programming techniques. By way of contrast, the Old Hand was trained as a chemist. Go figure.

Apparently, the New Kid looked at the system, looked at the new requirements, and decided that the thing to do was to implement a complete vehicle identification and monitoring system on the bridge; that way he could detect a vehicle when it entered the bridge, identify its characteristics (including its presumed weight), and track its progress across the bridge. By monitoring each vehicle, he could (presumably) tell when traffic was stalled, and even (again presumably) which vehicle it was. I'm told that he predicted that he would be able to dispatch the response team with the problem vehicle's make and model.

In order to do the vehicle identification, he needed more information than the old vehicle detection sensors could give him, so he decided to install new wheel detection sensors on the bridge. These sensors are actually pretty neat; they can detect the wheels of a vehicle as it passes a point on the road, and tell you how many wheels are on each axle. Unfortunately there are some problems with them -- for example, what happens if a wheel bounces off the road at the detection point, and how do you differentiate a wheel bounce from a three wheeled vehicle, and how do you deal with motorcycles and bicycles? For these and other reasons most installations have a series of these detectors, and a vote is taken amongst the detectors. I've seen some very successful installations of this sort. But a series of wheel detectors doesn't work well when the traffic density is high, and it doesn't work well when the traffic speed is low, both of which conditions happen routinely on a bridge in a dense urban area. So the New Kid decided to try something new -- he was going to use the existing vehicle detection sensors in conjunction with a single set of wheel detection sensors; by using the data from both of them he would be able to fill in anything missing from either of them.
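
The vote taken amongst a series of detectors, as mentioned above, can be sketched as a simple majority rule. This is a guess at the general technique, not a description of any particular installation; the function name and the choice of a strict majority are assumptions for illustration.

```python
from collections import Counter


def vote_wheel_count(readings):
    """Return the wheel count a strict majority of detectors agree on.

    `readings` holds one wheel count per detector for the same axle
    (hypothetical input format). Returns None when there are no readings
    or no value has more than half of the votes.
    """
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    return value if count * 2 > len(readings) else None
```

With three detectors, a single bounced wheel is outvoted: readings of 2, 2, and 1 come out as 2. With only one set of detectors, as in the New Kid's design, there is nobody to outvote a bad reading.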

So that's what he did. Or tried to do. He designed wheel objects, axle objects, vehicle objects, and constructed a "tracking object" which was built from all of the above; then he constructed an elaborate database system to accumulate the information contained in these objects. He modified the core monitoring routines to use tracking objects instead of simple vehicle counts. Then he installed it all and turned it on.

The problems began almost immediately. Because the vehicle detection sensors are faster than the wheel detection sensors, sometimes the system would be tracking vehicles for which it had no wheels. Sometimes it would have wheels for which it had no vehicle. Sometimes a vehicle would enter and exit the bridge before the system could detect its wheels. Sometimes the wheels would be detected on entrance but not on exit; sometimes it would be the other way around. The result was that unattached objects tended to accumulate in the system, which ended up crashing repeatedly.

To fix these problems the New Kid added the cleanup task to get rid of these accumulated objects. Unfortunately the cleanup task sometimes did its job too well: objects would get deleted which were still in use. This resulted in corrupted data, so he added all sorts of extra checks to the code to make sure of the integrity of an object before he tried to use it.

It also became very difficult to trace the operation of the system. All of the various pieces of the system communicated (in true object oriented fashion) via messages of various sorts. There were soon so many messages flying around that the messaging system had to be expanded. Plus, if a subsystem received a message it didn't expect, it was almost impossible to figure out where the message came from.

Whatever the problems, by patching here and tweaking there, adding flags and sanity checks, the New Kid was able to get the system limping along well enough to go online. Or so they thought. So they plugged everything in and let it go, and sure enough it seemed to work, and the Traffic Department started getting monthly maintenance prediction reports, and the New Kid was a hero.

But then the false alarms started. First during the post noon lull, but then in the middle of the night. Sound familiar? So the New Kid started raising the thresholds until the false alarms went away, but then the system started missing traffic tie-ups. Sound familiar? At this point I asked the Old Hand how they worked around the problem. "Oh, that," he said. "They placed a video camera on a nearby building, and fed the signal into the central traffic control center, and during the day a guy sits and watches the monitor and reports any problems. He can even tell the response team the make and model of the problem car. Hell, he can even tell them how many feet from either end it is. Only cost them a coupla hundred thou to place the camera and route the cable, and watching the monitor is a real cushy job for someone with a lot of seniority who doesn't want to go out on the road any more. So the union is happy, management is happy cuz they get their high tech maintenance predictions, and basically the system works. But you know, I don't believe management knows about the camera thing; I think they snuck it in as part of some general traffic monitoring system."

So when Portland went looking for a traffic monitoring system for their bridge, Chicago had this wonderful system all up and working that they were very eager to sell. Of course they didn't mention the camera business; I'm still not sure if they didn't know about it or if they were hoping that if they sold it to someone else, maybe that someone could actually get it to work.

In any case, once I heard this story I knew exactly what I had to do. I flew back to Portland, and, with the help of the Old Hand's notes (he gave me a copy), started ripping things out. I got rid of the wheel objects, canned the axle objects, trashed the vehicle objects, went back to using simple counts from the vehicle detectors to keep track of vehicles on the bridge. Then I shot the cleanup task; without all those objects floating around it had nothing to do. It took me only a couple of days to do all this, and when I was done I put it together, plugged it in, and let it run. Worked almost flawlessly the first time. A couple of false alarms, but I adjusted the thresholds and the problems went away. No more false alarms, no more missed tie-ups.

But I didn't stop there. There was all of this data accumulation and report generation code floating around in there, and it was all perfectly good (the New Kid had done a good job on that part), so I decided to use it. Using the wheel sensors, I started accumulating data on the number and types of vehicles on the bridge, but instead of trying to positively identify and track each vehicle, I just lumped the data into "small", "big", and "really big" piles, then labeled the piles "cars", "trucks", and "semis". It may not be 100% accurate, but it's pretty close to what actually drives across that bridge. So now I can give the Traffic Department reports not only on traffic flow, but also the approximate makeup of that traffic. The Planning Section loves this stuff. And it turns out I can give them maintenance predictions too, based on the total vehicle counts and the approximate vehicle mix.
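
The lumping described above might look something like the following Python sketch. The story does not say what the piles were keyed on, so this assumes each wheel sensor reading has already been reduced to an axle count, and the cutoffs of two and four axles are invented for the example.

```python
from collections import Counter


def classify(axle_count):
    """Lump one reading into a coarse pile (cutoffs are assumed, not from the story)."""
    if axle_count <= 2:
        return "cars"    # the "small" pile
    if axle_count <= 4:
        return "trucks"  # the "big" pile
    return "semis"       # the "really big" pile


def traffic_mix(axle_counts):
    """Tally how many readings landed in each pile."""
    return Counter(classify(n) for n in axle_counts)
```

For example, `traffic_mix([2, 2, 3, 5, 2, 6])` tallies three cars, one truck, and two semis. A misread here just puts one vehicle in the wrong pile; it cannot leave an orphaned object behind or upset the traffic counts that drive the alarms.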

So my reputation is intact, and Portland is happy with their new system, and I even heard that Portland is trying to sell their "improvements" back to Chicago.

Now, I'm not trying to knock the New Kid or his abilities. From what I can tell, based on the code he's written and on the few conversations I've had with him, he's very enthusiastic and very energetic, and he's certainly very confident of his abilities. He definitely understands object oriented programming as well as defensive programming techniques, and he's certainly not afraid to apply them. But there's a point where you need to stand back and really understand what it is you're trying to do, what problem you're trying to solve. The goal isn't to make the most object oriented or bullet proof code possible; the goal is to make a system which works and does what the customer wants.

In the meantime, I hear that Chicago has already placed the camera mounts for their new four lane bridge.