The curious case of the dancing antennae



Misbehaving buffer pointers, whose effects threatened to create a fatal project setback, were identified via a clever software subdivision technique.

In the 1990s, I was working as a motion control engineer for the Giant Meter Wave Radio Telescope Project (GMRT). The radio telescope consists of 30 giant meter wave antennas, each a parabolic dish 45 meters in diameter. The motion control electronics (i.e., the control computer and power electronics) were located inside a control room within the supporting tower below each antenna. The servo computer received motion control coordinates from a master computer situated in a central building via an optical fiber link.

After the first two prototype antennae were commissioned, the radio astronomers started using them for their observations. Whenever a celestial object is to be observed, the antenna has to move in opposition to the earth’s motion in order to remain focused on the object under observation. After a few weeks I received a phone call from the security guard manning one of the antennas. “The antenna was dancing madly,” the guard said. “I had to shut off the power supply! Please come down and investigate.” I reached the project site only to discover that the problem could not be reproduced.

This story repeated itself every few days, for both antennae in turn. The control system development team blamed the “dancing behavior” on erratic fluctuations in the rural electricity grid, suggesting that “if your grid bus voltage dances madly, the antenna will do the same.” However, I couldn’t “buy” this explanation. If it had been true, repeatedly switching the power on and off should have reproduced the problem. But it didn’t.

The developers then handed me a 2,500-page printout of the source code, which was written in Turbo Pascal. Since I suspected a control software bug as the culprit, I was tasked with finding it. But how could anyone debug such a voluminous amount of software, written by multiple development team members, none of them myself? And what debugging tools could I use to track down an issue that occurs only once every few weeks? The situation appeared hopeless.

I decided to make use of the three LEDs located on the front panel of the servo computer. Each LED can be in one of three states: on, off, or blinking. Three LEDs with three states each give 3 × 3 × 3 = 27 possible combinations. I divided the program into 27 parts and had each part display its own LED combination while it was executing. I then asked the security guard to record the LED pattern on display every time the antenna was “dancing”, before he shut down the power.
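The mapping from a code segment to an LED pattern can be sketched as a base-3 encoding. This is a minimal illustration in Python rather than the original Turbo Pascal; the names `LedState`, `encode_segment`, and `decode_pattern` are hypothetical, and the assignment of digits to LED states is an assumption.

```python
from enum import IntEnum


class LedState(IntEnum):
    """The three states each front-panel LED can show (digit values are assumed)."""
    OFF = 0
    ON = 1
    BLINK = 2


NUM_LEDS = 3
NUM_SEGMENTS = len(LedState) ** NUM_LEDS  # 3 ** 3 = 27 distinguishable patterns


def encode_segment(segment):
    """Return the LED states (least significant LED first) for a segment number."""
    if not 0 <= segment < NUM_SEGMENTS:
        raise ValueError("segment must be in the range 0..26")
    states = []
    for _ in range(NUM_LEDS):
        # Peel off one base-3 digit per LED.
        segment, digit = divmod(segment, len(LedState))
        states.append(LedState(digit))
    return tuple(states)


def decode_pattern(states):
    """Recover the segment number from a pattern noted down by an observer."""
    return sum(int(state) * len(LedState) ** i for i, state in enumerate(states))
```

Each code segment would set its pattern on entry, so whatever pattern is showing when the fault occurs names the segment that was running at that moment.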

After only two or three iterations of the “dancing antenna event”, the culprit area of the program was identified, located within a two-page portion of the original 2,500-page source code printout. I was admittedly thrilled at the seeming magic of my debugging technique. The culprit program segment implemented a 128 byte circular communication buffer. When the master computer was issuing commands, the buffer would store them until the servo computer could execute them. Occasionally, however, the motion trajectory was so fast that the buffer would also rapidly begin to fill up.

In the worst-case scenario, the entire 128-byte buffer would become full. The buffer management routine maintained two pointers: a read pointer to the next command to be executed and a write pointer to the last location written. The pointers normally circularly wrapped around after reaching the 128th location. However, in this particular situation the read pointer was erroneously advancing to an invalid 129th location instead. No wonder it would then read a junk motion control command, resulting in the antenna “dancing” erratically!
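For illustration, here is a minimal Python sketch of a 128-byte circular buffer whose pointers cannot leave the valid range. It is not the original Turbo Pascal routine: the class and method names are hypothetical, the write pointer here marks the next free slot rather than the last location written, and the explicit byte count used to tell "full" from "empty" is an assumption of this sketch.

```python
BUFFER_SIZE = 128  # size stated in the article


class CommandBuffer:
    """Circular byte buffer; both pointers wrap through one shared helper."""

    def __init__(self, size=BUFFER_SIZE):
        self._data = bytearray(size)
        self._size = size
        self._read = 0    # index of the next byte to hand to the servo loop
        self._write = 0   # index of the next free slot
        self._count = 0   # number of bytes currently stored

    def _advance(self, index):
        # The modulo keeps the index within 0..size-1, so no code path
        # can step to an invalid "129th" location.
        return (index + 1) % self._size

    def put(self, byte):
        """Store one byte; return False (and store nothing) if the buffer is full."""
        if self._count == self._size:
            return False
        self._data[self._write] = byte
        self._write = self._advance(self._write)
        self._count += 1
        return True

    def get(self):
        """Return the oldest byte, or None if the buffer is empty."""
        if self._count == 0:
            return None
        byte = self._data[self._read]
        self._read = self._advance(self._read)
        self._count -= 1
        return byte
```

Routing every pointer update through a single wrapping helper means the completely full case follows the same path as any other, which is the property the faulty routine lacked.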

I corrected the bug, to the delight of the other team members. The antennae had been running the risk of falling down during the “dancing”, leading to a fatal setback for our project. After more than three decades of development work, I have accumulated enough experience (and experiences) to come up with “life-saving” countermeasures for bugs such as these:

  • Motion control software needs to carry out a “sanity check” before executing any motion command. A structure with this much inertia must never be subjected to a violent acceleration beyond a reasonable threshold. Any command breaking this rule can be safely ignored, with an error subsequently flagged.
  • A simple checksum for every command bit stream could have identified a “junk motion command” situation such as the one described here.
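Both countermeasures can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the frame layout (payload bytes followed by one checksum byte), the additive 8-bit checksum, the function names, and the acceleration limit `MAX_ACCEL_DEG_PER_S2`, which is a placeholder rather than a real GMRT figure.

```python
MAX_ACCEL_DEG_PER_S2 = 0.5  # hypothetical limit, not a real GMRT value


def checksum(payload):
    """Simple additive checksum: sum of the payload bytes, modulo 256."""
    return sum(payload) & 0xFF


def frame_is_intact(frame):
    """Check a frame assumed to be laid out as payload bytes plus one checksum byte."""
    if len(frame) < 2:
        return False
    return checksum(frame[:-1]) == frame[-1]


def command_is_sane(current_velocity, commanded_velocity, dt,
                    max_accel=MAX_ACCEL_DEG_PER_S2):
    """Reject a command that would demand more acceleration than the limit allows."""
    if dt <= 0:
        return False
    required_accel = abs(commanded_velocity - current_velocity) / dt
    return required_accel <= max_accel


def accept_command(frame, current_velocity, commanded_velocity, dt):
    """Execute only commands that pass both checks; the caller flags an error otherwise."""
    return frame_is_intact(frame) and command_is_sane(
        current_velocity, commanded_velocity, dt)
```

A junk command read from an invalid buffer location would be likely to fail at least one of these two checks and be dropped instead of reaching the drives.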

Our project received a prestigious IEEE Milestone Award a few years ago. Needless to say, had this difficult-to-find bug not been identified and rectified, the project would never have seen the light of day, let alone earned the reputation it has built over the years among the international radio astronomy research community.

Vishwas Vaidya is a graduate of the Indian Institute of Technology in Delhi, India. Currently, he is self-employed as an engineering consultant and industry faculty member in the field of embedded systems for global automotive clients and high-repute academic institutions. Vishwas’ articles and research reports have appeared in many worldwide engineering publications.


The post The curious case of the dancing antennae appeared first on EDN.


