
Oh man.

I was writing the motor controller code for a new submersible robot my PhD lab was building. We had bought one of the very first compact PCI boards on the market, and it was so new we couldn't find any cPCI motor controller cards, so we bought a different format card and a motherboard that converted between compact PCI bus signals and the signals on the controller boards. The controller boards themselves were based around the LM629, an old but widely used motor controller chip.

To interface with the LM629 you have to write to 8-bit registers that are mapped to memory addresses and then read back the result. The 8-bit part is important, because some of the registers are read-only or write-only, and reading a write-only register or writing a read-only one throws the chip into an error state.
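For anyone who hasn't done this kind of thing, here is a minimal sketch of byte-wide memory-mapped access in C. The register offsets and function names are made up for illustration; this is not the real LM629 register map or the code from the robot.

```c
#include <stdint.h>

/* Hypothetical register offsets, NOT the real LM629 register map. */
enum { REG_STATUS = 0, REG_COMMAND = 1 };

/* Read exactly one byte from a memory-mapped register.
   volatile stops the compiler from caching, merging, or dropping the access. */
static inline uint8_t reg_read8(volatile uint8_t *base, unsigned offset)
{
    return base[offset];
}

/* Write exactly one byte to a memory-mapped register. */
static inline void reg_write8(volatile uint8_t *base, unsigned offset, uint8_t value)
{
    base[offset] = value;
}
```

On real hardware `base` would point at wherever the card is mapped. Note that the 8-bit type only controls what the CPU asks for, not what any bridge in between does with the request.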

LM629s are dead simple, but my code didn't work. It. Did. Not. Work. The chip kept erroring out. I had no idea why. It's almost trivially easy to issue 8-bit reads and writes to specific memory addresses in C. I had been coding in C since I was fifteen years old. I banged my head against it for two weeks.

Eventually we packed up the entire thing in a shipping crate and flew to Minneapolis, the site of the company that made the cards. They looked at my code. They thought it was fine.

After three days the CEO had pity on us poor grad students and detailed his highly paid digital logic analyst to us for an hour. He carted in a crate of electronics that were probably worth about a million dollars. Hooked everything up. Ran my code.

"You're issuing a sixteen-bit read, which is reading both the correct read-only register and the next adjacent register, which is write-only," he said.

I showed him in my code where the read in question was very clearly a *CHAR*. 8 bits.

"I dunno," he said - "I can only say what the digital logic analyzer shows, which is that you're issuing a sixteen bit read."

Eventually, we found it. The Intel bridge chip that did the bus conversion had a known bug, which was clearly documented in an 8-point footnote on page 79 of the manual: 8-bit reads were translated to 16-bit reads on the cPCI bus, and then the 8 most significant bits were thrown away.
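To make the failure concrete, here is a toy model of it in C. Everything in it is an assumption for illustration: the `toy_chip` layout (read-only register at offset 0, write-only at offset 1), the function names, and the choice that the requested byte is the low half of the 16-bit word. It is a sketch of the behaviour described above, not the bridge's actual logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy device: offset 0 is read-only, offset 1 is write-only (hypothetical). */
typedef struct {
    uint8_t status;  /* value held in the read-only register */
    bool    error;   /* set once a write-only register has been read */
} toy_chip;

/* What the device sees for a single byte-wide read cycle. */
static uint8_t chip_read(toy_chip *c, unsigned offset)
{
    if (offset == 0)
        return c->status;
    c->error = true;  /* reading the write-only register is illegal */
    return 0xFF;
}

/* A well-behaved bridge: one byte requested, one byte read. */
static uint8_t bridge_read8_ok(toy_chip *c, unsigned offset)
{
    return chip_read(c, offset);
}

/* The described bug: the byte read becomes a 16-bit read covering
   offset and offset + 1, and the upper 8 bits are then discarded. */
static uint8_t bridge_read8_buggy(toy_chip *c, unsigned offset)
{
    uint16_t word = (uint16_t)(chip_read(c, offset) |
                               (chip_read(c, offset + 1) << 8));
    return (uint8_t)(word & 0xFF);
}
```

Both bridge functions return the same byte to the caller, which is why the software looks correct. Only the buggy one leaves `error` set, because the chip also saw a read of the neighbouring write-only register.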

In other words, a hardware bug. One that would only manifest in these very specific circumstances.

We fixed it by taking a razor knife to the bus address lines and shifting them to the right by one, and then taking the least significant line and mapping it all the way over to the left, so that even and odd addresses resolved to completely different memory banks. Thus, reads to odd addresses resolved to addresses way outside those the chip was mapped to, and it never saw them. Adjusted the code to the (new) correct address range. Worked like a charm.
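In software terms the rewiring amounts to a rotate-right-by-one on the address. A rough sketch, assuming a 16-bit address field (the real width isn't stated here) and with made-up function names:

```c
#include <stdint.h>

#define ADDR_BITS 16u  /* assumed width for illustration only */

/* Address the card sees after the rewiring: every line shifted right
   by one, with the old least significant line moved to the top. */
static uint32_t rewired_addr(uint32_t bus_addr)
{
    uint32_t mask = (1u << ADDR_BITS) - 1u;
    bus_addr &= mask;
    return (bus_addr >> 1) | ((bus_addr & 1u) << (ADDR_BITS - 1u));
}

/* Bus address the code must now use to reach register n on the card. */
static uint32_t bus_addr_for_reg(uint32_t reg)
{
    return reg << 1;
}
```

With these assumptions, register 3 is reached at bus address 6, and the stray access to bus address 7 that comes along with the widened read lands at 0x8003 on the card side, far from the chip's registers.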

But I feel bad for the next grad student who had to work on that robot. "You are not expected to understand this."



Heh messing with the traces is pretty nuts. In the book Where Wizards Stay Up Late about the origins of the internet there was a similar story. Some grad student needed a delay in some execution path so he cut the trace and used a very long wire to introduce the delay. I can’t remember the story exactly but it was like the ultimate brute force fix, using the laws of physics as the hack.


It's not a bug! It's a clearly documented feature! /s


This is a much better war story than most of its kind, thanks. I was terrified that you were just blithely doing a 16 bit read, I'm glad there was a much better explanation.


11/10 solution



