
Embedded systems often have crappy compilers. And you sometimes have to pay crazy money to be abused, as well.

Years ago, we were building an embedded vehicle tracker for commercial vehicles. The hardware used an ARM7 CPU, GPS, and GPRS modem, running uClinux.

We ran into a tricky bug in the initial application startup process. The program that read from the GPS and sent location updates to the network was failing. When it did, the console stopped working, so we could not see what was happening. Writing to a log file gave the same results.

For regular programmers, if your machine won't boot up, you are having a bad day. For embedded developers, that's just a typical Tuesday, and your only debugging option may be staring at the code and thinking hard.

This board had no Ethernet and only two serial ports, one for the console and one hard-wired for the GPS. The ROM was almost full (it had a whopping 2 MB of flash, 1 MB for the Linux kernel, 750 KB for apps, and 250 KB for storage). The lack of MMU meant no shared libraries, so every binary was statically linked and huge. We couldn't install much else to help us.

A colleague came up with the idea of running gdb (the text mode debugger) over the cellular network. It took multiple tries due to packet loss and high latency, but suddenly, we got a stack backtrace. It turned out `printf()` was failing when it tried to print the latitude and longitude from the GPS, both floating-point numbers.
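For anyone hitting a similar floating-point `printf()` failure: a common way to sidestep it while the toolchain is being sorted out is to keep coordinates as scaled integers and format them with integer arithmetic only. This is not what the team in the story did; it is just a minimal sketch of the idea. The function name `format_microdegrees` and the choice of microdegrees as the unit are assumptions for illustration. It is written in Python for readability, though on the target the same logic would be C.

```python
def format_microdegrees(value: int) -> str:
    """Format a coordinate stored as integer microdegrees without using floats.

    Hypothetical helper: the unit (1e-6 degree) is an assumption for this sketch.
    """
    # Handle the sign separately so values between -1 and 0 degrees
    # still print with a leading minus (e.g. -500 -> "-0.000500").
    sign = "-" if value < 0 else ""
    magnitude = abs(value)
    whole = magnitude // 1_000_000      # integer degrees
    frac = magnitude % 1_000_000        # remaining microdegrees
    return f"{sign}{whole}.{frac:06d}"  # zero-pad the fractional part
```

The separate sign handling matters because the integer part alone cannot carry the minus sign when it is zero.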

A few hours of debugging and scouring five-year-old mailing list posts turned up a patch to GCC (never applied), which fixed a bug on the ARM7 that affected uClibc.

This made me think of how the folks who make the space probes debug their problems. If you can't be an astronaut, at least you can be a programmer, right? :-)




At least the debugger worked. The processor I used in embedded systems in college, the 68HC11, would stop doing conditional branches when the supply voltage was too low.

We had a battery-powered board, with no brownout detection, and I was using rechargeable NiMH batteries to save money/waste. When the students with alkaline batteries had low batteries, the motor load would bring Vcc down far enough that the CPU would reset by itself. With NiMH, the batteries could still drive the motors and keep the CPU alive...

You could single step in the debugger, and see the flag register was set as expected, but the branch didn't happen. Just ran straight through. I can't remember if unconditional jump or call worked. After about the third time this happened, I got good at figuring it out.


> For regular programmers, if your machine won't boot up, you are having a bad day. For embedded developers, that's just a typical Tuesday, and your only debugging option may be staring at the code and thinking hard.

Of course where it becomes even more fun is when it's a customer's unit in Peru and you can't replicate it locally :). But oh how I love it. I have definitely spent many a day staring at code piecing things together with what limited info we have.

But to get back on topic, I can definitely concur on the quality of most embedded compilers. It's a great day when I can just use normal old gcc. I've never run into anything explicitly wrong, but I see so many bits of weird codegen or missed optimisations that I keep the disassembly view open permanently, as a sanity check. The assembly never lies to you - until you find a silicon bug at least.


> For embedded developers, that's just a typical Tuesday

I was trying to explain to my colleague the other day that I've spent an unhealthy amount of time rebooting devices while staring at an LED wondering why it won't turn on.


Tuesday, indeed. :)

In the embedded world, correctly working hardware isn't a given, either. Part of the board bringup/hardware verification process is just determining that everything on the board actually works. Always fun when you have to figure out if a problem is in your code or in the hardware. (HINT: It's often both.)

It's rare that you need to break out the oscilloscope or logic analyzer, but when you absolutely have to know if that line went high or not, there's no substitute. :)


> (HINT: It's often both.)

Or worse, it’s neither! By which I mean both. Neither part of the design is technically wrong but the fault is in the way the two interact. Those are some of the fun ones… I had one where I had to make sure the chip select line was off before turning power off to a chip, because CS would keep it half powered.


At a sufficiently high resolution, all digital electronics is actually analog. :/


It is nuts to have a dev board that is as constrained as the final device. You should have had an additional serial port and 8x as much flash; it would have solved your problem immediately.

It is even better to do the bulk of the dev inside of an emulator if you can swing it. The GPS and GPRS could be tethered into the emulator instead of trying to get a debug link into the system board.


Were these commodity boards? Having to resort to using the cellular connection, instead of attaching a hardware debugging probe (J-Link?) seems like a recipe for a painful squandering of intellect.


One of the lovely "features" of embedded work is that after a while of doing this sort of thing, sometimes you get good enough at the crazy hacks that it becomes faster and easier to do something like this than to track down who has the J-Link (okay, they've usually got more than one) and can they spare it/where did they put it/why does that person have a J-Link at all/is the J-Link still alive....


Oof, I remember doing lots of embedded stuff at university and this rings true.

The compiler we used was built off gcc so it was reasonably good but I remember we had some weird crash one day that I couldn't figure out. Eventually I added some inline assembly to do an absolute jump to the next place that it needed to go and it started working again. I was too inexperienced to know how to dig deeper but presumably the code generator had inserted something weird that was causing a crash.


Yeah, I have a war story...

I was working on mobile robot research at JPL back in the 1990s. We had a robot with an arm attached. It worked fine except that every now and then the whole system would crash hard with a totally corrupted heap and stack, just random data everywhere. So no chance of a backtrace. The really weird thing was that this only happened when the arm was moving. We also had the exact same system running under a different operating system and we never had any problems there, so we were 100% sure it was not a compiler error.

It was a compiler error.

It took us a year to figure out what was going on. It turned out that the compiler had a bug where it would emit code that would pop the stack pointer and then pull a value out of the now unprotected stack frame. On the non-embedded system this did not cause any problems, but on the embedded system (running VxWorks) hardware interrupts used the same stack as the process that was running when the interrupt hit. So if we happened to get an interrupt just after the stack pointer was popped but before the unprotected value was grabbed, that value would get stomped on by the interrupt handler. Then, when the interrupt handler returned, the process would resume, grab the now-random value, and chaos ensued.
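To make the window concrete, here is a toy simulation of that ordering problem. It is only a sketch: the `Machine` class, the stack size, and the values 42 and 0xDEAD are all made up for illustration, and it models nothing about the real compiler or VxWorks beyond "the interrupt handler pushes onto the interrupted task's stack".

```python
class Machine:
    """Toy CPU with a downward-growing stack shared by task and interrupt."""

    def __init__(self, size=16):
        self.stack = [0] * size
        self.sp = size  # slots at index >= sp are live; below sp is free space

    def push(self, value):
        self.sp -= 1
        self.stack[self.sp] = value

    def interrupt(self):
        # The handler borrows the current stack: push scratch words, pop them.
        for _ in range(4):
            self.push(0xDEAD)
        self.sp += 4


def epilogue(m, buggy, interrupt_in_window):
    """Release the frame and return the local, in buggy or correct order."""
    slot = m.sp                   # location of the local on top of the stack
    if buggy:
        m.sp += 1                 # frame released first: slot is now unprotected
        if interrupt_in_window:
            m.interrupt()         # handler's pushes overwrite the old slot
        return m.stack[slot]      # read happens too late
    value = m.stack[slot]         # correct order: read while still protected
    if interrupt_in_window:
        m.interrupt()             # handler writes below sp, slot is untouched
    m.sp += 1
    return value


def run(buggy, interrupt_in_window):
    m = Machine()
    m.push(42)                    # the local value the function wants to return
    return epilogue(m, buggy, interrupt_in_window)
```

In this model the buggy ordering still returns 42 whenever no interrupt lands in the window, which is why a bug like this can hide for so long; only `run(buggy=True, interrupt_in_window=True)` comes back with the handler's garbage.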


How many novel depressions were created as a result of high velocity impacts after making that discovery? I think I'd be seeing red...


Actually, I remember being thrilled to have finally figured it out. We had been beating our heads against the wall (metaphorically) for a year, and I remember looking at the screen at the disassembly sequence and thinking, Oh my God, I think I've found it! It felt like making a major scientific discovery. (To be fair, I was only able to do this after others laid the groundwork for me by finding ways to reliably reproduce the problem. But I'm the one who spent hours single-stepping through assembly code before finally realizing what was happening.)

I also remember reporting the problem to one of the authors of the compiler (I think it was David Kranz) so he could fix it in the next version and him telling me that there wasn't going to be a next version because the funding for the project had been cut. There was no GitHub in those days so the whole thing just faded into the mists of time, which is a real shame because the system really kicked ass.

The whole history of the project can be found here:

https://paulgraham.com/thist.html


> For regular programmers, if your machine won't boot up, you are having a bad day. For embedded developers, that's just a typical Tuesday, and your only debugging option may be staring at the code and thinking hard.

It seems to me that if you can still update and reboot said machine, you can do a bisect on your commits to pinpoint the regression. Once you spot the regression commit you can split it to check what introduced the regression.
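For reference, the search itself is just a binary search over the commit list, which is what `git bisect` automates. A minimal sketch, assuming the first commit in the list is known good and the last is known bad; `first_bad` and the `is_bad` callback (which would stand for "flash this build and see if it boots") are hypothetical names for illustration:

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad(commit) is true.

    Assumes commits[0] is good, commits[-1] is bad, and that once a
    commit is bad every later commit is bad too.
    """
    lo, hi = 0, len(commits) - 1      # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid                  # regression is at mid or earlier
        else:
            lo = mid                  # regression is after mid
    return commits[hi]
```

Each call to `is_bad` costs one flash-and-reboot cycle, so the number of cycles grows with the logarithm of the number of commits rather than linearly.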


It took them multiple tries just to use gdb; I don’t think this is a scenario where you can easily reflash the image on the board.


Did the GCC patch get applied after that?


"Never" implies no, I guess. :-)



