A few CPU hardware bugs

(taricorp.net)

126 points | by signa11 2 days ago

15 comments

  • nippoo 2 days ago

    My favourite one of this kind is the Rockchip RK808 RTC, where the engineers thought that November had 31 days. To this day it needs a Linux kernel patch that translates between the Gregorian and Rockchip calendars (which gradually diverge over time).

    Also one of my favourite kernel patch messages: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

    • nasretdinov 2 days ago

      It's always November, isn't it? I once built a log collection system that had a map of month names to months (I had to create it because Go's date package didn't support that specific abbreviation style for month names).

      As you might've guessed, the map lacked November, but no one noticed for 4+ months, and I've since left the company. It spawned a local meme, #nolognovember, and even went public (it was in Russia: https://pikabu.ru/story/no_log_november_10441606)

      • Gibbon1 2 days ago

        That's gold.

        That hardware real-time clocks keep time as a calendar date and time drives me batty. And no one does the right thing, which is just a 64-bit counter counting 32 kHz ticks. Then use canned, tested code to convert that to butt-scratching monkey time.

        Story: my old boss designed an STD Bus RTC card in 1978 or something. It kept time as YY:MM:DD HH:MM:SS plus 1/60 sec, was battery backed, and had shadow registers that latched the time. A couple of years later he redesigned it as a 32-bit seconds counter with a 32 kHz sub-seconds counter, plus a 48-bit offset register. What had been a whole card was now a couple of 4000-series ICs on the processor card. He wrote 400 bytes of Z80 assembly to convert that to date and time. He said it was tricky to get right, but once done it was done.
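
        Gibbon1's "just a 64-bit counter" scheme can be sketched in a few lines. This is a hypothetical illustration (the names and the 32.768 kHz rate are assumptions, not any particular chip's interface), leaning on tested library code for the calendar conversion:

```python
# Hypothetical sketch: a free-running 64-bit counter of 32.768 kHz ticks,
# plus a software offset register, converted to calendar time by tested
# library code rather than by the RTC hardware itself.
import datetime

TICKS_PER_SEC = 32768  # standard watch-crystal frequency

def ticks_to_datetime(ticks, offset_ticks=0):
    """Convert raw RTC ticks (plus an offset register value) to UTC."""
    total = ticks + offset_ticks
    seconds, sub = divmod(total, TICKS_PER_SEC)
    microseconds = sub * 1_000_000 // TICKS_PER_SEC
    base = datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)
    return base + datetime.timedelta(microseconds=microseconds)
```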

        • Someone 2 days ago

          I would guess that’s because those chips were designed for use in systems that didn’t have a CPU, but reading the data sheet, it doesn’t look as if you can easily hook up this thing to 7-segment LEDs, so maybe this is a matter of “this is how we always did it, and if it ain’t broke, don’t fix it”, and then ‘fix’ it, anyways?

          • alsetmusic 1 day ago

            I had some interactions with the guy responsible for the code that made our system do the right thing around Daylight Saving. Listening to him talk out loud as he thought about bugs was fascinating. He was clearly one of the smartest people I've met, and I would quickly fall behind as he reasoned through problems to himself. What a marvelous mind.

            • Some of them do have an epoch counter in addition to broken down time.

              • Gibbon1 1 day ago

                The Renesas RTC divides the 32 kHz clock by 256, and after waking up it doesn't update the shadow registers until the next tick. So if you wake out of deep sleep, you don't know what the time is for up to 8 ms.

                I know of one that draws 0.5uA in normal mode but 12uA in binary counter mode.

                • ralferoo 1 day ago

                  I propose naming the code that ensures this 8ms has elapsed "the yawn".

                  But to be fair, it doesn't seem that onerous an issue - the biggest problem would have been if this was completely undocumented. One obvious workaround is to read the time immediately on wake up, and then ignore the result until reading the time returns something different.
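
                  That workaround is simple enough to sketch. This is a hypothetical illustration, not real driver code - `read_rtc` stands in for whatever register read the actual part exposes:

```python
# Hypothetical sketch of the workaround: read the RTC once immediately on
# wake-up, then discard readings until the value differs from that first,
# possibly stale, read. `read_rtc` is a stand-in for the real register access.
def wait_for_fresh_time(read_rtc):
    possibly_stale = read_rtc()
    while True:
        now = read_rtc()
        if now != possibly_stale:
            return now  # the registers have ticked, so this value is fresh
```

                  (A real implementation would also want a timeout in case the clock is stopped entirely.)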

            • quotemstr 2 days ago

              That one is up there with the all time greats.

              • lsofzz 2 days ago

                > Rockchip calendars

                >.< haha i remember this

              • b1temy 2 days ago

                > the characters ’n’ and ‘o’ differ by only one bit; an unpredictable error that sets that bit could change GenuineIntel to GenuineIotel.

                On a QWERTY keyboard, the O key is also next to the I key. It's also possible someone accidentally fat-fingered "GenuineIontel", noticed something was off, moved their cursor between the "o" and "n", and accidentally hit Delete instead of Backspace.

                Maybe an unlikely set of circumstances, but I imagine a random bit flip caused at the hardware-level is rare since it might cause other problems, if something more important was bit-flipped.
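
                A quick check of the one-bit claim, just as an illustration:

```python
# Verify that 'n' (0x6E) and 'o' (0x6F) differ in exactly one bit, so a
# single flipped bit turns "GenuineIntel" into "GenuineIotel".
def hamming_distance(a: str, b: str) -> int:
    """Number of differing bits between two single characters."""
    return bin(ord(a) ^ ord(b)).count("1")

name = list("GenuineIntel")
name[8] = chr(ord(name[8]) ^ 0b1)  # flip the low bit of the 'n' in "Intel"
flipped = "".join(name)
```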

                • matja 2 days ago

                  I like this theory - I can totally imagine some big spreadsheet of processor model names where someone copy/pastes the model name to some janky firmware-programming utility running on an off-the-shelf mini PC on the manufacturing floor, implemented as a "temporary fix" 5 years ago, every time the production line changes CPU model.

                • userbinator 2 days ago

                  I am reminded of the old AMD CPUs with "unlockable" extra cores, which, when unlocked, would change the model name to something unusual.

                  "GenuineIotel" is definitely odd, but difficult to research more about; I suspect these CPUs might actually end up being collector's items sometime in the future.

                  > because inserting no-op instructions after them prevents the issue.

                  The early 386s were extremely buggy and needed the same workaround: https://devblogs.microsoft.com/oldnewthing/20110112-00/?p=11...

                  • pm215 2 days ago

                    Some of the 386 bugs described there sound to me like the classic kind of "multiple different subsystems interact in the wrong way" issue that can slip through the testing process and get into hardware, like this one:

                    > For example, there was one bug that manifested itself in incorrect instruction decoding if a conditional branch instruction had just the right sequence of taken/not-taken history, and the branch instruction was followed immediately by a selector load, and one of the first two instructions at the destination of the branch was itself a jump, call, or return.

                    Even if you write up a comprehensive test plan for the branch predictor, and for selector loads, and so on, it might easily not include that particular corner case. And pre-silicon testing is expensive and slow, which also limits how much of it you can do.

                    • adrian_b 2 days ago

                      The 80386 (1985) did not have a branch predictor; Intel first used one in the Pentium (1993).

                      Nevertheless, the states of the internal pipelines, which were supposed to be stopped, flushed and restarted cleanly by taken branches, depended on whether the previous branches had been taken or not taken.

                      • pm215 2 days ago

                        Ah, thanks for that correction -- I jumped straight from "depends on the history of conditional branches" to "branch predictor" without stopping to think that that would have been unlikely in the 386.

                        • adrian_b 2 days ago

                          Before having branch predictors, most CPUs that used any kind of instruction pipelining behaved like a modern CPU where all the branches are predicted as not taken.

                          Thus, on an 80386 or 80486, not-taken branches behaved like correctly predicted branches on a modern CPU, and taken branches behaved like mispredicted branches.

                          The 80386 bug described above was probably caused by some kind of incomplete flushing of a pipeline after a taken branch, which left it in a partially invalid state that could be exposed by a specific sequence of following instructions.

                      • Taniwha 2 days ago

                        This sort of bug, especially in and around pipelines, is always hard to find. On chips I've built, we had one guy who made a system that would generate random instruction streams to try to trigger as many of them as we possibly could

                        • pm215 2 days ago

                          Yeah, I think random-instruction-sequence testing is a pretty good approach to try to find the problems you didn't think of up front. I wrote a very simple tool for this years ago to help flush out bugs in QEMU: https://gitlab.com/pm215/risu

                          Though the bugs we were looking to catch there were definitely not the multiple-interacting-subsystems type, and more just the "corner cases in input data values in floating point instructions" variety.
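
                          A toy version of the idea, with made-up mnemonics (real tools like risu or RISCV-DV do far more - golden-model comparison, constrained generation, state diffing):

```python
# Toy random-instruction-stream generator: emit short sequences drawn from
# a small pool chosen to create back-to-back hazards (e.g. a multiply
# immediately followed by a load). A real harness would run each stream on
# the device under test and on a golden model, then diff architectural state.
import random

INSTRUCTIONS = [
    "mul a0, a1, a2",
    "lw a3, 0(sp)",
    "add a0, a0, a1",
    "nop",
]

def random_stream(length, seed=None):
    rng = random.Random(seed)  # seeded, so failing streams are reproducible
    return [rng.choice(INSTRUCTIONS) for _ in range(length)]
```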

                          • Taniwha 1 day ago

                              I think FP needs its own custom tests (billions of them!) - I hate building FP units, they are really the pits

                      • pjc50 2 days ago

                        The revenge of the MIPS delay slot (the architecture simply didn't handle certain aspects of pipelining, so NOPs were required and documented as such).

                        • ralferoo 1 day ago

                          It's not quite true to say NOPs were required. It was fairly common to just reorder instructions, moving the branch one instruction earlier so that the following instruction would execute before the branch target.

                          Delay slots also weren't that uncommon: SPARC had one that operated similarly to MIPS's.

                      • Retr0id 2 days ago

                        The GenuineIotel thing fascinates me because I can't fully grasp how it could happen. I can imagine a physical defect causing a permanent wrong-bit in a specific piece of silicon, but it seems more widespread than that. Perhaps some kind of bug in the logic synthesis process?

                        • qiqitori 2 days ago

                          I came across a CPU bug that prevented Linux from booting on 3rd gen i3/i5/i7 CPUs. Did a bunch of printf debugging until I was right before the freeze. Then found something relevant in the CPU errata. It could be "fixed" by passing in noapic. I had a decent writeup on the old CentOS forums, but they're gone, and I don't have a copy of my writeup anymore.

                          • qiqitori 2 days ago

                            I just checked an old note I found, maybe it was noclflush actually. Affected one or more versions starting with 2.6.32-754.

                            • jraph 2 days ago

                              Maybe the Wayback machine archived it by any chance?

                              • Depends when it was - Web Archive didn't seem to archive forums for a long time (maybe it does now?).

                              • pixl97 2 days ago

                                noapic seemed to be a really common 'fix' for CPU and BIOS issues back then.

                              • chme 1 day ago

                                I had to deal with the Intel Quark SoC X1000 on a Galileo board years ago, where the LOCK prefix caused segfaults. Since the SoC is single-threaded, the LOCK prefix could just be patched out of the resulting binaries until the compiler/build system was patched.

                                https://en.wikipedia.org/wiki/Intel_Quark#Segfault_bug

                                • charcircuit 2 days ago

                                  >The workaround for this is to cripple the system

                                  That is not the workaround in the documentation that was just linked.

                                    Workarounds:
                                    The solution to this problem is to put two instructions that do not require write back data after the mul instruction.
                                  
                                  This seems reasonable for your compiler vendor to implement without getting rid of multiplication altogether.
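
                                  As a toy illustration of what such a fix could look like - a peephole pass over textual instructions (the mnemonic list is illustrative, not any real vendor tool):

```python
# Toy peephole pass implementing the documented workaround: after every
# multiply-family instruction, insert two instructions that don't write
# back data (plain NOPs here) before anything else can follow the mul.
MUL_MNEMONICS = {"mul", "mulh", "mulhu", "mulhsu"}

def insert_mul_nops(instructions):
    patched = []
    for ins in instructions:
        patched.append(ins)
        if ins.split()[0] in MUL_MNEMONICS:
            patched.extend(["nop", "nop"])  # two non-write-back slots
    return patched
```
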
                                  • NobodyNada 1 day ago

                                    There's a difference in effort of several orders of magnitude between "change a setting so the compiler doesn't emit multiplies" and "convince GCC/LLVM to add a special-case flag for one very rare chip, or maintain your own fork". The vendor's workaround is the "ideal" solution, but disabling multiplies is a lot more practical if you don't need the performance.

                                    They also mention in the next sentence that they adopted the "correct" workaround (by providing a multiplication library function for the compiler to call).

                                    • charcircuit 1 day ago

                                      The company selling the chip can create a fork. They are typically the ones providing all of the sdks for you to use in order to use it, flash it, debug it, etc.

                                    • direwolf20 2 days ago

                                      even if you don't know which instructions they are, just place two nops after every mul, problem solved

                                    • IshKebab 2 days ago

                                      > To me, this issue doesn’t seem as embarrassing as Intel’s wrong CPUIDs. Pipelined CPUs are hard to build

                                      I disagree. Misspelling a name in the CPUID is kind of easy to do, somewhat awkward to test (in a non-tautological way), and pretty easy to work around.

                                      Having `mul ...; lw ...;` fail shows that they've done very little testing of the chip. Any basic randomised pipeline testing would hit that trivial case.

                                      Essentially all CPUs are pipelined today. In-order pipelined CPU execution semantics are not particularly hard to test. Even some open source testing systems could detect this bug, e.g. TestRig or RISCV-DV.

                                      • direwolf20 2 days ago

                                        When you have a known hardware bug like needing a nop after every mul, compilers can do this. You don't need to turn off mul entirely.

                                        • 0xTJ 2 days ago

                                          The issue is that it's no longer actually RISC-V M at that point; you're changing the instruction set. Code compiled for standard RISC-V M doesn't include the extra NOPs.

                                          That being said, the disabling of MUL is being done at a software project level here, not by the CPU vendor. It's in the same linked commit that added in the NOP instructions to the arithmetic routines.

                                          • direwolf20 2 days ago

                                            If your software runs on any chip and your chip runs any software, you have a problem, but in embedded cases, you know which chip runs which software, because you designed them together.

                                            • Neywiny 2 days ago

                                              This is very true and why I'm not liking that Xilinx is trying to go the other way. It really gets in the way and doesn't work. I know what's connected to what and how, but their system device tree generator doesn't and it yells really loud about that. And I don't even need a device tree, just xparameters.h

                                        • direwolf20 2 days ago

                                          Will someone register the Iotel trademark and sue Intel? That was the purpose of the Intel string in reverse!

                                          • nabbed 1 day ago

                                            Wasn't there also a Pentium division bug of some sort? I didn't pay much attention to microcomputers back then (being a mainframe programmer at the time), but I remember hearing about it from the mainstream news.

                                            • mzs 2 days ago
                                              • hikkerl 2 days ago

                                                [flagged]

                                                • 6K76981-O 2 days ago

                                                  Writing software in embedded processor pipelines for bugs in the IT81202 CPU.

                                                  Microcode errata re-writes to GPR, compiling low level "mul," and "output," CPU RISC V to system architecture.