The Therac-25 Incident (2021)

(thedailywtf.com)

449 points | by lemper 184 days ago

49 comments

  • benrutter 184 days ago

    > software quality doesn't appear because you have good developers. It's the end result of a process, and that process informs both your software development practices, but also your testing. Your management. Even your sales and servicing.

    If you only take one thing away from this article, it should be this! The Therac-25 incident is a horrifying and important part of software history. It's really easy to think type systems, unit testing and defensive coding can solve all software problems. They can definitely help a lot, but the real failure in the Therac-25 story, from my understanding, is that it took far too long for incidents to be reported, investigated and fixed.

    There was a great Cautionary Tales podcast about the device recently[0]. One thing mentioned was that, even aside from the catastrophic accidents, Therac-25 machines routinely showed operators unexplained errors, but these issues never made it to the desk of someone who might fix them.

    [0] https://timharford.com/2025/07/cautionary-tales-captain-kirk...

    • elric 184 days ago

      One of the commenters on the article wrote this:

      > Throughout the 80s and 90s there was just a feeling in medicine that computers were dangerous <snip> This is why, when I was a resident in 2002-2006 we still were writing all of our orders and notes on paper.

      I was briefly part of an experiment with electronic patient records in an ICU in the early 2000s. My job was to basically babysit the server processing the records in the ICU.

      The entire staff hated the system. They hated having to switch to computers (this was many years pre-iPad and similarly sleek tablets) to check and update records. They were very much used to writing medications (what, when, which dose, etc.) onto bedside charts, which were very easy to consult and very easy to update. Any kind of data loss in those records could have fatal consequences. Any delay in getting to the information could be bad.

      This was *not* just a case of doctors having unfounded "feelings" that computers were dangerous. Computers were very much more dangerous than pen and paper.

      I haven't been involved in that industry since then, and I imagine things have gotten better since, but still worth keeping in mind.

      • OskarS 184 days ago

        It's interesting to compare this with the Post Office scandal in the UK. Very different incidents, but reading this, there is arguably a root assumption people made in both cases: "the software can't be wrong". To developers this is hilariously silly, but non-developers looking at it from the outside don't have the training to understand that software can be this fragile. They look at a situation like the Post Office scandal and think: "Either this piece of software we paid millions for, developed by a bunch of highly trained engineers, is wrong, or these people are just ripping us off." Same thing with the Therac-25: the software had worked on previous models, and the rest of the company had an unspoken assumption that nothing could possibly be wrong with it, so testing it specifically wasn't needed.

        • isopede 184 days ago

          I strongly believe that we will see an incident akin to Therac-25 in the near future. With as many people running YOLO mode on their agents as there are, Claude or Gemini is going to be hooked up to some real hardware that will end up killing someone.

          Personally, I've found even the latest batch of agents fairly poor at embedded systems, and I shudder at the thought of giving them the keys to the kingdom to say... a radiation machine.

          • haunter 184 days ago

            My "favorite" part:

            >One failure occurred when a particular sequence of keystrokes was entered on the VT100 terminal that controlled the PDP-11 computer: If the operator were to press "X" to (erroneously) select 25 MeV photon mode, then use "cursor up" to edit the input to "E" to (correctly) select 25 MeV Electron mode, then "Enter", all within eight seconds of the first keypress and well within the capability of an experienced user of the machine, the edit would not be processed and an overdose could be administered. These edits were not noticed as it would take 8 seconds for startup, so it would go with the default setup

            Kinda reminds me how everything is touchscreen nowadays, from car interfaces to industry-critical software.
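            The failure mode in that quote can be sketched as a tiny state machine (a hypothetical Python model, not the original PDP-11 code): the beam subsystem latches the operator's mode when the 8-second magnet setup starts and never re-reads it, so an edit made during that window is silently lost.

```python
# Hypothetical model of the Therac-25 style race. The subsystem
# latches the mode when magnet setup starts; edits made while the
# magnets settle update the screen but not the latched value.

class TreatmentConsole:
    SETUP_TICKS = 8  # magnet setup takes ~8 seconds

    def __init__(self):
        self.entered_mode = None   # what the operator last typed
        self.active_mode = None    # what the beam subsystem will use
        self.ticks_remaining = 0

    def keypress(self, mode):
        self.entered_mode = mode
        if self.ticks_remaining == 0:
            # First entry: latch it and start magnet setup.
            self.active_mode = mode
            self.ticks_remaining = self.SETUP_TICKS
        # BUG: edits during setup change entered_mode only;
        # active_mode is never re-latched.

    def tick(self):
        if self.ticks_remaining > 0:
            self.ticks_remaining -= 1

    def fire(self):
        assert self.ticks_remaining == 0, "setup still in progress"
        return self.active_mode

console = TreatmentConsole()
console.keypress("X")            # erroneously select photon mode
console.tick(); console.tick()   # two seconds pass
console.keypress("E")            # operator corrects to electron mode
for _ in range(8):
    console.tick()
print(console.fire())            # prints "X": the stale, dangerous mode
```

            The fix is as unglamorous as the bug: re-read the operator's entry at the moment of firing, or abort the setup whenever an edit arrives.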

            • michaelt 184 days ago

              I'd be interested in knowing how many of y'all are being taught about this sort of thing in college ethics/safety/reliability classes.

              I was taught about this in engineering school, as part of a general engineering course also covering things like bathtub reliability curves and how to calculate the number of redundant cooling pumps a nuclear power plant needs. But it's a long time since I was in college.

              Is this sort of thing still taught to engineers and developers in college these days?

              • Tenemo 184 days ago

                The full 1993 report linked in the article has an interesting statement regarding software developer certification in the "Lessons learned" chapter:

                > Taking a couple of programming courses or programming a home computer does not qualify anyone to produce safety-critical software. Although certification of software engineers is not yet required, more events like those associated with the Therac-25 will make such certification inevitable. There is activity in Britain to specify required courses for those working on critical software. Any engineer is not automatically qualified to be a software engineer — an extensive program of study and experience is required. Safety-critical software engineering requires training and experience in addition to that required for noncritical software.

                After 32 years, this didn't go the way the report's authors expected, right?

                • throwaway0261 184 days ago

                  One of the comments said this:

                  > That standard [IEC 62304] is surrounded by other technical reports and guidances recognized by the FDA, on software risk management, safety cases, software validation. And I can tell you that the FDA is very picky, when they review your software design and testing documentation. For the first version and for every design change.

                  > That’s good news for all of us. An adverse event like the Therac 25 is very unlikely today.

                  This is a case where regulation is a good thing. Unfortunately, I see a trend lately where almost any regulation is seen as something stopping innovation and business growth. There is room for improvement and some areas are over-regulated, but we don't want to take a "DOGE" chainsaw to regulations without knowing what the consequences are.

                  • rossant 184 days ago

                    The first commenter on this site introduces himself as "a physician who did a computer science degree before medical school." He is now president of the Ray Helfer Society [1], "an honorary society of physicians seeking to provide medical leadership regarding the prevention, diagnosis, treatment and research concerning child abuse and neglect."

                    While the cause is noble, the medical detection of child abuse faces serious issues with undetected and unacknowledged false positives [2], since ground truth is almost never knowable. The prevailing idea is that certain medical findings are considered proof beyond reasonable doubt of violent abuse, even without witnesses or confessions (denials are extremely common). These beliefs rest on decades of medical literature regarded by many as low quality because of methodological flaws, especially circular reasoning (patients are classified as abuse victims because they show certain medical findings, and then the same findings are found in nearly all those patients—which hardly proves anything [3]).

                    I raise this point because, while not exactly software bugs, we are now seeing black-box AIs claiming to detect child abuse with supposedly very high accuracy, trained on decades of this flawed data [4, 5]. Flawed data can only produce flawed predictions (garbage in, garbage out). I am deeply concerned that misplaced confidence in medical software will reinforce wrongful determinations of child abuse, including both false positives (unjust allegations potentially leading to termination of parental rights, foster care placements, imprisonment of parents and caretakers) and false negatives (children who remain unprotected from ongoing abuse).

                    [1] https://hs.memberclicks.net/executive-committee

                    [2] https://news.ycombinator.com/item?id=37650402

                    [3] https://pubmed.ncbi.nlm.nih.gov/30146789/

                    [4] https://rdcu.be/eCE3l

                    [5] https://www.sciencedirect.com/science/article/pii/S002234682...

                    • zackmorris 184 days ago

                      Our power went off a couple of weeks ago, probably due to wind knocking a branch into a power line. Now our Frigidaire microwave runs with the door open.

                      Supposedly there are mechanical switches that prevent that, but evidently "modern" microwaves can control the gun through the logic board.

                      The engineering failures that led to this, from conceptual to design to internal control, boggle my mind. I'm not even sure where to send a complaint or if it would result in any kind of compensation. Because billion dollar corporations know that they'll never have to face any kind of corporate death penalty because they're protected by limited liability. So we'll just buy another $150 microwave instead.

                      Are smaller companies better at engineering safety? Evidently not.

                      • mdavid626 184 days ago

                        Some sanity checks are always a good idea before running such a destructive action (IF beam_strength > REASONABLY_HIGH_NUMBER THEN error). Of course the UI bug is hard to catch, but the sanity check would have prevented this completely: the machine would just end up in an error state rather than killing patients.
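                        A minimal sketch of such a last-line check (hypothetical names and limit; real dose bounds are device- and prescription-specific):

```python
# Independent sanity check before a destructive action: reject any
# requested dose outside a hard bound, regardless of what the UI or
# state machine believes. The limit here is purely illustrative.

MAX_SAFE_DOSE_RADS = 200

class DoseLimitError(Exception):
    pass

def fire_beam(requested_dose_rads: float) -> str:
    # Last line of defense, deliberately ignorant of the UI state.
    if not (0 < requested_dose_rads <= MAX_SAFE_DOSE_RADS):
        raise DoseLimitError(
            f"dose {requested_dose_rads} rads outside safe range"
        )
    return f"firing {requested_dose_rads} rads"

print(fire_beam(180))        # a normal prescription passes
try:
    fire_beam(25_000)        # the Therac-25 overdose level is rejected
except DoseLimitError as e:
    print("blocked:", e)
```

                        The point is that the check depends only on the physical request, not on the (possibly corrupted) state that produced it.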

                        • MarkusWandel 184 days ago

                          In a quick skim of the comments so far, I don't see the real smoking gun.

                          The previous devices had hardware interlocks. So if the software glitched, it was just an annoying glitch - nobody got zapped. But mature software gets trusted, so they removed the hardware interlock as redundant. And then the annoying glitches became fatal. Total miscommunication. The people cost-reducing the hardware interlock only saw mature, trustworthy software. The people living with the glitches only saw them as annoying, but harmless. And then, disaster.
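                          That chain of reasoning can be made concrete with a toy model (illustrative Python, not the actual control logic): the same software fault is harmless while an independent hardware interlock is fitted, and becomes an overdose once it is removed.

```python
# Toy defense-in-depth model: the buggy software may request the beam
# even when the machine is set up wrong; only an independent hardware
# interlock turns that fault into a harmless error instead of a dose.

def software_requests_beam(mode_ok: bool) -> bool:
    # The (buggy) software requests the beam regardless of setup.
    return True

def hardware_interlock_permits(mode_ok: bool) -> bool:
    # Independent physical check: beam only if magnets and mode agree.
    return mode_ok

def beam_fires(mode_ok: bool, interlock_fitted: bool) -> bool:
    request = software_requests_beam(mode_ok)
    if interlock_fitted:
        return request and hardware_interlock_permits(mode_ok)
    return request  # Therac-25: software alone decides

# Same software bug, two different machines:
print(beam_fires(mode_ok=False, interlock_fitted=True))   # False: annoying glitch
print(beam_fires(mode_ok=False, interlock_fitted=False))  # True: fatal overdose
```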

                          • haddonist 184 days ago

                            Well There's Your Problem podcast, Episode 121: Therac-25

                            https://www.youtube.com/watch?v=7EQT1gVsE6I

                            • vemv 184 days ago

                              My (tragically) favorite part is, from wikipedia:

                              > A commission attributed the primary cause to generally poor software design and development practices, rather than singling out specific coding errors.

                              Which to me reads as "this entire codebase was so awful that it was bound to fail in some or other way".

                              • koverstreet 184 days ago

                                A lot of people draw the wrong conclusions from Therac-25 today; becoming overly process-driven can be a huge problem for software quality, because the processes have to be the right processes, and once processes are in place people have a natural tendency to defer to them and suspend their own judgement.

                                That gets actively dangerous; a lot of more recent safety mishaps are more of the variety of "processes were followed, but things went hilariously off the rails and no one noticed and spoke up".

                                Culture and expertise matter just as much if not more, especially today now that we all (in theory) should understand source control, testing, safer languages, etc.

                                I think Admiral Rickover's methods apply just as much today, and applying that kind of thinking would fill major gaps in a lot of organizations - he emphasized good communication, a sense of responsibility, and thinking on your feet, and his safety record is unmatched.

                                I think aviation also approaches process a bit better - by having much of it be more informal, less rigid checklists, it doesn't encourage people to suspend judgement so much.

                                There's also the Tankship Tromedy, which really emphasizes the engineering legwork of just chasing down, understanding and fixing every last failure mode you can find.

                                https://www.dieselduck.info/library/08%20policies/2006%20The...

                                • cantrevealname 184 days ago

                                  > All of this software, from the individual processes to the OS itself, were the work of a single software developer. They left AECL in 1986, and no one has ever revealed their identity.

                                  I bet some readers are thinking that the developer who caused this tragedy retired with the millions he earned, maybe sailed his yacht to his Caribbean mansion. But the $300K FAANG salaries and multi-million stock options for senior developers represent only the last decade or two. In the 1980s, developers were paid poorly and commanded little respect. Back then, the heroes in tech companies that sold expensive devices were the salesmen. The commission on the sale of a single Therac-25 probably exceeded the developer's salary.

                                  All of the following would indicate that this developer, no matter how senior or capable, was still a low-paid schlub:

                                  - It's Canada, so automatically 20% lower salaries than in the U.S. (AECL is in Canada, so it's a good bet that the developer was Canadian.)

                                  - It's the 1980s, so pre-web, pre-smartphones, pre-Google/Amazon, and developers had little recognition and low demand.

                                  - It's government, known to pay poorly for developers. (AECL is a government-owned corporation.)

                                  - It's mostly embedded software. Even though embedded software can be incredibly complex and life-critical, it's the least visible, so it's among the lower paid areas of software engineering (even today).

                                  For 1986, I would put his salary at $30-50K Canadian, or converted to U.S. dollars at that time would be $26-43K U.S., and inflation adjusted would be $78-129K U.S. today. And no stock options.

                                  • w10-1 184 days ago

                                    This is not the example readers need to understand, because the failures were so rudimentary and systemic that it seems "good process" is the answer.

                                    Having written and validated both FDA and CLIA software, I'd suggest that process is never sufficient.

                                    Plenty of well-meaning people will create and follow incomplete plans and hand-wave away issues when they sign off -- particularly people who gravitate towards rule-based, formulaic work in a hierarchy.

                                    You need people both capable of and willing to seriously question whether proof is really proof, and who will stand up for some random patient in the distant future over their boss and colleagues on a deadline -- and yet they cannot be oppositional or egotistical, and must have deep insight into the subject matter.

                                    It's really, really hard to find those people.

                                    • softwaredoug 184 days ago

                                      Safety problems are almost never about one evil / dumb person and frequently involve confusing lines of responsibility.

                                      Which makes me very nervous about AI-generated code and people who don't claim human authorship. Scapegoating the AI for a bug that creeps in isn't gonna cut it in a safety situation.

                                      • csours 184 days ago

                                        To me, the Therac incident is the poster child for a category I call 'context change error'.

                                        Some of the controls were 'born' in a world of hardware interlocks, and so the engineers used the frame of mind where hardware interlocks exist.

                                        Some time later, the interlocks were replaced with software controls. Since everything had worked before, all the software had to do was what worked before.

                                        But it is VERY difficult to challenge all of your assumptions about what "working" means.

                                        ---

                                        This is also a good reminder that work is done by people and teams, not corporations. That is - just because somebody knows the fine details, that does not mean that the corporation knows the fine details.

                                        • siva7 184 days ago

                                          > With AECL's continued failure to explain how to test their device

                                          They can't. There was a single developer, he left, no tests existed, no one understood the mess to confidently make changes. At this point you can either lie your way through the regulators or scrap the product altogether.

                                          I've seen these kinds of devs and companies running their software in regulated industries, just like in the Therac incident, except now we're in the year 2025. I left because I understood it's a criminal charge waiting to happen.

                                          • ChrisMarshallNY 184 days ago

                                            I worked for hardware manufacturers for most of my career, as a software guy.

                                            In my experience, hardware people really dis software. It's hard to get them to take it seriously.

                                            When something like this happens, they tend to double down on shading software.

                                            I have found it very, very difficult to get hardware people to understand that software has a different ruleset and workflow, from hardware. They interpret this as "cowboy software," and think we're trying to weasel out of structure.

                                            • smarks 183 days ago

                                              I believe the definitive analysis of the Therac-25 incident was written by Nancy Leveson, first in IEEE Computer,[1] and later as an appendix of her book.[2] The appendix is freely available as a PDF on the web [3][4] and probably other places. Many people here are asking questions about what happened and how it came about. The answers to many of these questions can be found there. I strongly recommend that anyone who is serious about safety and wants to learn more about this incident read Leveson’s analysis.

                                              [1]: N. G. Leveson and C. S. Turner, "An investigation of the Therac-25 accidents," in Computer, vol. 26, no. 7, pp. 18-41, July 1993.

                                              [2]: Nancy Leveson. Safeware: System Safety and Computers. Addison-Wesley, 1995.

                                              [3]: http://sunnyday.mit.edu/papers/therac.pdf

                                              [4]: https://web.mit.edu/6.033/2014/wwwdocs/papers/therac.pdf

                                              • MerrimanInd 184 days ago

                                                Every mechanical engineer educated in the USA knows the name of two famous collapses: the Tacoma Narrows Bridge and the Hyatt Regency walkway in Kansas City, MO. With an engineering ethics class being part of nearly every undergrad curriculum, these are two of the classic examples for us. I'm curious: do software engineers learn stories like the Therac-25 in their degrees?

                                                • rokkamokka 184 days ago

                                                  I was taught this incident in university many years ago. It's undeniably an important lesson that shouldn't be forgotten

                                                  • rvz 184 days ago

                                                    We're more likely to get an incident like this very quickly if we continue with the cult of 'vibe-coding' and throwing basic software engineering principles out the window, as I said before. [0]

                                                    Take this post-mortem [1] as a great warning; it highlights exactly what can go horribly wrong when an LLM misreads comments.

                                                    What's even scarier is that each time I stumble across a freshly minted project on GitHub with a considerable amount of attention, not only is it 99% vibe-coded (very easy to detect), but it completely lacks any tests.

                                                    It makes me question whether the person prompting the code even understands how to write robust, battle-tested software.

                                                    [0] https://news.ycombinator.com/item?id=44764689

                                                    [1] https://sketch.dev/blog/our-first-outage-from-llm-written-co...

                                                    • rendaw 184 days ago

                                                      Reading about this, my current company sounds exactly the same. And the one before it, and the one before that.

                                                      Critical issues happen with customers, blame gets shifted, a useless fix is proposed in the post-mortem and implemented (add another alert to the waterfall of useless alerts we get on call), and we continue to do ineffective testing. Procedural improvements are rejected by the original authors, who have since been promoted, want to keep feeling like they made something good, and are now in a position to enforce that fiction.

                                                      So IMO the lesson here isn't that everyone should focus on culture and process; it's that you won't have the right culture and process, and (apparently) only laws and regulation can overcome that lack.

                                                      • mellosouls 184 days ago

                                                        TIL TheDailyWTF is still active. I'd thought it had settled to greatest hits only some years ago.

                                                        • autonomousErwin 184 days ago

                                                          This reminds me of the 2003 Belgian election that was impossibly skewed by a supernova light years away sending charged particles which managed to get through our atmosphere (allegedly) and flip a bit. Not the only case where that's happened.

                                                          • linohh 184 days ago

                                                            In my university this case was (and probably still is) subject of the first lecture in the first semester. A lot to learn here and one of the prime examples how the DEPOSE model [Perrow 1984] works for software engineering.

                                                            • salynchnew 184 days ago
                                                              • Forgret 184 days ago

                                                                What surprised me most was that only one developer was working on such an unpredictable technology; I'd think you need at least five developers just to be able to discuss options.

                                                                • 0xDEAFBEAD 184 days ago

                                                                  >any bugs we see would have to be transient bugs caused by radiation or hardware errors.

                                                                  Can't imagine that radiation might be a factor here...

                                                                  • SirMaster 184 days ago

                                                                    The question I have is why was the hardware capable of delivering a fatal dose like this. Is that actually ever even a usable output for some legitimate reason?

                                                                    If not, why not hardware limit the power input to the machine, so even if the software completely failed, it would not be physically capable of delivering a fatal dose like this?

                                                                    • onewheeltom 183 days ago

                                                                      The manufacturer of the Therac-25, AECL, did not share customer incident reports with other customers when patients were injured. So, the hospitals believed that their incidents were isolated. This may have been legal, but was highly unethical.

                                                                      • snkline 184 days ago

                                                                        I was kinda shocked by the results of his informal survey, because this was a big focus of my ethics course in college. I guess a lot of developers either didn't get a CS degree, or their degree program didn't involve an ethics course.

                                                                        • amelius 184 days ago

                                                                          > The Therac-25 was the first entirely software-controlled radiotherapy device.

                                                                          This says it all.

                                                                          • Duanemclemore 184 days ago

                                                                            There's an excellent episode of Well There's Your Problem about Therac-25.

                                                                            https://youtu.be/7EQT1gVsE6I

                                                                            • NoSalt 184 days ago

                                                                              Is there a way to get the "gist" of the article, the lesson to be learned without reading the full article? I got to the screaming part and couldn't read any more.

                                                                              • armcat 184 days ago

                                                                                Therac-25 was part of the mandatory "computer ethics" course at my uni, as part of the Computer Science programme, circa early 2000s.

                                                                                • tedggh 184 days ago

                                                                                  TL;DR

                                                                                  The Therac-25 was a radiation therapy machine built by Atomic Energy of Canada Limited in the 1980s. It was the first to rely entirely on software for safety controls, with no hardware interlocks. Between 1985 and 1987, at least six patients received massive overdoses of radiation, some fatally, due to software flaws.

                                                                                  One major case in March 1986 at the East Texas Cancer Center involved a technician who mistyped the treatment type, corrected it quickly, and started the beam. Because of a race condition, the correction didn’t fully register. Instead of the prescribed 180 rads, the patient was hit with up to 25,000 rads. The machine reported an underdose, so staff didn’t realize the harm until later.

                                                                                  Other hospitals reported similar incidents, but AECL denied overdoses were possible. Their safety analysis assumed software could not fail. When the FDA investigated, AECL couldn’t produce proper test plans and issued crude fixes like telling hospitals to disable the “up arrow” key.

                                                                                  The root problem was not a single bug but the absence of a rigorous process for safety-critical software. AECL relied on old code written by one developer and never built proper testing practices. The scandal eventually pushed regulators to tighten standards. The Therac-25 remains a case study of how poor software processes and organizational blind spots can kill—a warning echoed decades later by failures like the Boeing 737 MAX.

                                                                                  • fogzen 184 days ago

                                                                                    Almost. It's a process problem, but the process is a step above the organization: it's the socio-economic process that incentivizes these problems. Capitalism is the process problem; that's the process that introduces the problem into the organization. Without the government regulators making them test, nothing would have been done at all, because the organization exists within a framework that pits it against safety. Safety is at odds with what the organization is tasked to do within the process it exists in.

                                                                                    • voxadam 184 days ago

                                                                                      (2021)

                                                                                      • napolux 184 days ago

                                                                                        The most deadly bug in history. If you know of any other deadly bugs, please share! I love these stories!

                                                                                        • darepublic 183 days ago

                                                                                          Sad story. Gotta blame Canada for this crap. The elements of this story: hospitals, a janky attempt at innovation, passive-aggressive denials from otherwise timid, demure Canadians, cold grey bureaucracy. It all reminds me of the not-so-great north. To make it perfect, the technicians were probably sipping their Tim Hortons slop at the time.

                                                                                          • MilyMason2 183 days ago

                                                                                            [flagged]

                                                                                            • auggierose 184 days ago

                                                                                              Wondering if that "one developer" is here on HN.