Restart Protection: The AGC Software That Could Reboot Itself Mid-Flight

How the Apollo Guidance Computer's restart system preserved critical state across software restarts—a fault-tolerance design that kept computing through failures

Matt Dennis

The Apollo Guidance Computer could restart itself. Not reboot from scratch, not reload from tape, not cycle power and hope for the best—it could detect that something had gone wrong, terminate the current computation, and resume critical operations from a known-good state, all in about 40 milliseconds. The spacecraft never stopped flying. The guidance equations picked up where they left off. The Digital Autopilot kept firing jets. The crew might see a brief flicker on the DSKY and hear a program alarm tone, and then the computer was back, still tracking the trajectory, still pointed at the Moon.


This capability—restart protection—was one of the most sophisticated pieces of fault-tolerant software design in the 1960s, and it solved a problem that most software of that era didn’t even acknowledge: what happens when your code crashes and you can’t afford to start over?


Why Restarts Were Necessary

The AGC operated in an environment that could disrupt computation in ways no amount of careful programming could prevent. Hardware transients (voltage spikes, radiation-induced bit flips, electromagnetic interference from the spacecraft’s own systems) could corrupt a memory location, scramble a counter, or derail the program counter to a nonsensical address. Software errors, despite extensive testing, could not be ruled out entirely in a program of this complexity, occupying up to 36,864 words of fixed memory.


In a ground-based computer, the response to a crash was to call the operator, diagnose the problem, and restart from a checkpoint. In a spacecraft, there was no operator in that sense. The crew could press the DSKY’s RSET button to clear alarm indicators, but they couldn’t debug software in flight. And the computer couldn’t afford the time to reload from scratch even if a full reload were possible—during powered descent, a two-second interruption in guidance computations could result in a trajectory error that exceeded the vehicle’s ability to correct.


The AGC needed to recover from arbitrary failures while losing as little computational state as possible, and it needed to do it fast enough that the guidance and control functions never perceived the interruption. The restart protection system was designed to make this happen.


The Fresh Start and the Restart

The AGC had two levels of recovery. A “fresh start” was the more severe: it cleared the entire Executive job table, terminated all Waitlist tasks, reset all display routines, and returned the computer to a known idle state. The current program was terminated. The crew would need to manually reinitiate whatever program had been running. This was the nuclear option—used when the computer’s state was so corrupted that no partial recovery was safe.


A “restart” was more surgical. It preserved the major mode (the currently running program number), preserved certain critical variables that had been flagged as “restart-protected,” and attempted to resume the running program from a known checkpoint rather than from scratch. The Executive was reinitialized, Waitlist tasks were rebuilt from their restart data, and the program resumed execution at a point where it could re-derive any lost intermediate values from the protected state.


The distinction mattered. During powered descent, a fresh start would terminate P63 or P64, and the crew would have to manually restart the descent program while the LM continued hurtling toward the Moon. A restart, by contrast, would resume P63 or P64 from its last restart point, recompute any lost intermediate values from the protected navigation state, and continue guiding the descent. The crew would see a program alarm on the DSKY, Mission Control would evaluate the alarm code, and the computer would already be back to work.


Restart Groups and Protection Flags

The mechanism that made restarts work was the restart group system. Every critical computation in the AGC software was organized into restart groups—logical sections of code that could be restarted independently. Each group had a restart entry point: an address in the code where execution could safely resume after a restart, provided certain state variables had been preserved.


When a program entered a restart group, it wrote a group number and a phase number into a protected area of erasable memory. The phase number identified how far the computation had progressed within the group. If a restart occurred, the restart handler read these group and phase numbers and used them to determine where each interrupted computation should resume.
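The group-and-phase bookkeeping can be pictured with a small model. This is a minimal sketch in Python, not AGC code: the `RestartTable` class, the `ENTRY_POINTS` mapping, and the routine names are all hypothetical, and it assumes a phase of zero marks an inactive group.

```python
# Illustrative model only: class, method, and routine names are hypothetical,
# not the AGC's actual table layout.
class RestartTable:
    def __init__(self):
        self.phases = {}  # group number -> phase number; 0 means inactive

    def set_phase(self, group, phase):
        """Record how far the computation in `group` has progressed."""
        self.phases[group] = phase

    def active_groups(self):
        """Return (group, phase) pairs that a restart would need to resume."""
        return sorted((g, p) for g, p in self.phases.items() if p != 0)


# Hypothetical restart entry points, keyed by (group, phase).
ENTRY_POINTS = {
    (2, 1): "autopilot_cycle",
    (4, 1): "load_targets",
    (4, 3): "compute_thrust_command",
}


def resume_points(table):
    """Map each active (group, phase) pair to the routine where work resumes."""
    return [ENTRY_POINTS[key] for key in table.active_groups()]


table = RestartTable()
table.set_phase(2, 1)
table.set_phase(4, 1)
table.set_phase(4, 3)  # group 4 has advanced to phase 3
table.set_phase(5, 0)  # group 5 is finished, so nothing resumes for it
print(resume_points(table))
```

In this model only the latest phase of each group matters: group 4 resumes at its phase 3 entry point, and group 5 is skipped entirely.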


The programmer’s responsibility was to ensure that at each phase boundary—each point where the phase number was updated—the computation’s critical state had been saved to restart-protected variables. If the restart handler jumped to phase 3 of group 4, the code at that restart point had to be able to reconstruct the full computational state from the protected variables alone, without depending on any scratch values that might have been in erasable memory at the moment of failure.


This was a demanding programming discipline. Every AGC routine that performed guidance, navigation, or control computations had to be written with restart awareness. The programmer had to identify which intermediate values were expendable (could be recomputed) and which were irreplaceable (must be protected). The protected values were stored in specific erasable memory locations designated as restart-safe, and the phase number was advanced only after these values had been committed.


The restart protection code was not generated automatically. There was no compiler support, no framework, no abstraction layer. Each programmer manually inserted the INHINT (inhibit interrupts) instruction before writing restart data, updated the phase and group numbers by calling the phase-change routine (TC PHASCHNG in the assembly code), and ensured the state was consistent before enabling interrupts again. A mistake such as failing to protect a critical variable, updating the phase number before saving the state, or protecting the wrong values could result in a restart that resumed with inconsistent data and produced wrong guidance commands.
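Why the order of those two writes matters can be shown with a toy fault injection. The sketch below is a Python illustration under stated assumptions: `ProtectedMemory`, `checkpoint`, and the `delta_v` variable are hypothetical, and interrupt inhibition is not modeled. It only compares "save state, then advance phase" against the reverse order when a restart lands between the two writes.

```python
class SimulatedRestart(Exception):
    """Models a restart striking between two memory writes."""


class ProtectedMemory:
    """Hypothetical stand-in for restart-safe erasable memory."""

    def __init__(self):
        self.values = {}  # restart-protected variables
        self.phase = 1    # phase number of one restart group


def checkpoint(mem, new_values, new_phase, state_first=True, fail_after=None):
    """Write protected values and the phase number in the chosen order.

    `fail_after` (1 or 2) injects a simulated restart after that write.
    """
    writes = [
        lambda: mem.values.update(new_values),
        lambda: setattr(mem, "phase", new_phase),
    ]
    if not state_first:
        writes.reverse()
    for number, write in enumerate(writes, start=1):
        write()
        if fail_after == number:
            raise SimulatedRestart


def is_consistent(mem):
    """Phase 2 promises that `delta_v` is already in protected memory."""
    return mem.phase < 2 or "delta_v" in mem.values


def survives_restart(state_first):
    """Inject a restart after the first write and check what is left behind."""
    mem = ProtectedMemory()
    try:
        checkpoint(mem, {"delta_v": 12.5}, 2, state_first, fail_after=1)
    except SimulatedRestart:
        pass
    return is_consistent(mem)
```

With the state written first, a restart between the writes leaves the group at its old phase with extra but harmless data. With the phase written first, the group claims phase 2 while its inputs are missing, which is the inconsistent resume the discipline was meant to rule out.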


The Restart Handler

When a restart was triggered—by a hardware trap, a watchdog timer, or an explicit software-initiated restart—the restart handler took control. Its sequence was:


  1. Inhibit interrupts. Prevent any new interrupts from firing during the recovery process.

  2. Reset hardware. Clear the output channels that controlled RCS jets, engine gimbal commands, and display data. This was a safety measure: if the computer was confused about what commands it had issued, zeroing the outputs prevented a phantom jet firing or a stuck-on engine command from persisting through the restart.

  3. Reinitialize the Executive. Clear the job table. All running jobs were terminated. Any job that was critical would be recreated by the restart logic.

  4. Reinitialize the Waitlist. Clear all pending timed tasks. Any time-critical task that needed to resume would be rescheduled by the restart logic.

  5. Scan the restart group table. For each restart group that had a non-zero phase number, the handler created a new Executive job or Waitlist task to resume that computation at the appropriate restart entry point. The phase number determined which entry point to use.

  6. Restart the display. Reinitialize the DSKY routines and restore the major mode display. The PROG indicator continued to show the current program number (which was itself a restart-protected variable).

  7. Enable interrupts and resume scheduling. The Executive began running the recreated jobs in priority order. The Waitlist began servicing recreated tasks. The computer was back.

The entire restart sequence completed in approximately 40 milliseconds. From the guidance software’s perspective, the restart looked like a brief hiccup—the current computation cycle might be delayed or restarted, but the critical state variables (the navigation state vector, the current program parameters, the guidance targets) were intact in their protected memory locations.
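The handler’s sequence can be sketched as a single function over a simplified machine state. This is a Python illustration rather than a reconstruction of the flight code: the `Computer` class, its fields, the channel names, and the entry-point table are hypothetical, and priorities and delays are arbitrary numbers.

```python
class Computer:
    """Hypothetical, heavily simplified model of the state a restart touches."""

    def __init__(self):
        self.interrupts_enabled = True
        self.output_channels = {"rcs_jets": 0b1010, "engine_gimbal": 3}
        self.jobs = []         # Executive job table: (priority, routine)
        self.waitlist = []     # pending timed tasks: (delay, routine)
        self.phase_table = {}  # restart group -> phase; 0 means inactive
        self.protected = {"major_mode": 63}
        self.display = {}


def restart(computer, entry_points):
    # 1. Inhibit interrupts.
    computer.interrupts_enabled = False
    # 2. Reset hardware outputs so no stale command persists.
    computer.output_channels = dict.fromkeys(computer.output_channels, 0)
    # 3. Reinitialize the Executive.
    computer.jobs = []
    # 4. Reinitialize the Waitlist.
    computer.waitlist = []
    # 5. Recreate work for every restart group with a non-zero phase.
    for group, phase in sorted(computer.phase_table.items()):
        if phase == 0:
            continue
        kind, value, routine = entry_points[(group, phase)]
        target = computer.jobs if kind == "job" else computer.waitlist
        target.append((value, routine))
    # 6. Restore the display from the protected major mode.
    computer.display = {"PROG": computer.protected["major_mode"]}
    # 7. Enable interrupts; jobs run highest priority first.
    computer.jobs.sort(reverse=True)
    computer.interrupts_enabled = True


agc = Computer()
agc.jobs = [(30, "guidance"), (5, "display_update")]
agc.phase_table = {2: 1, 4: 3, 6: 0}
entry_points = {
    (2, 1): ("task", 2, "dap_jet_timing"),
    (4, 3): ("job", 30, "guidance_phase_3"),
}
restart(agc, entry_points)
```

In this sketch the `display_update` job has no active restart group, so it is not recreated. The guidance job and the autopilot timing task come back from their recorded phases.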


What Got Protected

Not everything could survive a restart, and not everything needed to. The restart protection system was selective by design—protecting too much data would have consumed precious erasable memory, and protecting too little would have made restarts useless.


The navigation state vector—the spacecraft’s position and velocity in three-dimensional space—was always restart-protected. This was the single most important piece of data in the AGC’s memory, computed from hours or days of accelerometer integration and star sightings. Losing the state vector would mean losing the spacecraft’s knowledge of where it was, requiring a lengthy re-derivation from ground tracking data.


Guidance targets—the desired end state for the current maneuver, whether a landing site, an orbit insertion point, or a reentry corridor—were restart-protected. These were parameters that had been laboriously computed and uplinked, and losing them would require another uplink from Mission Control.


The current program number and the DAP configuration parameters were restart-protected. After a restart, the computer needed to know which program it was running and how the autopilot was configured without requiring crew intervention.


What was not protected: intermediate computational values that could be re-derived from the protected state. If the guidance algorithm was halfway through a two-second computation cycle when the restart occurred, the intermediate values from that cycle were lost. The restarted code simply restarted the computation cycle from the beginning, using the protected state vector and targets as inputs. The result was a single delayed guidance cycle—two seconds lost, then back on track.


Display data was not individually protected. The DSKY showed whatever the restarted display routines generated from the current protected state. The crew might see the display blank briefly during the restart, then repopulate with correct values. The flashing VERB-NOUN display might reset to the program’s default display, requiring the crew to re-request any non-standard display they’d been monitoring.


Restarts in Flight: The 1202 Revisited

The restart protection system’s finest hour was the Apollo 11 powered descent. The 1202 and 1201 program alarms triggered restarts—not crashes, not reboots, but the controlled restart sequence described above.


When the Executive overflowed because the rendezvous radar interrupts were consuming excess processing capacity, the restart handler fired. It cleared the job table, scanned the restart groups, and recreated the critical jobs: the powered descent guidance computation (high priority), the Digital Autopilot (high priority), and the navigation integration (high priority). Low-priority jobs—display updates, some telemetry tasks—were not in active restart groups at the moment of the overflow and were simply not recreated. They were shed.


The guidance computation restarted from its last phase checkpoint. The protected state vector was intact. The guidance targets were intact. The DAP configuration was intact. The computation restarted, completed its cycle, and issued new guidance commands. The Waitlist tasks for the DAP’s jet timing were recreated by the DAP’s restart entry point. The jets fired on schedule.


From the trajectory’s perspective, the restart cost one or two missed guidance cycles—two to four seconds of computation gap—which the guidance algorithm easily compensated for on the next cycle. The LM’s trajectory was continuous. The landing guidance never diverged. The crew saw PROG alarms on the DSKY and heard the alarm tone, but the computer was already recovered before anyone could react.


This happened five times during the Apollo 11 descent. Five restarts, five recoveries, five times the restart protection system proved its design. The landing was never in jeopardy from the computer’s perspective—the restarts were the system working as designed, gracefully shedding non-critical work to keep the critical path alive.


The Programmer’s Burden

Restart protection imposed a significant burden on every AGC programmer. Writing a routine wasn’t enough—you had to write a routine that could be interrupted at any point, have its entire context discarded, and resume correctly from a checkpoint that might be dozens of instructions behind. Every critical variable had to be identified and committed to protected storage before the phase number was advanced. Every restart entry point had to be verified: if execution starts here, with these protected values, does the code produce correct results?


The MIT programmers who built the AGC software—the Luminary and Colossus teams—internalized this discipline. Code reviews focused heavily on restart correctness. Simulation testing included deliberate restarts injected at random points during critical computations. The test philosophy was simple: if you claim your routine is restart-protected, we will restart it at every instruction boundary and verify the results are still correct.
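The spirit of that test philosophy can be shown with a small fault-injection harness. The Python sketch below is an assumption-laden illustration, not the MIT test tooling: the two-phase `routine`, the `tick` callback, and the `run_with_restart_at` helper are hypothetical, and restarts are injected at step boundaries in the model rather than at true instruction boundaries.

```python
class Restart(Exception):
    """Models a restart that discards all scratch values."""


def routine(protected, tick):
    """Hypothetical two-phase computation: root of the sum of squares.

    `tick()` is called at each step boundary and may raise Restart.
    """
    if protected["phase"] == 1:
        total = 0                    # scratch value, lost on restart
        for x in protected["inputs"]:
            total += x * x
            tick()
        protected["sum_sq"] = total  # commit the state first...
        tick()
        protected["phase"] = 2       # ...then advance the phase
        tick()
    if protected["phase"] == 2:
        protected["result"] = protected["sum_sq"] ** 0.5
        tick()
        protected["phase"] = 0       # finished; the group goes inactive
    return protected["result"]


def run_with_restart_at(step):
    """Run the routine, injecting one restart at boundary `step` (0 = none)."""
    protected = {"phase": 1, "inputs": [3, 4]}
    count = 0

    def tick():
        nonlocal count
        count += 1
        if count == step:
            raise Restart

    while True:
        try:
            return routine(protected, tick)
        except Restart:
            continue  # scratch is gone; resume from the protected phase


baseline = run_with_restart_at(0)
results = [run_with_restart_at(step) for step in range(1, 6)]
```

Every entry in `results` should match `baseline`. A routine that advanced its phase before committing `sum_sq` would fail this check at the boundary between those two writes.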


The restart protection annotations in the AGC source code, visible in the reconstructed listings now available online, show the density of this engineering. TC PHASCHNG calls (TC is the transfer-control instruction; PHASCHNG is the phase-change routine) appear throughout the guidance and navigation routines, each one a checkpoint that says: “everything up to this point has been committed to protected storage; a restart from here will produce correct results.”


Some routines had a dozen or more phase changes. The powered descent guidance routine was particularly dense with restart protection, reflecting its criticality and the high probability that a restart might occur during the computation-heavy descent phase. The P63 braking guidance code, the P64 approach guidance code, and the P66 rate-of-descent code all included restart protection at multiple phase boundaries.


A 40-Millisecond Safety Net

Modern fault-tolerant systems use techniques that would be recognizable to the AGC designers: checkpointing, journaling, transaction logs, process supervision. Erlang’s “let it crash” philosophy—where processes fail fast and are restarted by supervisors from a clean state—echoes the AGC’s restart approach. The difference is that modern systems have the luxury of memory, storage, and redundant hardware. The AGC did it in 2,048 words of RAM, with no disk, no backup processor, and a 40-millisecond recovery window.


The restart protection system flew on every Apollo mission without modification from Colossus to Luminary, from Apollo 7 to Apollo 17. It handled hardware transients, executive overflows, and programming edge cases that slipped through testing. It never failed to recover a restart-protected computation correctly. It never lost a navigation state vector. It never left the spacecraft without guidance.


The AGC’s restart capability was not a safety feature bolted on after the fact. It was designed into the software architecture from the beginning, woven into the programming model at every level. Every line of critical code was written with the assumption that it might be interrupted and restarted at any moment. Every protected variable was a commitment: this value matters enough that it must survive a failure.


That assumption—that failure is not a possibility to be prevented but a certainty to be survived—is the most enduring lesson of the Apollo Guidance Computer. The restart protection system didn’t prevent the 1202 alarms on Apollo 11. It made them survivable. And in doing so, it proved that software can be built to be resilient in the face of the unexpected, if you design for it from the very first line of code.