IBM System/360 Model 91

Source-of-truth research file for NWP-history blog post on Tomasulo’s algorithm and dynamic instruction scheduling. All claims should be cited; flagged disagreements are marked [DISPUTED], and dates that are widely cited but uncertain are marked [CHECK].

Summary

The IBM System/360 Model 91 was IBM’s flagship scientific machine of the late 1960s: an aggressively pipelined design with out-of-order execution in its floating-point unit, built explicitly to chase the CDC 6600. It issued at most one instruction per cycle, so it was not superscalar in the modern sense. It was a commercial disappointment (only ~15 ever built) but an architectural watershed: the floating-point unit was the first hardware implementation of Tomasulo’s algorithm, introducing reservation stations, the Common Data Bus, and hardware register renaming. Those techniques lay largely dormant in commercial CPUs for nearly thirty years until Intel’s P6 (Pentium Pro, 1995) and its mid-1990s contemporaries revived them; they are now universal in high-performance microprocessors.

1. Machine architecture and engineering

1.1 Dates

Announced: 1964. Wikipedia and IBM archives place this in 1964 (the user’s note of November 1964 is plausible but not specifically verified in sources consulted; treat the month as [CHECK]). Source: Wikipedia, “IBM System/360 Model 91.”
Released / officially shipped: January 1966 announcement of orderable configurations; first delivery to NASA Goddard Space Flight Center in October–November 1967 (nine months late), with regular operations beginning January 1968. Sources: Wikipedia “IBM System/360 Model 91”; Padegs (1981); Smotherman compendium.
Orders closed: Late 1968 — IBM stopped accepting Model 91 orders, partly because of the unfinished ACS effort and partly because the M65/M75 line and impending Model 195 made the Model 91 commercially awkward.
End of production / withdrawn: Production was very limited. The successor IBM System/360 Model 195 was announced 20 August 1969, first delivered in 1971, and withdrawn from marketing on 9 February 1977. Source: Wikipedia “IBM System/360 Model 195.”

1.2 Performance

CPU cycle time: 60 nanoseconds (≈ 16.7 MHz). Sources: Anderson, Sparacio & Tomasulo (1967); Columbia history page.
Peak instruction issue rate: one instruction per cycle (in-order issue, out-of-order completion). Quoted peak: “up to 16.6 MIPS.” Source: Wikipedia.
Sustained MFLOPS: roughly 16–17 MFLOPS peak in 64-bit floating point on tight kernels (about one FP result per 60 ns cycle); ~3–5 MFLOPS on real workloads is the figure usually quoted for sustained performance. [CHECK] Sources disagree on sustained rates. Hennessy & Patterson are cited as putting the Model 91 at roughly 4× the CDC 6600 in peak FP, which does not match the ~3× given in the next entry; reconcile against the text. Real-world figures depend on memory access patterns.
vs CDC 6600: The Model 91 was designed as the deliberate competitive answer to the 6600 (Cray’s 1964 machine, ~10 MIPS, 100 ns cycle, 10 functional units with scoreboard). On peak FP throughput the 91 was nominally ~3× the 6600; in practice the 6600 delivered comparable or better throughput on many real codes because its memory system was simpler and its compiler stack more mature. Tomasulo’s algorithm gave the 91 a theoretical advantage on loops with WAR/WAW dependences that the 6600’s scoreboard had to stall on, but the 91’s imprecise interrupts and software immaturity often handed wins back. Sources: Thornton, Design of a Computer: The Control Data 6600 (1970); Hennessy & Patterson, Computer Architecture: A Quantitative Approach, Chapter 3 and Appendix C in editions 4–6.

1.3 Memory hierarchy

Main memory: IBM 2395 Processor Storage. Configurations: 91K (2 MB), 91KK (4 MB), 91L (4 MB), 91KL (6 MB).
Memory cycle time: 780 ns per bank. To bridge the ~13:1 ratio between the 60 ns CPU cycle and the 780 ns memory cycle, main memory was interleaved up to 16 ways. The NASA Goddard Model 91 ran with 2,097,152 bytes (2 MB) of main memory interleaved 16 ways. Source: Boland, Granito, Marcotte, Messina & Smith, “The IBM System/360 Model 91: Storage System,” IBM Journal of Research and Development 11(1):54–68, January 1967.
Explicit goal: Memory-latency hiding was the central design driver. The instruction unit pre-fetched, decoded, and dispatched well ahead of execution; the floating-point unit’s reservation stations decoupled instruction issue from memory operand availability. The whole architecture was conceived as a way to tolerate the 13× speed gap between CPU and memory rather than try to close it.
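The interleaving figures above can be checked with a few lines of arithmetic. This is an idealization assumed for illustration (requests spread perfectly across banks, no bank conflicts, no queuing); the function names are hypothetical, and the only inputs taken from this file are the 60 ns CPU cycle and the 780 ns bank cycle.

```python
import math

CPU_CYCLE_NS = 60      # Model 91 CPU cycle time (from §1.2)
MEMORY_CYCLE_NS = 780  # per-bank core memory cycle time (from §1.3)


def banks_needed(cpu_cycle_ns: float, memory_cycle_ns: float) -> int:
    """Smallest interleave factor that could, ideally, accept one new
    request every CPU cycle (assumes perfectly spread, conflict-free access)."""
    return math.ceil(memory_cycle_ns / cpu_cycle_ns)


def request_interval_ns(memory_cycle_ns: float, ways: int) -> float:
    """Average spacing between accepted requests when `ways` banks are
    used round-robin with no bank conflicts (an idealization)."""
    return memory_cycle_ns / ways


print(f"{MEMORY_CYCLE_NS / CPU_CYCLE_NS:.1f}")      # → 13.0
print(banks_needed(CPU_CYCLE_NS, MEMORY_CYCLE_NS))  # → 13
print(request_interval_ns(MEMORY_CYCLE_NS, 16))     # → 48.75
```

Under this idealization 13 banks would just keep pace with the CPU, and 16 ways bring the average request spacing (48.75 ns) under the 60 ns cycle, leaving some headroom for the bank conflicts that real access streams produce.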

1.4 Floating-point unit organization

The Model 91 FPU is the canonical early example of decoupled, dynamically scheduled functional units. The published structure (Tomasulo 1967; S. F. Anderson, Earle, Goldschmidt & Powers 1967) is:

Floating-point Add unit: pipelined, with 3 reservation stations (A1, A2, A3).
Floating-point Multiply/Divide unit: 2 reservation stations (M/D1, M/D2). Multiply is pipelined; divide is iterative (not pipelined).
Floating-point operation stack (FLOS): holds up to 8 instructions waiting for issue.
Store data buffers: 3 buffers for results destined for memory.
Load buffers: 6 floating-point buffers (FLBs) for operands fetched from memory.
Floating-point registers: the four 64-bit architectural FP registers of S/360, each extended with a tag field (and busy bit) used for renaming.

(User’s note “2 add units + 2 mul/div units” is incorrect: there is one add unit with three RSes and one multiply/divide unit with two RSes. The “two multipliers” myth probably comes from the multiply unit being internally pipelined to accept a new operation every cycle while a previous one is still in flight.)
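The tag field only has to be wide enough to name every unit that can put a result on the CDB. The sketch below does that arithmetic for the counts listed above. Two assumptions belong to this sketch, not to the sources: that the six load buffers count as result sources alongside the five reservation stations, and that an all-zeros code is reserved to mean “no tag, value present.” `RESULT_SOURCES` and `tag_bits` are hypothetical names.

```python
import math

# Units assumed to place results on the CDB, using the counts in §1.4.
RESULT_SOURCES = {
    "floating-point load buffers (FLB)": 6,
    "add reservation stations (A1-A3)": 3,
    "multiply/divide reservation stations (M/D1-M/D2)": 2,
}


def tag_bits(num_sources: int, reserve_zero: bool = True) -> int:
    """Bits needed to give every source a distinct tag. If reserve_zero is
    True, one extra code (all zeros) is kept to mean 'no tag: value present';
    that convention is an assumption of this sketch."""
    codes = num_sources + (1 if reserve_zero else 0)
    return max(1, math.ceil(math.log2(codes)))


total = sum(RESULT_SOURCES.values())
print(total, tag_bits(total))  # → 11 4
```

Eleven sources plus a reserved “empty” code fit in four bits with room to spare; a sixteenth source would force a fifth bit under the same convention.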

1.5 The Common Data Bus (CDB)

The CDB is the architectural innovation of the Model 91 FPU. It is a single broadcast bus fed by every functional unit that can produce a result, and read by every consumer (reservation stations, FP registers, store buffers). When a functional unit completes, it puts its result and its tag on the CDB; any reservation station, register, or store buffer waiting on that tag captures the value in the same cycle. This collapses the producer→register-file→consumer round-trip into a single broadcast, eliminating the write-back stage as a serialization point and turning the register file into one of several possible operand sources rather than the only one. Source: R. M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of Research and Development 11(1):25–33, January 1967, esp. Fig. 2.
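The broadcast-and-capture behaviour can be sketched in a few lines. This is a behavioural toy, not a model of the 91’s circuits: `Slot` and `CommonDataBus` are hypothetical names, each operand slot is reduced to a value-or-tag pair, and one `broadcast` call stands for one CDB cycle. Labels such as `A1` and `MD1` follow the station names in §1.4.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    """An operand slot in a reservation station, FP register, or store
    buffer: it holds either a value or the tag of the unit that will produce it."""
    name: str
    value: Optional[float] = None
    tag: Optional[str] = None  # None means the value is already present


class CommonDataBus:
    """Toy CDB: every attached slot snoops each broadcast."""

    def __init__(self) -> None:
        self.listeners: List[Slot] = []

    def attach(self, slot: Slot) -> None:
        self.listeners.append(slot)

    def broadcast(self, tag: str, value: float) -> List[str]:
        """One CDB cycle: every slot waiting on `tag` captures `value`
        at once. Returns the names of the slots that captured it."""
        captured = []
        for slot in self.listeners:
            if slot.tag == tag:
                slot.value, slot.tag = value, None
                captured.append(slot.name)
        return captured


cdb = CommonDataBus()
slots = [
    Slot("A1.source", tag="MD1"),       # add waiting on a multiply
    Slot("F2", tag="MD1"),              # register renamed to MD1
    Slot("store buffer 1", tag="MD1"),  # store waiting for its data
    Slot("A2.source", tag="A1"),        # waits on a different producer
]
for s in slots:
    cdb.attach(s)

print(cdb.broadcast("MD1", 6.0))  # → ['A1.source', 'F2', 'store buffer 1']
print(slots[3].tag)               # → A1
```

The register `F2` is just one more listener here, which is the point made above: the register file becomes one of several consumers of a broadcast rather than a mandatory stop between producer and consumer.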

1.6 Imprecise interrupts (the famous flaw)

Because instructions could complete out of order, when a floating-point exception occurred the instruction address in the PSW (S/360’s program status word, which plays the role of the PC) no longer pointed at the offending instruction. The 91 raised an imprecise interrupt: it told the OS that something had failed, but neither which instruction caused it nor what the precise machine state had been at that point. This made OS-level recovery (correcting an underflow, restarting a divide, even producing a usable traceback) effectively impossible.

The designers’ own account (Anderson, Sparacio & Tomasulo 1967) and IBM’s later retrospective (Padegs 1981) are unusually blunt: providing precise interrupts would have required either forcing sequential execution (defeating the entire architecture) or saving enough state to recover (prohibitively expensive in 1965-era hardware). Their compromise was a non-overlapped mode switch that made the machine run at roughly Model 75 speed but with precise interrupts: usable for debugging, useless for production. The trade was made explicit to customers: in exchange for performance, they gave up FP error recovery. Imprecise interrupts are usually said not to have survived into later IBM big iron, but whether the Model 195 retained them is [CHECK]; some accounts describe the 195 as still imprecise. Source: Mark Smotherman, “IBM Systems 360 Model 91 Interrupts,” people.computing.clemson.edu/~mark/ibm360m91int.html, citing Padegs, “System/360 and Beyond,” IBM J. Res. Dev. 25(5), Sept. 1981.

This issue was the practical reason the Model 91 was hard to deploy on conventional commercial workloads, and one reason commercial CPUs avoided aggressive OoO for thirty years until Smith & Pleszkun’s reorder-buffer technique (1985, 1988) showed how to make precise interrupts compatible with OoO execution.
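Why out-of-order completion leaves no single restart point can be shown with a toy timing model. Everything here is assumed for illustration: the latencies are invented, not Model 91 figures; the S/360 mnemonics (DDR, ADR) are labels only; one instruction issues per cycle in program order; and operand dependences are ignored. The check asks whether, at the moment the faulting instruction finishes, exactly the instructions before it (and none after it) have completed, which is what a precise interrupt would require. `Instr`, `completion_cycles`, and `state_at_fault` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    text: str      # S/360-style mnemonic, for labelling only
    latency: int   # invented cycle count, not a Model 91 figure


def completion_cycles(program: List[Instr]) -> List[int]:
    """In-order issue, one instruction per cycle; each finishes
    `latency` cycles after issue. Operand dependences are ignored."""
    return [issue + ins.latency for issue, ins in enumerate(program)]


def state_at_fault(program: List[Instr], faulting: int) -> Tuple[List[int], bool]:
    """Which instructions have already completed when `faulting` finishes
    (and reports its exception), and whether that set is exactly the
    instructions before it, as a precise interrupt requires."""
    done = completion_cycles(program)
    t = done[faulting]
    completed = {i for i, c in enumerate(done) if c < t}
    return sorted(completed), completed == set(range(faulting))


program = [Instr("DDR F0,F2", 18), Instr("ADR F4,F6", 2), Instr("ADR F6,F6", 2)]
print(state_at_fault(program, 0))  # → ([1, 2], False)
```

When the slow divide faults, both later adds have already changed F4 and F6, so no instruction address describes the register state: the condition a reorder buffer exists to restore.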

1.7 Stretch (IBM 7030) lineage

The Model 91’s intellectual ancestry runs through the IBM 7030 Stretch (delivered 1961). Stretch’s “instruction lookahead” — Gene Amdahl and John Backus’s “asynchronous non-sequential” (ANS) control, refined by John Cocke and Harwood Kolsky — was IBM’s first attempt at decoupling instruction fetch from execution to hide memory latency. Stretch was a commercial failure (price was halved on delivery; Watson called the project a near-disaster publicly), but its lookahead unit, branch-target prediction, and pipelined arithmetic were the seedbed from which the 360/91 grew. Sources: Bashe, Johnson, Palmer & Pugh, IBM’s Early Computers, MIT Press, 1986; Mark Smotherman, “Organization Sketch of IBM Stretch.”

The continuity of personnel matters: Tomasulo, Earle, Anderson, and Sparacio had all worked on Stretch-era problems before the 91. Cocke moved from Stretch to Project X / ACS (see §3) without ever working on the 91 directly, which is why the 91 and ACS architectures share DNA but differ in morphology.

1.8 Pipelined FP, integer/FP path differences

Integer pipeline: simple, in-order, two-stage execute. Integer instructions execute in the instruction unit with no Tomasulo machinery.
FP pipeline: the entire reservation-station + CDB system applies only to floating-point arithmetic. Out-of-order completion is a property of the FPU, not of the integer or memory units.
Memory: load/store operations interact with the FPU through the load buffers (FLBs) and store buffers; memory ordering is approximate (“memory stores could occur out of sequence” — Wikipedia, citing the Functional Characteristics manual GA22-6907-2).

1.9 System/360 binary compatibility

The Model 91 implements the full System/360 instruction-set architecture. Programs compiled for any other S/360 (e.g., Model 75) ran without recompilation — that was the whole point of S/360 as a family. The 91’s deviations were behavioral: imprecise exceptions, last-bit differences in FP divide, out-of-order memory stores, and altered underflow/overflow handling.

On decimal arithmetic: the user’s note about “no decimal arithmetic in some 91s” is [FLAG / partly disputed]. The standard 360/91 supported the full S/360 instruction set including decimal (packed BCD) arithmetic. However, decimal arithmetic was implemented in the integer/storage path, not in the high-speed FPU; sources I consulted do not document a 91 SKU shipped without decimal. If an “internal use” 91 was stripped of decimal it isn’t documented in Wikipedia or Smotherman’s compendium. Treat as a claim needing a primary source (probably the Functional Characteristics manual).

1.10 The Model 95 and Model 195

Model 95: A NASA-only variant of the Model 91 in which 1 MB of main memory was replaced by thin-film memory (54 ns cycle vs 780 ns core). Two units built. Both went to NASA: Goddard Space Flight Center (Maryland) and Goddard Institute for Space Studies (Manhattan). Year: 1968 is the usually cited delivery year and is consistent with all sources consulted, but the user’s “(was it 1968?)” cannot be pinned to a month from public sources. Source: Wikipedia.
Model 195: The actual successor. Announced August 1969, delivered 1971, withdrawn 1977. Reimplemented the Model 91 logic in monolithic ICs; added a 32 KB cache; cycle time 54 ns. The 195 is often said to have dropped imprecise interrupts [CHECK: other accounts say it retained them; verify against Padegs 1981]. The 195 is what GFDL ran from 1974 to 1975 and is most likely the machine on which the Manabe & Wetherald 1975 paper was computed (see §4).

2. Tomasulo’s algorithm

2.1 The 1967 paper — full citation

R. M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of Research and Development, Vol. 11, No. 1, pp. 25–33, January 1967. DOI: 10.1147/rd.111.0025. Received by the journal 16 September 1965; published January 1967.

Companion papers in the same issue:

D. W. Anderson, F. J. Sparacio, R. M. Tomasulo, “The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling,” IBM J. Res. Dev. 11(1):8–24, January 1967.
S. F. Anderson, J. G. Earle, R. E. Goldschmidt, D. M. Powers, “The IBM System/360 Model 91: Floating-Point Execution Unit,” IBM J. Res. Dev. 11(1):34–53, January 1967.
L. J. Boland, G. D. Granito, A. U. Marcotte, B. U. Messina, J. W. Smith, “The IBM System/360 Model 91: Storage System,” IBM J. Res. Dev. 11(1):54–68, January 1967.

The whole January 1967 issue is the single best primary source for the 91.

2.2 Reservation stations as register renaming

Each reservation station is logically an unnamed register that can hold an operand or a tag pointing to whichever functional unit will produce that operand. When an instruction is dispatched to a reservation station, its source operands are either supplied immediately (from the FP register file) or replaced by tags (the IDs of the producing reservation stations or load buffers). The architectural FP register file thus holds the most recently completed values rather than serving as the only operand source. When an instruction in a reservation station has received both its operands (via the CDB), it executes; when it completes, it broadcasts result + tag.

The renaming is implicit: the same architectural register name written by two different instructions becomes two different reservation-station tags, with no architectural collision. This is the essential mechanism that dissolves WAR (Write-After-Read) and WAW (Write-After-Write) hazards. Each consumer captures the value carrying the tag it was handed at issue, so a later write to the same register cannot clobber an operand it still needs (WAR); and each register keeps only the tag of its most recent writer, so an older result that arrives late is not written into it (WAW). The writes can therefore complete in any order, and no false dependency stalls execution.
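The register side of renaming can be sketched as one tag per FP register. This toy uses hypothetical names (`FPRegister`, `issue_write`, `read_operand`, `cdb_result`), a single register F0, and three invented instructions; it shows a reader keeping the tag it saw at issue (WAR) and a register ignoring a late result whose tag is no longer current (WAW).

```python
from typing import Optional, Tuple, Union


class FPRegister:
    """Architectural FP register plus the tag of its latest pending writer."""

    def __init__(self, value: float) -> None:
        self.value = value
        self.tag: Optional[str] = None  # None: no write pending


def issue_write(reg: FPRegister, rs_tag: str) -> None:
    """Issuing an instruction that writes `reg` renames it to that
    instruction's station; any older pending tag is simply overwritten."""
    reg.tag = rs_tag


def read_operand(reg: FPRegister) -> Tuple[str, Union[str, float]]:
    """At issue, a source operand is the current value or the producer's tag."""
    if reg.tag is not None:
        return ("tag", reg.tag)
    return ("value", reg.value)


def cdb_result(reg: FPRegister, tag: str, value: float) -> None:
    """The register accepts a broadcast only if it still waits on that tag."""
    if reg.tag == tag:
        reg.value, reg.tag = value, None


f0 = FPRegister(1.0)
issue_write(f0, "MD1")      # I1: slow multiply that writes F0
src_i2 = read_operand(f0)   # I2: reads F0, so it holds tag MD1
issue_write(f0, "A2")       # I3: fast add that also writes F0 (WAW with I1)
cdb_result(f0, "A2", 5.0)   # I3 finishes first: F0 = 5.0
cdb_result(f0, "MD1", 3.0)  # I1 finishes late: F0 ignores the stale result
print(src_i2, f0.value)     # → ('tag', 'MD1') 5.0
```

I2’s operand slot, still waiting on MD1, would capture 3.0 from the same broadcast, so the late result is delivered to its reader while the register keeps the newer value from I3.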

RAW (Read-After-Write) is not eliminated by renaming. A true data dependence — instruction j reads what instruction i wrote — is preserved in the tags themselves: j’s reservation station holds the tag for i’s producing unit, and j genuinely cannot execute until i has broadcast its result on the CDB. Renaming only removes false dependencies (WAR/WAW); the real data-flow constraint (RAW) is the limit of available parallelism. This is the core insight.
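The claim that RAW is the remaining limit can be made concrete as a dataflow bound: with renaming and unlimited hardware, each instruction can finish no earlier than its latest RAW producer plus its own latency. `dataflow_limit` and the latencies in the examples are hypothetical; issue bandwidth and functional-unit counts are deliberately ignored.

```python
from typing import Dict, List, Sequence, Tuple

Instr = Tuple[str, Sequence[str], int]  # (destination, sources, latency)


def dataflow_limit(program: List[Instr]) -> int:
    """Earliest cycle by which every instruction can finish if only RAW
    dependences delay execution (unlimited units and issue bandwidth).
    Each source reads the most recent earlier writer of that register,
    which is what renaming guarantees."""
    ready: Dict[str, int] = {}  # register -> cycle its latest value exists
    last = 0
    for dest, srcs, latency in program:
        start = max((ready.get(r, 0) for r in srcs), default=0)
        done = start + latency
        ready[dest] = done  # later readers of dest see this producer
        last = max(last, done)
    return last


# Invented latencies: 4-cycle load, 6-cycle multiply, 2-cycle add.
reuse = [("F0", [], 4), ("F2", ["F0"], 6),  # iteration 1
         ("F0", [], 4), ("F4", ["F0"], 6)]  # iteration 2 reuses F0
chain = [("F0", [], 4), ("F2", ["F0"], 6), ("F4", ["F2"], 2)]
print(dataflow_limit(reuse), dataflow_limit(chain))  # → 10 12
```

Reusing F0 costs nothing (both iterations finish by cycle 10), while the three-step RAW chain cannot finish before 4 + 6 + 2 = 12 cycles no matter how much hardware is available.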

2.3 CDB as result-forwarding

The CDB is what makes renaming practical. Without it, reservation stations would have to poll the register file or wait for a writeback stage; with it, results reach all waiting consumers in the same cycle they are produced. The CDB is the data-flow graph executing in hardware. (See Hennessy & Patterson on this point in essentially every edition of Computer Architecture: A Quantitative Approach, §3.5–3.6 in the 5th–6th editions.)

2.4 Tomasulo vs Thornton’s CDC 6600 scoreboard

Property	CDC 6600 Scoreboard (Thornton, 1964)	IBM 360/91 Tomasulo (1967)
WAR hazards	Stalls writeback until prior reads complete	Eliminated by renaming
WAW hazards	Stalls issue until prior write completes	Eliminated by renaming
RAW hazards	Stalls execute until operand ready	Consumer issues into an RS holding the producer’s tag and waits for the CDB broadcast; later instructions keep issuing
Result forwarding	Through register file	Through CDB broadcast (one-cycle forward)
Scope	Whole CPU (all 10 functional units)	FP unit only
OoO scope	In-order issue (blocks on WAW and busy units), out-of-order execution and completion	In-order issue (blocks only when no RS or buffer is free), out-of-order execution and completion

What the 91 could do that the 6600 couldn’t: iterate tightly through a loop with cross-iteration register reuse. A FORTRAN inner loop that reuses the same FP register as a temporary in every iteration (load into it, operate on it, store from it) would stall the 6600’s scoreboard at every loop turn on WAW or WAR conflicts; on the 91, register renaming made each iteration independent up to the true RAW chain. (A true accumulator is different: its cross-iteration dependence is RAW, which renaming cannot remove.)
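The loop argument can be checked with a deliberately simplified timing model. `simulate` is a hypothetical sketch, not either machine’s real timing: one instruction issues per cycle in order, functional units are unlimited, results forward as soon as they are produced, and WAR checks are ignored. The only difference between the two runs is whether an instruction may issue while an earlier write to its destination is still pending (a scoreboard-style WAW check) or not (renaming). Latencies are invented.

```python
from typing import Dict, List, Sequence, Tuple

Instr = Tuple[str, Sequence[str], int]  # (destination, sources, latency)


def simulate(program: List[Instr], rename: bool) -> int:
    """Return the cycle in which the last instruction completes.

    Simplifications (assumed, not either machine's real timing): one
    instruction issues per cycle, in program order; functional units are
    unlimited; results are forwarded as soon as produced; WAR is ignored.
    rename=False adds a scoreboard-style WAW check: an instruction may not
    issue until any earlier pending write to its destination has completed.
    """
    ready: Dict[str, int] = {}          # register -> cycle value available
    pending_write: Dict[str, int] = {}  # register -> latest writer's finish
    issue = 0
    last = 0
    for dest, srcs, latency in program:
        if not rename:
            issue = max(issue, pending_write.get(dest, 0))  # WAW stall
        # Execution starts the cycle after issue, once sources are ready (RAW).
        start = max([issue + 1] + [ready.get(r, 0) for r in srcs])
        done = start + latency
        ready[dest] = done
        pending_write[dest] = done
        last = max(last, done)
        issue += 1  # the next instruction issues no earlier than next cycle
    return last


# Two loop iterations that reuse F0 and F2 (invented latencies).
iteration = [("F0", [], 4),       # load X(I) into F0
             ("F2", ["F0"], 6)]   # multiply, result in F2
loop = iteration * 2
print(simulate(loop, rename=False), simulate(loop, rename=True))  # → 18 13
```

Without renaming, the second load into F0 waits for the first to complete and the second multiply waits for the first write to F2, so the pair takes 18 cycles against 13; a single iteration has nothing to rename, and both policies take 11.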

What the 6600 could do that the 91 couldn’t: run real OS code without the 91’s architectural compromise on exceptions. The 6600’s scoreboard is usually credited with preserving precise state [CHECK: the 6600 did not take floating-point error interrupts in the S/360 sense, so this comparison may overstate what the scoreboard itself provided]. Combined with its stunning memory bandwidth and 10-functional-unit organization, this made it the workhorse of weather and physics labs through the late 1960s; the 91 was rarer, weirder, and harder to live with.

The standard reference for the comparison is Hennessy & Patterson, Computer Architecture: A Quantitative Approach, §3.4–3.7 in the 5th–6th editions; also Thornton, Design of a Computer: The Control Data 6600, Scott, Foresman & Co., 1970.

2.5 The Tomasulo paper’s worked example

Tomasulo (1967) illustrates the algorithm with a tight FORTRAN-style inner loop equivalent to a DAXPY kernel — the canonical Y(I) = A * X(I) + Y(I) computation of dense linear algebra. (User’s note is correct; the paper’s Fig. 6 walks through the timing of this loop and shows that without renaming, the loop would stall on the reuse of FP register 0 every iteration.) The example demonstrates that Tomasulo’s algorithm achieves close to the data-flow limit for the loop, while a purely scoreboarded machine cannot.

2.6 The thirty-year gap and the P6 revival

After the 91 (and to a lesser extent the 195), Tomasulo’s algorithm essentially disappeared from commercial CPUs for ~28 years. Reasons:

Imprecise interrupts were unacceptable for general-purpose OS workloads.
The technique required a transistor budget unavailable in the SSI/MSI/LSI generations of the 1970s.
RISC took the architectural energy of the 1980s in a different direction (deep pipelines, software scheduling).
Smith & Pleszkun (1985, 1988) provided the missing piece — the reorder buffer — making precise interrupts compatible with out-of-order completion.
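The reorder-buffer idea can be sketched in its simplest form: results may arrive in any order, but architectural registers change only when the oldest entry retires. This is a minimal illustration in the spirit of Smith & Pleszkun, not their exact design (real reorder buffers also forward results and handle stores); `RobEntry`, `ReorderBuffer`, and the scenario are hypothetical, and the Model 91 had no such structure.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class RobEntry:
    seq: int                       # program-order number
    dest: str                      # architectural destination register
    value: Optional[float] = None  # result, once computed
    done: bool = False
    fault: bool = False            # instruction raised an exception


class ReorderBuffer:
    """Results may arrive in any order, but architectural registers are
    updated only from the oldest entry, one at a time."""

    def __init__(self, regs: Dict[str, float]) -> None:
        self.entries: Deque[RobEntry] = deque()
        self.arch_regs = dict(regs)

    def allocate(self, seq: int, dest: str) -> RobEntry:
        entry = RobEntry(seq, dest)
        self.entries.append(entry)
        return entry

    def retire(self) -> Optional[int]:
        """Commit finished entries from the head in program order. If the
        head faulted, squash it and everything younger and return its
        sequence number: all older work is committed, nothing younger is."""
        while self.entries and self.entries[0].done:
            head = self.entries[0]
            if head.fault:
                self.entries.clear()
                return head.seq
            self.arch_regs[head.dest] = head.value
            self.entries.popleft()
        return None


rob = ReorderBuffer({"F0": 1.0, "F4": 2.0, "F6": 3.0})
div = rob.allocate(0, "F0")          # slow divide, will fault
add1 = rob.allocate(1, "F4")
add2 = rob.allocate(2, "F6")
add1.value, add1.done = 5.0, True    # the adds finish first...
add2.value, add2.done = 7.0, True
print(rob.retire())                  # → None
div.done, div.fault = True, True     # ...then the divide faults
print(rob.retire(), rob.arch_regs)   # → 0 {'F0': 1.0, 'F4': 2.0, 'F6': 3.0}
```

This is the same divide-then-adds situation that leaves the Model 91 imprecise, but here the completed adds never reach the architectural registers, so the interrupt can report instruction 0 with the state exactly as it was before it.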

The best-known commercial revival came with Intel’s P6 microarchitecture (Pentium Pro), launched 1 November 1995. P6 used a Tomasulo-derived scheme: a 20-entry reservation station array, register renaming into a 40-entry reorder buffer whose entries held results until in-order retirement to the architectural (retirement) register file, and micro-op decode feeding that reorder buffer. Hennessy & Patterson use the P6 as their case study of Tomasulo applied at scale (5th edition, §3.10–3.13). Source: Wikipedia “Pentium Pro”; Hennessy & Patterson chapter on ILP.

2.7 Modern lineage in one paragraph

After P6, dynamic scheduling with reservation stations and renaming became universal in high-performance CPUs. AMD K5 (1996) used a Tomasulo derivative; the K7/Athlon (1999) refined it; IBM POWER has used reservation-station-based OoO since POWER1 (1990) with successive elaboration through POWER2/3/4 (Tendler et al. 2002); MIPS R10000, DEC Alpha 21264, HP PA-8000, and PowerPC 604 all shipped Tomasulo-class engines in the mid-to-late 1990s. Apple Silicon (M1, 2020 and successors) is a deeply OoO Tomasulo machine with very wide reservation/issue (8+ wide decode, 600+ in-flight instructions). Every high-performance CPU in 2026, whether x86, ARM, POWER, or RISC-V, runs some descendant of what Tomasulo wrote down in IBM Poughkeepsie in 1965.

3. Project Y / ACS-1

3.1 Origins and Jenny Lake

Project X began at IBM Yorktown in January 1961 (Amdahl, Boehm, Cocke working on a stack-architecture “supercomputer beyond Stretch”). The pivotal moment was the Jenny Lake meeting at Jackson Hole, Wyoming, in September 1963, where Thomas J. Watson Jr. directed IBM Research “to undertake the design of a machine of highest attainable performance” — explicitly more aggressive than Stretch or the just-announced CDC 6600. Source: Pugh, Building IBM: Shaping an Industry and Its Technology, MIT Press, 1995; corroborated by Mark Smotherman’s “Timeline of IBM ACS Project,” people.computing.clemson.edu/~mark/acs_timeline.html. (User’s reference to Jenny Lake September 1963 is correct.)

Project X became Project Y in November 1963, formally housed in Jack Bertram’s Experimental Computers and Programming group at Yorktown. In June 1965 the supercomputer effort was relocated to California under Max Paley as ACS (Advanced Computing Systems); Menlo Park became headquarters in 1967.

3.2 Personnel

John Cocke — chief architect of the ACS-1 instruction set, decoupled-execution and lookahead design, and the central figure in IBM’s high-performance work from Stretch through 801/RISC. Turing Award 1987.
Lynn Conway — joined ACS to write a software simulator of the machine. She invented dynamic instruction scheduling (“DIS”): a bit-matrix scheme tracking which instructions in a window were ready and which were waiting on operands, able to issue more than one instruction per cycle out of order. This is the conceptual core of superscalar OoO and predates anything similar in published commercial hardware. Source: Conway, “Reminiscences of the VLSI Revolution” (IEEE Solid-State Circuits Magazine, 2012); Wikipedia “Lynn Conway.” Her bit-matrix scheme was more ambitious than Tomasulo’s reservation stations because it was multi-issue, not single-issue.
Herb Schorr — initial instruction-set designer; recruited Conway from his Columbia University circle.
Frances Allen — compilers (later Turing Award 2006).
Brian Randell — research/compiler.
Edward H. Sussenguth — architecture.
Gene Amdahl — increasingly the proponent of S/360 compatibility starting in 1965; eventually the man who would split off to found Amdahl Corp. in 1970.
John Earle — circuits; co-author with Amdahl of the AEC/360 proposal.
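The general idea behind a bit-matrix ready check, as described in the Conway entry above, can be sketched as follows. This is not a reconstruction of Conway’s DIS hardware, whose details are in her ACS papers; it is a generic illustration under stated assumptions: each row is a bitmask of the instructions an entry waits on, every instruction completes in one cycle, and the oldest ready entries are picked up to an assumed issue width. `schedule` and the example window are hypothetical.

```python
from typing import List


def schedule(deps: List[int], width: int) -> List[List[int]]:
    """deps[i] is a bitmask: bit j set means instruction i needs j's result.
    Each cycle, up to `width` ready entries (all dependence bits clear) issue,
    oldest first. Every instruction is assumed to finish in one cycle, after
    which its column is cleared from every row. Returns the issue groups."""
    pending = list(deps)
    waiting = set(range(len(deps)))
    groups: List[List[int]] = []
    while waiting:
        ready = [i for i in sorted(waiting) if pending[i] == 0][:width]
        if not ready:
            raise ValueError("circular dependence in window")
        groups.append(ready)
        finished = 0
        for i in ready:
            finished |= 1 << i
            waiting.discard(i)
        pending = [row & ~finished for row in pending]  # clear columns
    return groups


# 2 needs 0; 3 needs 1 and 2; 0, 1 and 4 are independent.
window = [0b000, 0b000, 0b001, 0b110, 0b000]
print(schedule(window, width=2))       # → [[0, 1], [2, 4], [3]]
print(len(schedule(window, width=1)))  # → 5
```

With this window, an issue width of 2 drains it in 3 cycles where width 1 needs 5; that gap is the multi-issue advantage attributed above to DIS over single-issue designs.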

3.3 Why ACS was killed (May 1969)

The story is well-documented by Smotherman (“End of IBM ACS Project”) and Pugh (Building IBM):

1965 onward: Amdahl argued ACS-1 should be S/360-compatible. The ACS team resisted, insisting that compatibility would cap performance.
1967: Amdahl was effectively frozen out of ACS daily operations.
1968: Amdahl and John Earle, working partly outside ACS, sketched the AEC/360 (Amdahl-Earle Computer / 360-compatible) — a design they argued ran 15% faster than ACS-1 with 25% less hardware and preserved S/360 compatibility.
May 1968: IBM upper management held a “shoot-out” between the two designs and mandated the switch to S/360 compatibility. ACS-1 became ACS/360. The original ACS team was demoralized; many left over the next year.
May 1969: When Amdahl proposed three ACS/360 models for profitability, IBM’s marketing analysis showed only the full line could earn IBM’s standard 30% pre-tax margin; the high-end alone was a loss leader. Watson and Learson, with growing antitrust concerns about IBM “buying” the supercomputer market away from CDC, cancelled ACS entirely rather than introduce a money-losing high-end machine. Source: Smotherman, “End of IBM ACS Project.”

3.4 Why ACS matters

ACS-1 would have been more aggressive than the 360/91: multiple-issue out-of-order, a deeper instruction window, predicated execution, and an instruction set tuned for what we would now call ILP extraction. Conway’s DIS scheme would have been the first multi-issue dynamic scheduler in production. None of it shipped.

But the architectural ideas migrated. Cocke moved from ACS to IBM Yorktown’s 801 project in the mid-1970s (the first RISC machine), which became the basis for IBM POWER (1990) and PowerPC. ACS’s emphasis on simple instruction issue plus aggressive compiler scheduling is the ancestor of RISC philosophy as a whole. Cocke’s 1987 Turing Award citation recognizes his contributions to compilers, the architecture of large systems, and the development of RISC, which together span this lineage.

Conway, fired by IBM in 1968 after disclosing her gender transition (a firing IBM publicly apologized for in 2020), restarted her career at Memorex and then Xerox PARC, where with Carver Mead she co-authored Mead & Conway, Introduction to VLSI Systems, Addison-Wesley, 1980, the textbook that revolutionized integrated-circuit design pedagogy. Her DIS work was rediscovered and credited only decades later. The chip-design revolution she catalyzed at PARC was a direct downstream consequence of her IBM exile. Source: Conway, “Reminiscences of the VLSI Revolution”; “In Memoriam: Lynn Conway (1938–2024),” Computer History Museum.

4. Customers and weather-science use

4.1 Total systems built

[DISPUTED] Sources do not agree on the exact count. Wikipedia: “only 15 Model 91s ever produced, four of which were for IBM’s internal use,” but immediately concedes “I have read and heard it authoritatively stated that the number was 10, 11, 12, 14, 15, or 20.” The user’s “approximately 20” is at the high end of plausibility and may include Model 95s and 195s. The most defensible blog-friendly figure is “around 15 units, including a few for IBM’s own use, with no more than about 11 going to outside customers.”

4.2 Confirmed external customers

This is partial — a complete list does not appear to exist in any single open source. Confirmed installations:

NASA Goddard Space Flight Center (Maryland) — Model 91, first delivery, October–November 1967; production January 1968. Plus a Model 95 (thin-film variant).
NASA Goddard Institute for Space Studies (Manhattan) — second Model 95.
Columbia University — Model 91 installed 1968, decommissioned November 1980 (a remarkably long run). Source: Columbia Computing History.
UCLA — Model 91 in use by 1971 providing production computing services (and ARPANET-connected service). Source: Wikipedia.
AEC labs: at least one of LASL (Los Alamos) and LLNL (Livermore) is widely cited as a 91 customer, but I could not confirm in the sources surveyed which lab(s) and when. [CHECK against Pugh / IBM internal records.]
Princeton / IDA: Princeton-area defense computing through the Institute for Defense Analyses’ Communications Research Division is sometimes cited as a 91 site. [CHECK]
Internal IBM: four units, locations not specified in public sources (likely Yorktown, Poughkeepsie, and field engineering training).

The user should cross-check against the IBM 360/91 Functional Characteristics manual (GA22-6907) and against the IBM customer-installation lists in Pugh (1995) for any external publication.

4.3 NCAR machine timeline

NCAR did NOT have a 360/91 or 360/195. This is worth being explicit about because it’s a common misconception.

CDC 3600 — 1963, NCAR’s first scientific machine.
CDC 6600 (serial number 7) — delivered late December 1965, moved to Mesa Lab December 1966, decommissioned May 1977. Operated for >11 years.
1970 evaluation: NCAR benchmarked CDC 7600 vs IBM 360/195. The 7600 won “by a slight margin.” NCAR chose the 7600.
CDC 7600 (serial number 12) — delivered May 1971, decommissioned April 1983.
Cray-1A s/n 3 — acquired 1977 alongside the 7600.
Cray-1A s/n 14 — replaced the 7600 in spring 1983.

Source: NCAR Computational and Information Systems Lab supercomputing history pages (cisl.ucar.edu/ncar-supercomputing-history).

So the user’s question “did NCAR have a 360/91 or 360/195 in between?” — No. NCAR went CDC 6600 → CDC 7600 → Cray-1, and explicitly declined the 195 in 1970.

4.4 GFDL machine timeline

GFDL’s machine history is the canonical NWP/climate-science use case for the 91 and 195:

IBM 7030 Stretch — 1963–1965 (relative power 40, 64-bit FP).
IBM 360/91 — 1969–1973 (relative power 400 — a 10× jump).
IBM 360/195 — 1974–1975 (relative power 800).
(Subsequently TI ASC, Cyber 205, Cray X-MP, etc.)

Source: P. N. Edwards, A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming, MIT Press, 2010, online appendix; pne.people.si.umich.edu/vastmachine/i.GFDL.html.

Codes that ran on these machines included Smagorinsky’s nine-level hemispheric primitive-equation model, Manabe’s radiative-convective and GCM codes, the MARKFORT and Zodiac model series, and ocean GCM development by Bryan and Cox.

4.5 Manabe’s 1975 paper — which machine?

The classic paper:

S. Manabe and R. T. Wetherald, “The Effects of Doubling the CO₂ Concentration on the Climate of a General Circulation Model,” Journal of the Atmospheric Sciences, 32(1):3–15, January 1975. DOI: 10.1175/1520-0469(1975)032<0003:TEODTC>2.0.CO;2.

The paper itself does not state a machine; computational details are sparse. By GFDL’s published timeline (Edwards, A Vast Machine), the 1975 paper’s runs were performed on either the IBM 360/91 (in service at GFDL through 1973) or the IBM 360/195 that replaced it in 1974. Because the model integrations were extensive and the paper was submitted in 1974, the published runs were most likely computed on the 360/195, with earlier development and pilot runs on the 360/91. [CHECK] A 1974 submission does not by itself rule out production runs completed on the 91 in 1973.

(One source consulted for this file suggests the calculations were done on “an IBM 7090 or CDC 6600”; that is wrong for the 1975 paper and reflects a confusion with Manabe’s much earlier 1967 radiative-convective work. Flag this as a widely-repeated error. [FLAG])

The 360/91 was the machine of GFDL’s transition from Stretch-era hemispheric models to the first practical global GCM runs. By the time the famous climate-sensitivity result was published in 1975, the science had moved to the 360/195, but the enabling architecture and the working environment were direct continuations of what GFDL learned on the 91.

4.6 The broader weather/climate footprint

Direct NWP/climate use of the 91:

GFDL — primary site for U.S. climate-modeling development.
NASA Goddard / GISS — both atmospheric science and Earth-observation processing (the Model 95s).
UCLA (Mintz–Arakawa GCM development moved to UCLA’s 360/91 in the early 1970s).

NCAR is a non-customer worth noting explicitly; the U.S. NWP community’s CDC-vs-IBM split in this era roughly tracks AEC/atmospheric-research labs (CDC) vs Eastern science establishment + Princeton/GFDL (IBM).

5. Widely-cited but suspect claims (flag list)

Claim	Status
360/91 announced November 1964	Year correct, month not verified in open sources — [CHECK]
~20 systems sold	High end; ~15 total (incl. internal) is more defensible — [DISPUTED]
FPU has “2 add + 2 mul/div units”	Wrong. 1 add unit (3 RS) + 1 mul/div unit (2 RS)
“No decimal arithmetic in some 91s”	Not supported by sources consulted — [FLAG]
Manabe 1975 ran on IBM 7090 / CDC 6600	Wrong. Ran on 360/195 (or possibly 360/91); the 7090/6600 attribution conflates with Manabe’s pre-1970 work
NCAR had a 360/91 or 195	Wrong. NCAR went 3600 → 6600 → 7600 → Cray-1, declined the 195 in 1970
Lynn Conway invented OoO at IBM	Correct but underspecified — she invented multi-issue dynamic scheduling at ACS, conceptually beyond Tomasulo’s single-issue design
ACS cancelled May 1969	Correct
Jenny Lake retreat September 1963	Correct (Pugh, Smotherman)

6. Primary and key secondary sources

Primary (the IBM Journal of Research and Development, January 1967, Vol. 11, No. 1 — the entire issue)

D. W. Anderson, F. J. Sparacio, R. M. Tomasulo, “The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling,” pp. 8–24.
R. M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” pp. 25–33. DOI 10.1147/rd.111.0025.
S. F. Anderson, J. G. Earle, R. E. Goldschmidt, D. M. Powers, “The IBM System/360 Model 91: Floating-Point Execution Unit,” pp. 34–53.
L. J. Boland, G. D. Granito, A. U. Marcotte, B. U. Messina, J. W. Smith, “The IBM System/360 Model 91: Storage System,” pp. 54–68.

IBM technical documentation

IBM System/360 Model 91 Functional Characteristics, IBM publication GA22-6907 (multiple editions; -2 is canonical). Available at bitsavers.org.
A. Padegs, “System/360 and Beyond,” IBM Journal of Research and Development 25(5):377–390, September 1981. (Retrospective, including frank discussion of imprecise interrupts.)

Histories

C. J. Bashe, L. R. Johnson, J. H. Palmer, E. W. Pugh, IBM’s Early Computers, Cambridge, MA: MIT Press, 1986. ISBN 0-262-02225-7. — primary reference for Stretch-era IBM and the lookahead heritage feeding the 91.
E. W. Pugh, Building IBM: Shaping an Industry and Its Technology, Cambridge, MA: MIT Press, 1995. ISBN 0-262-16147-8. — the standard reference for ACS, Jenny Lake, the Watson/Learson decisions, and the Amdahl/Earle pivot.
E. W. Pugh, L. R. Johnson, J. H. Palmer, IBM’s 360 and Early 370 Systems, MIT Press, 1991. — companion volume to IBM’s Early Computers; covers the Model 91 explicitly.
P. E. Ceruzzi, A History of Modern Computing, 2nd ed., MIT Press, 2003.
P. N. Edwards, A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming, MIT Press, 2010. — GFDL machine timeline.

Architecture reference

J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann. The Tomasulo case study and CDC 6600 scoreboard comparison appears in chapter 3 (“Instruction-Level Parallelism”) of editions 4–6 (4th ed. 2007, 5th ed. 2011, 6th ed. 2017). The 5th and 6th editions have the most extensive coverage of the P6 revival of Tomasulo.
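J. E. Smith and A. R. Pleszkun, “Implementation of Precise Interrupts in Pipelined Processors,” Proc. 12th International Symposium on Computer Architecture, pp. 36–44, 1985; expanded as “Implementing Precise Interrupts in Pipelined Processors,” IEEE Transactions on Computers 37(5):562–573, May 1988. The reorder buffer that fixed the 91’s worst flaw.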

Memoirs and online compendia

M. Smotherman, IBM ACS pages (people.computing.clemson.edu/~mark/acs.html, /acs_end.html, /acs_timeline.html, /acs_organization.html, /ibm360m91int.html). The single best free resource on ACS and on the 91’s interrupt problem.
L. Conway, “Reminiscences of the VLSI Revolution: How a Series of Failures Triggered a Paradigm Shift in Digital Design,” IEEE Solid-State Circuits Magazine, Vol. 4, No. 4, Fall 2012. Conway’s first-person account, including the ACS-era invention of multi-issue OoO scheduling and her IBM exile.
L. Conway, “IBM-ACS: Reminiscences and Lessons Learned from a 1960’s Supercomputer Project,” in C. B. Jones and J. L. Lloyd (eds.), Dependable and Historic Computing: Essays Dedicated to Brian Randell on the Occasion of His 75th Birthday, Springer LNCS 6875, 2011. DOI 10.1007/978-3-642-24541-1_15.
J. E. Thornton, Design of a Computer: The Control Data 6600, Scott, Foresman & Co., 1970. — The other side of the comparison.

Web references actually consulted while preparing this file

Wikipedia “IBM System/360 Model 91”
Wikipedia “Tomasulo’s algorithm”
Wikipedia “IBM Advanced Computer Systems project”
Wikipedia “Lynn Conway”
Wikipedia “Pentium Pro”
Wikipedia “IBM 7030 Stretch”
Wikipedia “IBM System/360 Model 195”
NCAR CISL supercomputing history (cisl.ucar.edu/ncar-supercomputing-history/cdc7600 and /cdc6600)
Edwards, A Vast Machine online appendix on GFDL
bitsavers.org IBM 360/91 Functional Characteristics (GA22-6907)

Michał Brennek