Era 6 and Era 7: Breakthroughs and Failures of the Cluster and GPU Era (1991-2026)
Research compendium for Post 36 (“From Tubes to GPUs”). Compact “moments” (date, machine/project, people, breakthrough or failure, consequence, citation). Organised chronologically. The cluster era runs roughly 1991-2010, the GPU era 2006-2026, and they overlap heavily. Breakthroughs and failures interleave.
Foundation moments (1991-1995): the substrate of the cluster era
1991-08-25, Linus Torvalds, comp.os.minix announcement
The breakthrough: A 21-year-old Finnish CS student at the University of Helsinki posted a now-canonical message to the comp.os.minix Usenet newsgroup: “I’m doing a (free) operating system (just a hobby, won’t be big and professional like gnu) for 386(486) AT clones.” The post was titled “What would you like to see most in minix?” Torvalds had been brewing the project since April 1991. Version 0.01 of what would be called Linux (Ari Lemke chose the name when creating the FTP subdirectory) was released 17 September 1991. Torvalds himself preferred “Freax.” Consequence: Without Linux, every commercial Unix vendor (Sun, IBM, HP, DEC, SCO, SGI) would have continued charging per-CPU licenses on every cluster node. Linux made the Beowulf-pattern cluster economically possible. By 2024, every TOP500 system on the planet ran Linux. Source: Linus Torvalds, comp.os.minix, 25 August 1991. Tom’s Hardware, “Linux is 34 years old today,” 25 August 2025. Wikipedia, “History of Linux.”
1992 (spring), MPI-1 standardisation begins – the MPI Forum
The breakthrough: The MPI Forum – a consortium of supercomputing vendors and national-laboratory researchers – began standardising the Message Passing Interface in spring 1992, following an April 1992 workshop in Williamsburg, Virginia. The MPI-1 specification was published in May 1994. The standard was a deliberately portable abstraction over the dozen incompatible message-passing libraries (PVM, P4, Chameleon, Express, Linda) that had previously fragmented distributed-memory parallel computing. Consequence: Every cluster supercomputer code from 1995 onwards is written in MPI plus a sequential language. The standardisation was the software-engineering precondition for cluster-based scientific computing. Source: Gropp, Lusk, and Skjellum, Using MPI, MIT Press, 1994. Message Passing Interface Forum, “MPI: A Message-Passing Interface Standard,” May 1994.
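To make “MPI plus a sequential language” concrete, here is a minimal sketch of the pattern in which essentially every post-1995 cluster code is written (the loop body is an illustrative stand-in for real physics, not taken from any particular code): every rank runs the same program, works on its own slice of the problem, and the ranks combine their partial results with a collective operation.

```c
/* Minimal sketch of the MPI pattern. Compile with mpicc, run with mpirun -np N ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes total?  */

    /* Each rank works on its own slice of the (illustrative) problem... */
    double local_sum = 0.0;
    for (int i = rank; i < 1000000; i += size)
        local_sum += 1.0 / (1.0 + (double)i);

    /* ...and a collective combines the partial results across the cluster. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f (computed by %d ranks)\n", global_sum, size);

    MPI_Finalize();
    return 0;
}
```

Everything domain-specific lives in the sequential language; MPI supplies only the process layout and the communication.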
1994 (summer), Beowulf-1 at NASA Goddard CESDIS
The breakthrough: Donald Becker (kernel networking developer, employed at CESDIS) and Thomas Sterling (research scientist at the Center of Excellence in Space Data and Information Sciences, a USRA division at NASA Goddard) built a 16-node Linux cluster of Intel 80486 DX4 processors at 100 MHz, each with 16 MB DRAM, connected by two channel-bonded 10 Mbit Ethernet networks, for an aggregate of approximately 500 MFLOPS. Sterling chose the name “Beowulf” off the cuff – his mother had majored in Old English and a copy of the epic was in his office. Consequence: The reference architecture for cluster-based scientific computing for the next thirty years. The September 1995 Sterling-Savarese-Becker-Dorband-Ranawake-Packer paper at the 24th International Conference on Parallel Processing was the canonical technical document. By 2010 the Beowulf pattern was the operational reality of every major weather-forecasting centre. Source: Sterling, T., Savarese, D., Becker, D. J., Dorband, J. E., Ranawake, U. A., and Packer, C. V. “Beowulf: A Parallel Workstation for Scientific Computation,” Proceedings of ICPP, 1995. Beowulf.org Overview/History page. NASA Spinoff 2020, “Beowulf Clusters Make Supercomputing Accessible.”
1996 (February), SGI buys Cray Research for $740M – the first failure in a chain
The failure: Silicon Graphics acquired Cray Research in February 1996 for $740 million. Cray Research’s workforce was slashed from over 4 500 to approximately 800 employees. Cray staffers later derisively referred to the SGI period as “the occupation.” SGI was attempting to fold Cray’s vector-supercomputing IP into its visualisation-and-graphics product line; the combination was an unmitigated cultural and financial failure. Consequence: Sets up the chain of events that produces “Cray Inc.” in 2000. The Cray Research vector lineage that had defined supercomputing from 1976 to 1995 effectively ended in 1996; the brand passed through three owners (SGI, Tera, then HPE in 2019) without fully recovering its 1980s identity. Source: Cray Inc. corporate history at companies.jrank.org. Encyclopedia.com, “Cray Inc.” Wikipedia, “Cray.”
The supercomputer-class cluster era (2000-2008)
2000-11, ASCI White at LLNL, the first multi-teraflop production system
The breakthrough: IBM delivered ASCI White to Lawrence Livermore National Laboratory under the Accelerated Strategic Computing Initiative. The machine had 8 192 IBM POWER3-II processors at 375 MHz across 512 nodes (16 CPUs per node), 6 TB main memory, 160 TB disk storage. Theoretical peak 12.3 TFLOPS, sustained Linpack 4.9 TFLOPS. ASCI White was formally dedicated on 15 August 2001. It weighed 106 tons, consumed 3 MW of electrical power plus roughly 3 MW for cooling, held TOP500 #1 from November 2000 to November 2001, and was the first computer with a theoretical peak above ten teraflops. Consequence: The DOE’s Stockpile Stewardship programme had specified, in 1995-1996, that simulation must replace nuclear testing; ASCI White was the first machine large enough to begin replacing the underground nuclear tests that had been suspended in 1992. The architectural lesson was that POWER-architecture clusters at this scale could be assembled, debugged, and operated in production. Every subsequent ASC and DOE leadership-class system through 2018 was a cluster of this general pattern. Source: TOP500 ASCI White page (top500.org). Wikipedia, “ASCI White.” LLNL ASC historic-decommissioned-machines page. IBM press release, 15 August 2001.
2002-03-11, Earth Simulator opens at Yokohama – the “Computenik” moment
The breakthrough: The Earth Simulator, built by NEC for JAMSTEC, opened on 11 March 2002 at the Yokohama Institute of Earth Sciences. Construction had begun October 1999, after a planning effort that started in 1997 with Hajime Miyoshi (director of the Earth Simulator Research and Development Center). Architecture: 640 nodes of NEC SX-6-derived custom vector processors (eight per node), 5 120 vector processors total, 10 TB main memory, peak 40.96 TFLOPS, sustained Linpack 35.86 TFLOPS in June 2002. Project cost approximately ¥60 billion (around $500 M USD at contemporary exchange rates). Consequence: Earth Simulator was nearly five times faster than the second-place ASCI White. Jack Dongarra of the University of Tennessee, the maintainer of TOP500, described it as a “Computenik” moment – a deliberate reference to the 1957 Sputnik shock to American science. The US response was the High Productivity Computing Systems programme (DARPA, 2002-2010) and the doubling of NSA and DOE supercomputing budgets through 2003-2006. Earth Simulator held TOP500 #1 for five consecutive lists (June 2002 to June 2004), then the longest reign of any machine on the list. It produced the first global atmosphere-ocean simulations at 10 km resolution. Source: TOP500 Earth Simulator page. Wikipedia, “Earth Simulator.” HPCwire 17 January 2003, “The Computenik from Yokohama.” Phys.org news 23 June 2004.
2004-11, Blue Gene/L at LLNL takes TOP500 #1
The breakthrough: A 16-rack Blue Gene/L (1 024 compute nodes per rack, 16 384 nodes total) achieved 70.72 TFLOPS Linpack in November 2004, displacing Earth Simulator. The architecture, developed at IBM Watson under chief architect Alan Gara (with roots in Monty Denneau’s Blue Gene/C “Cyclops” research) under the joint DOE-IBM Advanced Architecture Research programme, used embedded 700 MHz PowerPC 440 cores with custom dual-pipeline double-precision FPUs and a built-in DRAM controller. Each ASIC carried two cores and the supporting communication logic. Tilak Agerwala at IBM Research (then VP) was Blue Gene programme executive; Manish Gupta led the system software. Consequence: Blue Gene/L held TOP500 #1 from November 2004 until Roadrunner displaced it in June 2008 – seven consecutive lists, the longest #1 reign in TOP500 history – peaking at 478 TFLOPS in November 2007 with the LLNL system fully populated. The architectural lesson was that energy efficiency at scale required low-power embedded cores, not desktop-class CPUs. Blue Gene’s roughly 200 MFLOPS/W in 2004-2008 was roughly two orders of magnitude better than ASCI White (see the arithmetic below), foreshadowing the GPU-driven energy efficiency of the 2010s. Blue Gene/L won the SC20 “Test of Time” award in 2020. Source: Wikipedia, “IBM Blue Gene.” LLNL article, “LLNL, IBM win SC20 ‘Test of Time’ for Blue Gene/L.” LLNL ASC historic-decommissioned-machines page.
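A back-of-envelope check of that efficiency gap, using only the figures already cited in this compendium (ASCI White: 4.9 TFLOPS sustained Linpack on roughly 3 MW of compute power; Blue Gene/L: roughly 200 MFLOPS/W):

$$\frac{4.9\times10^{12}\ \text{FLOPS}}{3\times10^{6}\ \text{W}} \approx 1.6\ \text{MFLOPS/W} \quad\text{versus}\quad \approx 200\ \text{MFLOPS/W} \;\Rightarrow\; \text{a gap of roughly } 120\times.$$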
The GPU origin story (2003-2008)
2003-2004, “Brook for GPUs” at Stanford, Ian Buck
The breakthrough: Ian Buck, a CS Ph.D. student in Pat Hanrahan’s group at Stanford, developed Brook – a streaming programming language for general-purpose computing on graphics hardware. Brook for GPUs was published at SIGGRAPH 2004 (Buck, Foley, Horn, Sugerman, Fatahalian, Houston, Hanrahan). Brook abstracted the fragment-shader pipeline of programmable GPUs as a stream-and-kernel model, hiding the DirectX/OpenGL graphics conventions from scientific users. Buck previously interned at NVIDIA during his undergraduate years at Princeton; his Stanford work was DARPA-funded. Consequence: NVIDIA hired Buck in 2004. Buck and John Nickolls (NVIDIA’s director of GPU architecture) evolved Brook into CUDA. Without the Stanford streaming-computation precedent, the 2006 G80 architectural decision would not have included programmability for non-graphics workloads. Source: Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P. “Brook for GPUs: stream computing on graphics hardware,” ACM Transactions on Graphics 23(3):777-786, 2004. Communications of the ACM article, “The Origins of GPU Computing.”
2006-11-08, NVIDIA G80 / GeForce 8800 GTX – the first programmable SIMT GPU
The breakthrough: NVIDIA launched the GeForce 8800 GTX on 8 November 2006. Architecture codenamed G80: 681 million transistors on a 480 mm² die (90 nm), the largest commercial GPU ever fabricated to that date. 128 stream processors (later renamed CUDA cores) running in lockstep groups (warps) of 32 threads under Single Instruction Multiple Thread organisation – the architectural innovation that distinguished SIMT from earlier SIMD by allowing per-thread divergence with hardware-managed mask predication. 768 MB GDDR3 across a 384-bit memory bus. The first consumer card to abandon dedicated pixel-and-vertex shaders for unified stream processors. Consequence: The G80 is the architectural ancestor of every modern NVIDIA GPU through 2026. The unified-shader architecture allowed the same hardware to run graphics or general-purpose compute. Combined with CUDA the following year, G80 began the transformation of GPUs from gaming peripherals into the computational substrate of scientific computing and machine learning. Source: Wikipedia, “GeForce 8 series.” NVIDIA Fermi white paper. ExtremeTech, “10 years ago, Nvidia launched the G80-powered GeForce 8800.”
2007-06-23, NVIDIA CUDA 1.0 released
The breakthrough: NVIDIA released CUDA Toolkit 1.0 on 23 June 2007 (officially announced 25 June). The initial CUDA SDK had been made available 15 February 2007 for Windows and Linux. CUDA was a C-language extension with __global__, __device__, and __host__ function attributes, and a thread-block-grid programming model that mapped naturally onto the SIMT hardware of G80. Version 1.0 added asynchronous kernel calls and 64-bit Linux support. Consequence: Within five years CUDA was the dominant general-purpose GPU computing language. Every major scientific computing library (cuBLAS, cuFFT, MAGMA, ScaLAPACK on GPUs) was built on CUDA. NVIDIA’s investment in CUDA – amounting to several billion dollars across 2006-2014 – was the strategic bet that transformed the company from a gaming-graphics vendor into the dominant scientific-computing and AI silicon supplier. Source: NVIDIA Developer Forums, “CUDA 1.0 Released,” June 2007. Wikipedia, “CUDA.” InsideHPC, July 2007.
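To make the thread-block-grid model concrete, here is a minimal CUDA C sketch in the style the toolkit introduced (the kernel and variable names are illustrative; cudaMallocManaged is a later convenience used only to keep the example short – CUDA 1.0 required explicit cudaMalloc/cudaMemcpy): each thread handles one array element, threads are grouped into blocks, and the grid of blocks is what the SIMT hardware schedules as 32-thread warps.

```cuda
// Minimal CUDA sketch: one thread per array element, launched as a grid of blocks.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    // Each thread derives its global index from its block and thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the last block may be partially full
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory (a later CUDA feature) keeps the sketch short
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threadsPerBlock = 256;                          // 8 warps of 32 threads per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, 3.0f, x, y);  // asynchronous kernel launch
    cudaDeviceSynchronize();                            // wait for the GPU to finish

    printf("y[0] = %f\n", y[0]);                        // expect 5.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```

The __global__ attribute marks the function that runs on the device; the <<<blocks, threads>>> launch configuration is the grid-block mapping described above.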
2008 (Q4), NVIDIA stock falls 80%, Jensen Huang’s bet on CUDA at the brink of bankruptcy
The crisis: NVIDIA’s stock dropped more than 80% from October 2007 to November 2008 during the global financial crisis. Demand for discrete graphics cards collapsed; gaming revenue fell precipitously. The “bumpgate” defect crisis added a $196 million Q2 fiscal-2009 charge against cost of revenue to cover warranty, repair, and replacement of laptop GPUs and MCPs (notebooks using GeForce 6000, 7000, 8000, early 9000, and MCP5x/6x/7x parts) that failed at higher than normal rates due to a weak die/packaging material set. Jensen Huang – NVIDIA’s co-founder and CEO since the company’s founding in 1993 – said NVIDIA was “one month away from going out of business.” The bet: Huang refused to abandon CUDA. The company continued to invest heavily in GPU compute despite investors demanding a course correction. Gross margins suffered through 2009-2010. Consequence: When AlexNet won ImageNet on 30 September 2012 using two GeForce GTX 580s, NVIDIA was the only company in the world with a mature GPU-compute software stack (CUDA, cuBLAS, cuFFT – cuDNN would not arrive until 2014) that machine-learning researchers could pick up and use. The 2008 bet became the foundation of NVIDIA’s $3+ trillion market capitalisation by 2024-2025. The 2008 near-collapse and the 2012 AlexNet rescue are, together, one of the most consequential corporate-strategic episodes in 21st-century technology history. Source: Inc. Magazine, “3 Leadership Lessons From a CEO Who Turned an Idea Into a $1.4 Trillion Company.” The Register July 2008, “Nvidia throws itself under the bus with chip defect, delays and lost sales.” Wikipedia, “Jensen Huang.”
The hybrid era and the petaflop barrier (2008-2012)
2008-05-25, IBM Roadrunner at LANL breaks the 1 petaflop barrier
The breakthrough: Roadrunner sustained 1.026 petaflop/s on Linpack on 25 May 2008 (its fourth attempt; the first three runs failed due to system issues). System cost approximately $121-133 million depending on accounting (sources variously cite $121 M and $133 M). Architecture: 12 960 IBM PowerXCell 8i Cell Broadband Engine accelerators paired with 6 480 dual-core AMD Opteron host CPUs across 3 060 compute nodes, networked by InfiniBand. Total 116 640 cores. Programme leader at LANL: Andy White. ASC programme manager: Kenneth Koch. Don Grice was IBM lead architect. Project planning began at LANL in 2002. Consequence: First petaflop machine in production. Roadrunner held TOP500 #1 from June 2008 to November 2009 (three lists). The hybrid CPU-plus-accelerator architectural pattern – each compute node has a host CPU plus one or more accelerator chips – became the dominant architectural pattern of every leadership-class system from 2012 onwards (Titan, Summit, Sierra, Frontier, Aurora). Roadrunner, however, used Cell as the accelerator; subsequent systems used GPUs. The architectural pattern won; the specific accelerator choice did not. Source: IBM history page, “Breaking the petaflop barrier.” Wikipedia, “Roadrunner (supercomputer).” TOP500 Roadrunner page. HPCwire, “IBM Roadrunner Takes the Gold in the Petaflop Race,” 9 June 2008. IEEE Xplore, “Entering the petaflop era: The architecture and performance of Roadrunner.”
2008-11, Tsubame 1.2 at Tokyo Tech – first GPU-accelerated TOP500 entry
The breakthrough: Satoshi Matsuoka at the GSIC Center, Tokyo Institute of Technology, upgraded the existing Tsubame 1.0 (commissioned 2006) by adding 170 NVIDIA Tesla S1070 1U server units (each containing four Tesla T10 GPUs) for a total of 680 Tesla GPUs alongside the existing AMD Opteron host nodes. The hybrid system, designated Tsubame 1.2, achieved 77.48 TFLOPS Linpack in November 2008. First major GPU-accelerated supercomputer to make the TOP500. Consequence: Validated NVIDIA’s CUDA strategy at the supercomputing scale. Subsequent systems Tsubame 2.0 (2010, 2.4 PFLOPS, NVIDIA Tesla M2050), Titan (ORNL 2012, 17.59 PFLOPS, NVIDIA K20X), and onwards built on the architectural template Matsuoka had piloted. Matsuoka received the 2014 IEEE Computer Society Sidney Fernbach Award for the Tsubame work. The architectural lesson Tsubame proved was that COTS GPUs at supercomputer scale produced higher FLOPS/W density than any CPU-only design. Source: Wikipedia, “Tsubame (supercomputer).” Tokyo Tech News, “Supercharging a supercomputer,” 2009. PCWorld, “Inside Tsubame – the Nvidia GPU Supercomputer.” HPCwire, “The Second Coming of TSUBAME,” October 2010.
2009-11, IBM cancels PowerXCell successor – Cell Broadband Engine commercially abandoned
The failure: David Turek, IBM VP of Deep Computing, confirmed in November 2009 that there would be no successor to the PowerXCell 8i. The rumoured PowerXCell 32i was cancelled internally. By January 2012 IBM had discontinued the entire QS22 and QS21 Cell-based blade server line. Cell Broadband Engine had been jointly developed by Sony, Toshiba, and IBM (the STI alliance, formed 2001) with over 400 engineers across a four-year programme starting March 2001. The architecture combined a PowerPC PPE with eight Synergistic Processing Elements (SPEs). The reason: Programming the Cell required hand-tuned software for the heterogeneous PPE-and-SPE architecture, with explicit DMA management between the SPE local stores and main memory. The programming model defeated all but a handful of expert teams. Lisa Su, then at IBM and later AMD CEO, helped lead the chip’s design and later spoke about its “infamous” reputation. The development cost was enormous; the only major non-PlayStation customer was Roadrunner, a one-off DOE machine. Consequence: Cell, despite its architectural sophistication, lost the GPGPU race to NVIDIA’s CUDA-on-GPU because of its programming model. By the time Roadrunner was decommissioned on 31 March 2013 – five years after the petaflop run – Cell-based blade servers were discontinued and there was no upgrade path. The Roadrunner facility was retired because it consumed 2 345 kW for 1.042 PFLOPS (444 MFLOPS/W), versus comparable 2013-era machines at 888+ MFLOPS/W. Cell is the canonical case-study failure of an architecturally clever processor that lost the market because nobody could programme it productively. Source: HPCwire, “IBM Cuts Cell Loose,” 24 November 2009. InsideHPC, “No future development for IBM’s PowerXCell,” 2009. Wikipedia, “Cell (processor).” Tom’s Hardware on Lisa Su and the PS3 Cell. LANL Daily Post, “End of the Road for Roadrunner,” March 2013.
2009-12-04, Intel cancels Larrabee discrete-GPU launch – “the project that would have changed AI”
The failure: Intel announced 4 December 2009 that the first-generation Larrabee would not be released as a consumer GPU product. Larrabee, in development since 2006-2007 and announced at SIGGRAPH 2008, was Intel’s first attempt at a programmable many-core x86-derived processor that would compete with NVIDIA on graphics and on general-purpose compute. The architecture used a vectorised version of the original Pentium core with a 16-wide SIMD unit per core, scaling to dozens of cores per chip. The cancellation reason: silicon and software development were behind schedule, and the projected performance was approximately one-fifth of competing graphics boards at the planned launch date. The discrete-GPU programme was formally terminated in May 2010. The Gelsinger context: Pat Gelsinger was Intel’s chief architect during the Larrabee programme; he left Intel in 2009 partly over the cancellation. He returned as CEO in February 2021. At a 2024 NVIDIA GTC fireside chat with Jensen Huang, Gelsinger said that “had Intel stayed on that path, the future could have been different at that point.” At a separate 2024 MIT session, he said his exit from Intel was triggered when management “killed the project that would have changed the shape of AI.” Consequence: Larrabee’s IP was rolled into the Xeon Phi (Knights Ferry, Knights Corner, Knights Landing, Knights Mill, Knights Hill) – the manycore x86-derived accelerator that Intel marketed against NVIDIA GPUs from 2010-2018. None of the Phi line ever caught up with NVIDIA’s GPU+CUDA combination. Intel’s missed shot at the GPU market is the most consequential strategic mistake of the 21st-century semiconductor industry; it is what allowed NVIDIA to dominate AI compute. Source: Wikipedia, “Larrabee (microarchitecture).” HPCwire, “Intel Cancels 2010 Larrabee Debut,” 7 December 2009. SlashGear, “Intel’s Larrabee GPU Could Have Rivaled Nvidia, So Why Was It Discontinued?” IEEE Computer Society “Chasing Pixels: Intel’s Fourth Graphics Attempt.” TechSpot, “The Last Time Intel Tried to Make a Graphics Card.”
The cluster matures (2010-2014)
2011-10-12, AMD Bulldozer launches – six years of failure for AMD CPUs
The failure: AMD launched the FX-8150 (“Zambezi”), the first Bulldozer-microarchitecture CPU, on 12 October 2011. The architectural innovation was the “module” – two integer cores sharing a single floating-point unit, fetch/decode logic, and L1 instruction cache – intended to maximise core count in a given die area. AMD marketed eight-core FX parts (the 8150 advertised eight cores; actually four modules of two integer cores each). In AnandTech’s launch review, the FX-8150 was 40-50% slower in single-threaded workloads than comparable Intel Sandy Bridge i7 parts. AMD VP Andrew Feldman later called Bulldozer “without doubt an unmitigated failure.” A 2015 class-action lawsuit settled in 2019 for $12.5 million on the eight-cores marketing claim. The recovery: AMD’s 2012-2017 CPU revenue collapsed; the company nearly lost its server business entirely. Lisa Su became CEO 8 October 2014 and inherited the wreckage. Su ordered Bulldozer abandoned and a clean-sheet Zen architecture commissioned (under chief architect Jim Keller, who had returned to AMD in 2012 specifically for this project). Zen launched as Ryzen 1000 in March 2017 with a 52% IPC gain over Bulldozer’s last revision (Excavator) – the largest single-generation IPC improvement in modern x86 history. Consequence: AMD lost most of a decade in CPU competitiveness. By the time Zen recovered the desktop market in 2017-2019 and the server market in 2019-2024, NVIDIA had captured the entire AI-compute opportunity. Bulldozer is the architectural failure that delayed AMD’s competitiveness with Intel by six years – but it does not explain why neither AMD nor Intel reached AI compute in time. Source: Wikipedia, “Bulldozer (microarchitecture).” Tom’s Hardware, “AMD Settles FX Bulldozer False Advertising Lawsuit.” Anandtech, “AMD Bulldozer ‘Core’ Lawsuit,” 2019. ChipsAndCheese, “Bulldozer, AMD’s Crash Modernization.”
2012-09-30, AlexNet wins ImageNet, the deep-learning rescue moment
The breakthrough: Alex Krizhevsky, working in Geoffrey Hinton’s University of Toronto group with Ilya Sutskever as collaborator, submitted a convolutional neural network entry to the ImageNet 2012 Challenge (ILSVRC). The model, eight layers deep with 60 million parameters and 650 000 neurons, achieved a top-5 error rate of 15.3%, more than 10.8 percentage points lower than the runner-up (the second-place team, with a hand-engineered SIFT-and-Fisher-Vectors approach, reached 26.2%). AlexNet was trained for 90 epochs over five to six days on two NVIDIA GeForce GTX 580 GPUs (3 GB VRAM each) in Krizhevsky’s bedroom at his parents’ house in Toronto. Because the model did not fit in a single GPU’s VRAM, it was split into two parallel pipelines that synchronised at three layers. The Hinton-Sutskever collaboration: Sutskever, then a Hinton Ph.D. student, had convinced Krizhevsky – who he knew was the strongest GPGPU programmer in the lab – to train a convolutional network on ImageNet. Hinton was the principal investigator; the NeurIPS 2012 paper has him as third author. Krizhevsky, Sutskever, and Hinton sold their three-person company DNNresearch to Google in March 2013 for an undisclosed sum (rumoured at $44 million). Consequence: At the 2012 European Conference on Computer Vision, Yann LeCun described AlexNet as “an unequivocal turning point in the history of computer vision.” Before AlexNet, almost no leading computer-vision papers used neural nets; after it, almost all of them did. Within five years, every major image-recognition system was built on a CNN trained on GPUs. AlexNet is the moment when general-purpose GPU computing transitioned from a niche scientific application to the substrate of an entire economic sector. The dataset itself – the ILSVRC subset of 1.2 million training images across 1 000 categories, drawn from the full ImageNet database of more than ten million images organised according to the WordNet hierarchy – was the work of Fei-Fei Li’s group at Stanford and Princeton, with crowdsourced labelling by 49 000 Mechanical Turk workers across 167 countries from July 2008 to April 2010. Source: Krizhevsky, A., Sutskever, I., Hinton, G. E. “ImageNet Classification with Deep Convolutional Neural Networks,” NeurIPS 2012. Wikipedia, “AlexNet.” IEEE Spectrum, “How AlexNet Transformed AI and Computer Vision Forever.” Pinecone Learn, “AlexNet and ImageNet: The Birth of Deep Learning.”
The exascale push and the failures along the way (2013-2018)
2013-09 to 2016-10, ECMWF Cray XC30 procurement and operational migration
The breakthrough: ECMWF awarded Cray a contract in June 2013 for two Cray XC30 supercomputers (codenamed CCA and CCB at Reading) and a Cray Sonexion storage system, replacing the IBM Power7-based system that had been in production. ECMWF and Cray celebrated successful migration of operational suites in September 2014. A subsequent $36 million upgrade contract was signed January 2016, with system deliveries through 2016 reaching production by October 2016. Sustained performance approximately 200 TFLOPS (versus the predecessor IBM Power7’s roughly 70 TFLOPS sustained); peak approximately 3 600 TFLOPS. Consequence: The Cray XC30 era at ECMWF (2014-2020) was the Centre’s final Cray generation. The next system (Atos BullSequana XH2000, contracted January 2020, in production at the new Bologna data centre from 2022) also kept to CPU-only nodes – AMD Rome/Milan, with no GPU acceleration – a deliberate institutional decision based on the production-readiness of the IFS code on x86 vector instructions. Together, the XC30 and its BullSequana successor kept ECMWF’s operational forecasting on CPU-only hardware right up to the arrival of the GPU- and ML-augmented era. Source: ECMWF Newsletter 147, “Supercomputer upgrade is under way.” InsideHPC January 2016, “ECMWF to Upgrade Cray XC Supercomputers.” HPCwire January 2020, “Atos-AMD System to Quintuple Supercomputing Power at ECMWF.”
2015-09, Bauer-Thorpe-Brunet “Quiet Revolution” Nature paper
The breakthrough: Peter Bauer (ECMWF), Alan Thorpe (then ECMWF Director-General), and Gilbert Brunet (then Environment and Climate Change Canada) published “The quiet revolution of numerical weather prediction” in Nature 525:47-55, 3 September 2015. The paper documented the steady accumulation of forecasting skill at the major centres – ECMWF, NCEP, Met Office, JMA, DWD – from approximately 1980 to 2015. The “quiet revolution” framing was their specific phrase: NWP improvement had been a steady accumulation of scientific knowledge and technological advances, not a series of fundamental physics breakthroughs. Consequence: The Bauer-Thorpe-Brunet paper became the standard reference for everyone who needed to explain to a non-meteorology audience that operational weather forecasting had quietly become one of the most successful applied-science enterprises of the past half-century. The paper’s specific claim – that forecast skill in the 3-to-10-day range had been improving by roughly one day of lead time per decade, so that a six-day forecast in 2015 was about as accurate as a five-day forecast a decade earlier – was the canonical statistic for the “quiet revolution.” Bauer subsequently led the EU’s Destination Earth digital-twin programme. Source: Bauer, P., Thorpe, A., Brunet, G. “The quiet revolution of numerical weather prediction,” Nature 525:47-55, 2015. ECMWF news article, 2015.
2017-11 to 2018-07, Intel kills Knights Hill, then the whole Xeon Phi line – Aurora reborn
The failure: Intel confirmed in November 2017 that Knights Hill – the planned third generation of Xeon Phi, intended to power the original Aurora pre-exascale system at Argonne National Laboratory in 2018 – had been cancelled. Knights Hill was to deliver 180-200 PFLOPS in the original DOE contract. The cancellation forced DOE to rework the Aurora contract: target performance was raised to over 1 EFLOPS (an exascale system), the deployment date pushed to 2021, and the architecture changed to Intel Sapphire Rapids CPUs plus Ponte Vecchio GPUs. Knights Mill was the last generation actually shipped; Intel announced the discontinuation of the entire Xeon Phi line on 27 July 2018, with last orders accepted to August 2018 and final shipment July 2019. Why it failed: General-purpose GPUs had a several-year head-start in CUDA tooling and software ecosystems. Intel marketing initially heralded Phi as “as easy to use as Xeon,” claiming a programmability advantage over NVIDIA GPGPUs; in practice users found that meaningful performance required extensive code rewrites and careful tuning. The 10 nm fabrication delays at Intel further damaged the cost equation. Aurora finally arrived at Argonne in November 2023 (TOP500 debut) and broke the exascale barrier in May 2024 at 1.012 EFLOPS – six years after the originally planned 2018 delivery, with completely different silicon, on a contract that had been rewritten twice. Consequence: Intel’s roughly fifteen-year, billion-dollar manycore programme – Larrabee, Knights Ferry, Knights Corner, Knights Landing, Knights Mill, Knights Hill – ended in 2018 with no surviving product. The 27 July 2018 Phi cancellation is the moment that Intel admitted the GPU war was lost. Every subsequent DOE leadership-class system used either NVIDIA or AMD GPUs. Source: TOP500, “Intel Dumps Knights Hill, Future of Xeon Phi Product Line Uncertain,” 2017. HPCwire, “Requiem for a Phi: Knights Landing Discontinued,” 25 July 2018. AnandTech, “The Larrabee Chapter Closes: Intel’s Final Xeon Phi Processors Now in EOL.” HPCwire, “Aurora the Survivor: Exascale Supercomputer Arrives After Eight Years of Doom,” 13 November 2023. Tom’s Hardware on Knights Mill and Knights Landing in LLVM.
2017-2022, the cryptocurrency-mining GPU shortage – scientific computing displaced by speculation
The failure: From 2017 onwards, GPU prices were repeatedly inflated by cryptocurrency-mining demand. The first major shortage in 2017 was driven by Ethereum mining (when Bitcoin and altcoin prices rose >1 000%). The second, much larger 2021 shortage saw Jon Peddie Research estimate that 25% of GPUs shipped in Q1 2021 went to cryptocurrency miners and speculators. Bloomberg estimated cryptocurrency miners spent approximately $15 billion on GPUs from 2021 onwards. Some 2021-era cards retailed at four times MSRP; many ML and scientific-computing groups could not procure GPUs at any reasonable price. The shortage ended September 2022, when Ethereum’s “merge” to proof-of-stake made GPU mining unprofitable. Consequence: For five years, scientific computing had to compete with cryptocurrency speculation for its own hardware substrate. Academic ML labs and university HPC centres systematically lost access to current-generation GPUs through 2017-2022; only the largest commercial AI labs (OpenAI, Google DeepMind, Anthropic, Meta FAIR) and well-funded national HPC centres could secure adequate compute. The shortage materially delayed academic AI research and shifted the competitive advantage decisively toward well-resourced industrial labs. Ethereum’s proof-of-stake transition in September 2022 ended the shortage just as the AI demand-shock was about to begin. Source: Wikipedia, “GPU mining.” Coindesk, “Morgan Stanley: GPU Demand Likely to Slow if Ethereum Moves to Proof-of-Stake,” June 2022. Vice, “Cryptocurrency Mining Is Fueling a GPU Shortage.” Priceonomics, “How Much Did Cryptocurrency Mining Inflate GPU Prices?” AI Impacts Wiki, “Factors that affect the price of GPUs.”
2018-01-03, Spectre and Meltdown disclosed – speculative execution as a security disaster
The failure: Spectre (CVE-2017-5753, CVE-2017-5715) and Meltdown (CVE-2017-5754) were publicly disclosed on 3 January 2018, after a six-month embargo (vendors notified 1 June 2017). The two side-channel vulnerabilities exploited the speculative-execution behaviour of out-of-order CPUs to leak data across security boundaries. Meltdown affected almost every Intel CPU since 1995 (except Itanium and pre-2013 Atom); Spectre affected essentially every modern CPU including AMD and ARM. Patching required microcode updates plus operating-system kernel changes (KPTI – Kernel Page-Table Isolation), with performance penalties of 5-30% on I/O-heavy workloads. Microsoft, Linux, Apple, and major cloud vendors emergency-patched through January-February 2018. On 28 January 2018 it was reported that Intel had shared news of the vulnerabilities with Chinese technology companies before notifying the US government. Consequence: Speculative execution, the architectural innovation that had defined high-performance CPUs since the 1995 Pentium Pro (the modern revival of Tomasulo’s 1967 algorithm – the same algorithm Post 31 in this series describes), became a security liability rather than an unalloyed win. A long sequence of follow-on side-channel disclosures (Foreshadow, Zombieload, RIDL, Fallout, MDS, Downfall, Inception, Reptar, GhostRace, ZenBleed) extended the era of speculative-execution-vulnerability patching into 2024. The architectural lesson is that the per-CPU-cycle performance benefits of speculation could not be cleanly separated from cross-tenant security risk. Cloud computing – where untrusted tenants share a single CPU – has had to live with this trade-off ever since. Source: Wikipedia, “Spectre (security vulnerability)” and “Meltdown (security vulnerability).” CISA, “Meltdown and Spectre Side-Channel Vulnerability Guidance,” 4 January 2018. Meltdownattack.com. IEEE Spectrum, “How the Spectre and Meltdown Hacks Really Worked.”
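The core of the Spectre variant-1 problem fits in a few lines of C. The sketch below follows the canonical bounds-check-bypass gadget from the Kocher et al. disclosure (array names follow their paper’s convention; the surrounding attack code – branch-predictor training and the cache-timing probe – is omitted). If the branch predictor has been trained to expect x in bounds, the CPU speculatively executes the body even for a malicious out-of-bounds x, and the secret-dependent load leaves a cache footprint that a timing probe can read out afterwards.

```c
#include <stddef.h>
#include <stdint.h>

/* Spectre variant 1 (bounds-check bypass), after Kocher et al., 2018.
 * array1 is indexed by attacker-controlled x; array2 is the probe array
 * whose cache state acts as the covert channel. */
uint8_t array1_size = 16;
uint8_t array1[16];
uint8_t array2[256 * 4096];

void victim_function(size_t x) {
    if (x < array1_size) {
        /* Architecturally this body never runs for out-of-bounds x, but it can
         * run speculatively: array1[x] then reads a secret byte, and the load
         * below pulls in a cache line of array2 whose index encodes that byte. */
        volatile uint8_t temp = array2[array1[x] * 4096];
        (void)temp;
    }
}
```

The mitigations work by preventing such speculative paths from leaving observable state behind (index masking or speculation barriers for Spectre v1; KPTI for Meltdown), and that is where the 5-30% performance penalties cited above come from.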
The exascale era and the ML-forecasting revolution (2018-2026)
2018-06, Summit at ORNL takes TOP500 #1, US returns to the top
The breakthrough: Summit – IBM Power9 plus NVIDIA Volta GPUs at the Oak Ridge Leadership Computing Facility – debuted on the TOP500 in June 2018 at 122.3 PFLOPS sustained, with theoretical peak above 200 PFLOPS. Architecture: 4 608 IBM AC922 nodes, each with two POWER9 22-core CPUs and six NVIDIA Tesla V100 GPUs (16 GB or 32 GB HBM2 per GPU), interconnected by NVLink within each node and dual-rail Mellanox EDR InfiniBand between nodes. Each node had over 600 GB of coherent memory. Summit displaced Sunway TaihuLight from TOP500 #1 – the first time the US had topped the list since Titan in November 2012. Consequence: Summit was the production substrate for some of the first deep-learning-meets-HPC scientific applications. It produced the first exascale-class mixed-precision result on a science workload (1.88 exaops on a non-Linpack genomics benchmark) and was instrumental in early training experiments for large weather-and-climate models. Summit held TOP500 #1 from June 2018 to June 2020 and was decommissioned 15 November 2024. Source: Wikipedia, “Summit (supercomputer).” TOP500 Summit page. ORNL news release, “ORNL’s Summit Supercomputer Named World’s Fastest.” HPCwire, June 2018.
2017-06-19, Sunway TaihuLight at Wuxi – the first Chinese TOP500 #1 built entirely on indigenous silicon
Earlier context (filling chronological gap): Sunway TaihuLight, at China’s National Supercomputing Center in Wuxi, took TOP500 #1 in June 2016 at 93 PFLOPS Linpack. The June 2017 list (the third consecutive list it topped) is notable because it was only the second time in TOP500’s then-24-year history that the United States had failed to place any system in the top three. TaihuLight uses 40 960 SW26010 processors – a 260-core RISC chip designed by China’s National Research Center of Parallel Computer Engineering & Technology (NRCPC) – with a custom interconnect and zero Western silicon. Project lead: Guangwen Yang at Tsinghua University and the National Supercomputing Centre in Wuxi. The 2016 Gordon Bell Prize was awarded for a fully implicit nonhydrostatic atmospheric dynamics solver scaled to 10 million cores on TaihuLight. Consequence: The 2016 TaihuLight result, repeated in 2017, marked China’s emergence as a complete-stack supercomputing power. The US response was the renewed exascale push under DOE that produced Frontier in 2022. China subsequently, through 2018-2020, removed itself from voluntary TOP500 reporting – the official TOP500 list in 2020-2024 systematically undercounts top-tier Chinese systems including the Tianhe-3 and the OceanLight (Sunway successor). Source: Wikipedia, “Sunway TaihuLight.” TOP500, “China Tops Supercomputer Rankings with New 93-Petaflop Machine,” June 2016. TOP500 June 2017 list. Datacenter Knowledge, “China Continues to Rule HPC.”
2020-06-22, Fugaku at RIKEN takes TOP500 #1 with ARM-based silicon
The breakthrough: Fugaku, jointly developed by RIKEN R-CCS and Fujitsu, debuted at TOP500 #1 on 22 June 2020 with 415.5 PFLOPS Linpack. By November 2020 Fugaku reached 442.0 PFLOPS, plus 1.42 EFLOPS on the mixed-precision HPL-AI benchmark. Architecture: 158 976 Fujitsu A64FX 48-core ARM SoCs (the first ARM design to implement the Scalable Vector Extension, at 512-bit width), connected by Fujitsu’s proprietary Tofu interconnect D in a six-dimensional torus. Fugaku is named after Mount Fuji. Consequence: The first ARM-architecture system to top TOP500. Fugaku’s design choice – ARM with SVE, no GPUs – was a deliberate Japanese sovereign-technology decision driven by RIKEN’s preference for code-portability across the existing K computer code base. Fugaku held TOP500 #1 for four consecutive lists (June 2020 to November 2021), until Frontier displaced it in June 2022. For atmospheric simulation it produced the world’s highest-resolution global atmosphere simulations: 1 600-member ensembles for disaster-prevention prediction, NICAM at 220 m horizontal mesh size, and a 1 024-member ensemble data assimilation at 3.5 km mesh in 2020-2022. Source: Wikipedia, “Fugaku (supercomputer).” TOP500 Fugaku page. RIKEN news release, 23 June 2020. HPCwire, “Japan’s Fugaku Tops Supercomputing List 415 Petaflops,” 22 June 2020.
2020-2025, the slow death of Itanium / IA-64
The failure: Itanium had been Intel’s intended 64-bit-replacement-for-x86 architecture, rooted in HP’s PA-WideWord research from approximately 1989, jointly developed by Intel and HP from 1994, and launched commercially as Itanium 1 (Merced) in June 2001 – two years late. The architecture was a Very Long Instruction Word design using Explicitly Parallel Instruction Computing (EPIC), pushing instruction-level-parallelism extraction onto the compiler rather than hardware. Itanium was nicknamed “Itanic” (after the Titanic) by the trade press because of its troubled trajectory. AMD’s competing 64-bit x86 extension (x86-64, AMD64), released in the Opteron in April 2003, ran legacy 32-bit x86 software natively while Itanium did not, and AMD64 won the 64-bit transition uncontested. Microsoft announced in 2010 that Windows Server 2008 R2 would be the last server OS supporting Itanium. Intel’s last Itanium order date was January 2020; final shipments July 2021. HP-UX 11i v3 release 2505.11iv3 (final) was released 22 May 2025; HP-UX support ended 31 December 2025. Linux IA-64 support was removed from the kernel in November 2023. The architecture’s 24-year lifespan ended having captured under 1% of the market it was originally intended to replace. Consequence: Itanium was the largest sustained architectural failure of the post-2000 silicon era. Total Intel and HP investment in IA-64 has been estimated by industry analysts at over $20 billion across two decades. The lesson – compiler-only ILP extraction does not work well enough at execution time to compete with hardware out-of-order issue – was painfully relearned by Intel a decade after Tomasulo’s algorithm’s revival on the Pentium Pro. The supercomputer market briefly considered Itanium (NASA’s Columbia and several other SGI Altix systems, LLNL’s Thunder) but never deeply adopted it. The IA-64 death is also a marker of how long it takes a misbet to fully die. Source: Wikipedia, “Itanium.” Tom’s Hardware, “Itanium Waves Goodbye As Intel Delivers Last Shipments.” OSnews, “The Itanic Saga,” and “HP-UX hits end-of-life today.” Slashdot/Linux IA-64 support removed November 2023. The Register, “The last supported version of HP-UX is no more,” 5 January 2026.
2020-2024, the quantum-computing winter for scientific applications
The failure: Quantum-computing research saw sustained capital investment and technical progress through 2020-2024 – IBM, Google, IonQ, Rigetti, PsiQuantum, Quantinuum, and others all announced systems above 100 qubits, with some exceeding 1 000 qubits in 2024. However, none of these systems achieved a usable advantage over classical digital computers on a real scientific problem. Forrester Research noted in 2024 that “no one has publicly shown a problem solved by a quantum computer that is super usable in the real world.” PsiQuantum shifted its 1-million-qubit roadmap from 2025 to 2027. IBM’s 2033 roadmap target is 100 000 qubits. Total quantum-computing-company revenue across 2024 remained under $750 million globally. The principal scientific applications (chemistry, fluid dynamics, materials simulation) remained at proof-of-concept scale, with no production deployments at any major scientific institution. Consequence: From the perspective of operational scientific computing – weather, climate, materials, drug discovery – the 2020-2024 period was a quantum winter. The hype-and-investment cycle peaked around 2021 and contracted measurably from 2022 onwards as AI absorbed the investment that might otherwise have gone to quantum. As of mid-2026, no atmospheric-science workload runs usefully on a quantum machine. The principal proposed application – quantum simulation of fluid dynamics for high-Reynolds-number turbulence – is at approximately the same stage of maturity that GPU-based atmospheric forecasting was at in 2010. Source: Forrester Research via TechNewsWorld, December 2024. CNBC, “Quantum computing is having a moment. But the technology remains futuristic,” June 2025. Boston Consulting Group, “The Long-Term Forecast for Quantum Computing Still Looks Bright,” 2024. Scaleway Blog, “Quantum computing in 2024: The State of Play.”
2022-05-30, Frontier at ORNL breaks the exaflop barrier
The breakthrough: Frontier, an HPE Cray EX-architecture system at the Oak Ridge Leadership Computing Facility, achieved 1.102 EFLOPS Linpack in May 2022 and was announced as the first exaflop machine on the June 2022 TOP500 list. Architecture: 9 408 compute nodes on an HPE Slingshot-11 interconnect, each node pairing one custom 64-core AMD EPYC “Trento” CPU (an optimised third-generation EPYC) at 2.0 GHz with four AMD Instinct MI250X GPUs – roughly 37 600 GPUs in the debut configuration. Frontier’s test-and-development partition topped the Green500 at debut at 62.68 GFLOPS/W, with the full system close behind – the most energy-efficient leadership-class machine to that date. Consequence: First exascale machine in production in the West. The barrier had been a stated DOE programme goal since approximately 2008 (the Roadrunner-petaflop year). Frontier’s 14-year development from goal to delivery is a marker of the institutional cost of building a new architectural class. Frontier remained TOP500 #1 from June 2022 until El Capitan displaced it in November 2024. Source: Wikipedia, “Frontier (supercomputer).” TOP500 June 2022 list. HPE press release, May 2022. ORNL news release, “Frontier supercomputer debuts as world’s fastest, breaking exascale barrier.”
2022-11 to 2023-07, Pangu-Weather – ML beats physics for the first time
The breakthrough: Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian at Huawei Cloud (Shenzhen) preprinted Pangu-Weather on arXiv 4 November 2022 (arXiv:2211.02556). The paper was published in Nature 619:533-538, online 5 July 2023. The model was a 3D Earth-Specific Transformer (3D-EST) trained on 43 years of hourly ECMWF ERA5 reanalysis data, with approximately 256 million parameters. Pangu-Weather was the first ML-based global weather model to outperform the operational ECMWF Integrated Forecasting System (IFS) at all forecast lead times from 1 hour to 7 days, on all evaluated variables (geopotential, specific humidity, wind, temperature). 10 000-fold inference speedup over IFS HRES. Pangu-Weather was demonstrated in real time on Typhoon Khanun in August 2023, accurately tracking the storm five days ahead, despite having had no cyclone-specific training data beyond what is implicit in the ERA5 reanalysis. Consequence: Pangu-Weather was the moment that the operational meteorology community accepted that machine-learned models could outperform physics-based models on the medium-range forecasting benchmark. Within twelve months ECMWF had begun internal experiments with Pangu-Weather and several other ML systems. Bi, Xie, Zhang and colleagues received the 2023 Wu Wen-Tsun AI Special Achievement Award. The paper has been cited over 1 500 times by mid-2026. Source: Bi, K. et al. “Accurate medium-range global weather forecasting with 3D neural networks,” Nature 619:533-538, 5 July 2023. arXiv:2211.02556. Huawei press release, July 2023. MIT Technology Review, 5 July 2023.
2023-12-22 (online 14 November 2023), GraphCast – DeepMind’s medium-range model published in Science
The breakthrough: Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, Peter Battaglia at Google DeepMind published “Learning skillful medium-range global weather forecasting” in Science 382(6677):1416-1421, with online publication 14 November 2023. The model, a Graph Neural Network operating on a 0.25° latitude/longitude mesh (approximately 28 km horizontal resolution), produced a full 10-day global forecast in under one minute on a single Google TPU v4 machine. GraphCast outperformed ECMWF’s HRES on 90% of 1 380 verification targets including tropical cyclone tracks, atmospheric rivers, and temperature extremes. The benchmark comparison: A traditional 10-day HRES forecast at ECMWF requires hours of wall-clock time on a supercomputer with hundreds of nodes; GraphCast produces the same forecast in under a minute on a single TPU. Approximately a 10 000-fold inference-time speedup. Both Pangu-Weather and GraphCast achieved similar speedups against the same baseline. Consequence: With Pangu-Weather (Nature, July 2023) and GraphCast (Science, November 2023) the meteorological community had two independent demonstrations of the ML-beats-physics result within five months. DeepMind released the GraphCast code and trained weights as open source, and ECMWF began running it experimentally and publishing its output alongside its own charts. NOAA’s NCEP began running GraphCast experimentally, initialised from its operational GFSv16 GDAS analyses, with experimental real-time forecasts published since 5 February 2024. Source: Lam, R. et al. “Learning skillful medium-range global weather forecasting,” Science 382(6677):1416-1421, 14 November 2023. arXiv:2212.12794 (preprint December 2022). DeepMind blog, “GraphCast: AI model for faster and more accurate global weather forecasting,” 14 November 2023. GitHub: google-deepmind/graphcast.
2024-12-04, GenCast – DeepMind’s probabilistic ensemble in eight minutes
The breakthrough: Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R. Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, Remi Lam, Matthew Willson at Google DeepMind published “Probabilistic weather forecasting with machine learning” in Nature (online 4 December 2024). GenCast is a diffusion model adapted to the spherical geometry of the Earth, producing a 50-or-more-member ensemble of 15-day global forecasts at 0.25° resolution. Single 15-day forecast trajectory: 8 minutes on a Google Cloud TPU v5; the entire 50-member ensemble runs in parallel for the same 8 minutes. Compare to ECMWF ENS, the operational baseline – approximately 8 hours of wall-clock for a 51-member ensemble on the operational HPC. GenCast outperformed ECMWF ENS on 97.2% of evaluation targets, and 99.8% at lead times above 36 hours. Consequence: GenCast is the first ML-based probabilistic ensemble forecast that decisively beats the established operational ensemble at the same forecast horizon. It establishes that ML-based forecasting can close the gap not just on deterministic medium-range forecasting (Pangu-Weather, GraphCast) but on probabilistic ensemble forecasting – the more demanding regime that supports operational decision-making. ECMWF subsequently developed its own internal AIFS ensemble model on the same architectural pattern. Source: Price, I. et al. “Probabilistic weather forecasting with machine learning,” Nature, online 4 December 2024. arXiv:2312.15796. DeepMind blog, “GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy.”
2025-02-25 / 2025-07-01, ECMWF AIFS goes operational
The breakthrough: ECMWF took its Artificial Intelligence Forecasting System (AIFS) into operations on 25 February 2025. The first operational version, AIFS Single v1.0.0, ran a single deterministic forecast at a time. AIFS 1.1.0 (corrected precipitation forecasts) was released 27 August 2025. The ensemble version, AIFS ENS, entered operations on 1 July 2025: 51 members, full 15-day forecast, generating forecasts approximately 10 times faster than the physics-based IFS ENS while consuming approximately 1/1000th the energy. AIFS ENS outperforms IFS ENS by up to 25% on upper-air variables and up to 20% on surface temperature. Consequence: ECMWF – the institutional standard-bearer of physics-based numerical weather prediction since 1979 – now operates an ML-based forecasting system in production alongside its physics-based system. The Bauer-Thorpe-Brunet 2015 “quiet revolution” framing is being superseded by an architectural revolution that is anything but quiet. Within twelve months of GenCast’s Nature publication, the ML-based ensemble has gone from research demonstration to production operational deployment at the world’s leading medium-range forecasting centre. The institutional adoption rate is, by historical standards, extraordinary – roughly the speed at which ECMWF transitioned from CDC 7600 to Cray-1 in 1976-1979. Source: ECMWF News, “ECMWF’s AI forecasts become operational,” February 2025. ECMWF Newsletter 185, “AIFS ENS becomes operational.” ECMWF AIFS Machine Learning data page. arXiv:2509.18994 (AIFS 1.1.0 paper).
2024-05-13, Aurora at Argonne – the second exaflop machine, nine years after the original contract
The breakthrough: Aurora at Argonne National Laboratory entered the May 2024 TOP500 list at 1.012 EFLOPS Linpack (using 9 230 of 10 624 nodes – 87% of the system). Architecture: HPE Cray EX-architecture, Intel Xeon Max CPUs (Sapphire Rapids with HBM) plus Intel Data Center GPU Max (“Ponte Vecchio”) accelerators. Aurora is the second exaflop machine and the fastest AI system in the world dedicated to AI for open science (10.6 AI exaflops mixed-precision). The history: Aurora’s contract was first signed in 2015 with Intel as prime contractor and a 180-200 PFLOPS Knights-Hill-based system planned for 2018. After the 2017 Knights Hill cancellation, the contract was rewritten – new performance target above 1 EFLOPS, deployment slipped to 2021, architecture changed to Sapphire Rapids + Ponte Vecchio. Further delays through 2021-2023 around the Ponte Vecchio GPU launch pushed deployment to 2023 and the exaflop submission to May 2024. Total elapsed time from contract to operational exaflop: nine years. Consequence: Aurora’s late delivery is the cleanest example of how the death of Knights Hill cascaded across a decade. By the time it ran, AMD had already delivered Frontier at higher Linpack and at lower power. Intel’s GPU-compute strategy (Ponte Vecchio, the cancelled Rialto Bridge follow-on, and the repeatedly delayed Falcon Shores) has not produced a clear win in the leadership-class HPC market since the original 2015 Aurora contract. Aurora debuted at TOP500 #2 behind Frontier; both have since been overtaken by El Capitan at LLNL, which took #1 in November 2024. Source: Wikipedia, “Aurora (supercomputer).” HPE press release, 13 May 2024. Argonne ALCF, “Argonne’s Aurora supercomputer breaks exascale barrier.” HPCwire, “Some Reasons Why Aurora Didn’t Take First Place in the Top500 List,” May 2024. The Register, “Aurora becomes US’s second exaFLOPS super behind Frontier.”
What this means for Post 36
The cluster-and-GPU era is the only architectural era in this story where the failures and the successes are inseparable. Cell B.E. failed but proved the hybrid CPU-plus-accelerator architectural pattern. Larrabee failed but became the (also-failed) Xeon Phi line, which forced DOE to procure GPUs in volume, which financed NVIDIA’s CUDA build-out, which was the substrate of AlexNet, which made deep learning the dominant ML methodology, which produced Pangu-Weather and GraphCast, which now run side by side with the IFS at ECMWF.
The 2008 NVIDIA financial crisis is the load-bearing pivot of the entire era. If Jensen Huang had cut CUDA investment in late 2008 the way his investors demanded, AlexNet in 2012 would have run on something else (or not at all), and the ML-weather revolution would have been delayed by approximately a decade. The corporate-strategic decision in late 2008 produced, by direct causal chain, the 2025 ECMWF AIFS production deployment.
Bulldozer cost AMD a decade. Itanium cost Intel and HP twenty billion dollars. Larrabee/Knights cost Intel another decade and the GPU-compute market. The Cell B.E., Sun’s UltraSPARC T-series, and the SGI Cray acquisition each carry their own “right idea, wrong execution” lessons. The cluster-and-GPU era is also where the architectural verdict from 1972 (ILLIAC IV’s SIMD philosophy) was finally vindicated: Slotnick’s bet won the long game, fifty years later, on G80.
The Bauer-Thorpe-Brunet “quiet revolution” framing held from 2015 to roughly 2023. With Pangu-Weather, GraphCast, GenCast, and the operationalisation of AIFS, the revolution is no longer quiet. The nine-month-of-skill horizon that Shukla’s 1981 monthly-mean predictability framework promised has, in essence, arrived – not on a Cray, not on a Beowulf cluster, but on a Google TPU pod running a diffusion model trained on forty years of ECMWF reanalysis data.
That is the punchline Post 36 has to land: the questions Richardson asked in 1922 are now being answered, every six hours, by neural networks running on hardware that the ILLIAC IV team would have recognised, on data that ECMWF began collecting in 1979, by institutional combinations (DeepMind+ECMWF, Huawei+China Meteorological Administration, NVIDIA+NCAR) that would have been unthinkable in any prior architectural era.