xAI cluster is now the world’s most powerful AI training system, but questions remain about storage capacity, power consumption, and why it’s even called Colossus
We recently got a glimpse of what’s being done with $1 billion worth of AI GPU technology when Elon Musk shared a quick video tour of Cortex, Tesla’s AI training supercomputer currently under construction at the company’s Giga Texas factory in Austin.
Now Musk has announced on X, his social media platform, that Colossus, xAI’s new training cluster of 100,000 Nvidia H100 GPUs, is operational.
Musk claims that Colossus is “the world’s most powerful AI training system” and that it was built “from start to finish” in just 122 days. That’s quite a feat. Servers for the xAI cluster were reportedly supplied by Dell and Supermicro, with the project estimated to cost $3-4 billion.
This weekend, the @xAI team launched our Colossus 100k H100 training cluster. From start to finish, it took 122 days. Colossus is the most powerful AI training system in the world. Plus, it will double in size to 200k (50k H200s) in a few months. Excellent…
(Elon Musk on X, September 2, 2024)
Tom’s Hardware notes: “While all of these clusters are formally operational and even training AI models, it’s completely unclear how many are actually online today. For one, it takes time to debug and optimize the settings of those superclusters. Second, X needs to make sure they have enough power, and while Elon Musk’s company used 14 diesel generators to power its Memphis supercomputer, they still weren’t enough to power all 100,000 H100 GPUs.”
The Colossus system is set to eventually double in capacity to 200,000 GPUs, with plans to add another 100,000: 50,000 H100 units and 50,000 of Nvidia’s next-generation H200 chips. The supercluster will primarily be used to train Grok-3, xAI’s next and most advanced AI model. We haven’t seen any mention of storage for the new system yet, but it will need to be massive.
Where does the name Colossus come from?
The name of the new supercomputer has raised a few eyebrows, however, with people noting that it shares its name with Colossus: The Forbin Project, a 1970 science fiction film (based on D.F. Jones’s 1966 novel Colossus) about a supercomputer that becomes sentient after being handed control of the U.S. nuclear arsenal. Things, unsurprisingly, go horribly wrong for humanity.
Both the novel and the film explore timely themes of AI autonomy, the dangers of handing control over to machines, and the ethical implications of artificial intelligence. It’s possible that Musk wasn’t aware of this when choosing the name for his new AI training system, and that it was chosen purely to emphasize the sheer size of the supercluster. Then again, given Musk’s track record, it wouldn’t be surprising if the reference were entirely intentional, and if he knew exactly what he was doing.