Google Stands Up Exascale TPUv4 Pods In The Cloud

It’s Google I/O 2022 this week, among other things, and we were hoping for an in-depth architectural dive into the TPUv4 matrix math engines that Google mentioned at last year’s I/O event. But, alas, no such luck. The search engine and advertising giant, which also happens to be one of the biggest AI innovators on the planet because of the enormous amount of data it needs to make use of, did provide some additional information about the TPUv4 processors and the systems that use them.

Google also said it is installing eight pods of TPUv4 systems in its Mayes County, Oklahoma, data center, adding up to nearly 9 exaflops of aggregate computing capacity, for use by its Google Cloud arm, so that researchers and businesses have access to the same type and scale of computing capacity that Google uses for its own in-house AI development and production.

Google has operated data centers in Mayes County, northeast of Tulsa, since 2007 and has invested $4.4 billion in the facilities since then. The site sits roughly in the geographical center of the United States (well, somewhat south and west of it), which makes it useful because of the relatively short latencies to most of the country. And now, by definition, Mayes County has one of the largest assemblages of iron for driving AI workloads on the planet. (If all eight TPUv4 pods were linked together on a network and work could be scaled across them simultaneously, we could probably say “the largest” without qualification. Google certainly did, as you can see in the quote below.)

During his keynote address, Sundar Pichai, chief executive of Google and its parent company, Alphabet, mentioned that TPUv4 pods are in preview on its cloud.

[Image: Google I/O, Mayes County AI hub]

“All of the advances we’ve shared today are only possible through constant changes to our infrastructure,” Pichai said, referring to some pretty cool natural language and immersive search enhancements that Google has made and that feed into all kinds of applications. “Recently, we announced our intention to invest $9.5 billion in data centers and offices throughout the United States. One of our innovative data centers is in Mayes County, Oklahoma, and I’m excited to announce that we’ll be launching the largest publicly accessible machine learning hub there for all of our Google Cloud customers. This machine learning hub features eight Cloud TPU v4 pods, custom-built on the same network infrastructure that powers Google’s largest neural models. They deliver nearly 9 exaflops of aggregate computing power, giving our customers unprecedented ability to run complex models and workloads. We hope this will drive change in everything from medicine to logistics to maintenance and beyond.”

Pichai added that this TPUv4 pod-based AI hub already gets 90% of its power from sustainable, carbon-free sources. (He did not say how much of that is wind, solar, or hydro.)

Before we get into the feeds and speeds of the TPUv4 chips and pods, it’s probably worth pointing out that, for all we know, Google already has TPUv5 pods in its internal data centers, and it could have a much larger collection of TPUs to drive its own models and augment its own applications with AI algorithms and routines. That would be the old way Google did things: talk about generation N of something while it was selling generation N-1 and had already moved on to generation N+1 for its internal workloads.

This does not seem to be the case. According to a blog post written by Sachin Gupta, vice president and general manager of infrastructure at Google Cloud, and Max Sapozhnikov, product manager for Cloud TPUs, when the TPUv4 systems were built last year, Google gave early access to researchers at organizations including Cohere and LG AI Research. Moreover, Google’s own Pathways Language Model (PaLM) was developed and tested across two TPUv4 pods, each with 4,096 TPUv4 matrix math engines.

If Google’s shiniest new models are being built on TPUv4s, then it probably does not have a fleet of TPUv5s hiding in a data center somewhere. Although we will add, it would be neat if the TPUv5 machines were hidden 26.7 miles southwest of our office, in the Lenoir data center, shown here from our window:

[Image: Google’s Lenoir datacenter]

The strip of gray down the mountain, below the birch leaves, is the Google datacenter. If you squint and stare intently into the distance, the Apple datacenter in Maiden is off to the left and considerably farther down the line.

Enough of that. Let’s talk about some feeds and speeds. Here, finally, are some specifications that compare TPUv4 to TPUv3:

[Image: Google specifications table, TPUv4 versus TPUv3]

Last year, when Pichai hinted at TPUv4, we predicted that Google would move to a 7 nanometer process for this generation of TPU, but given the very low power consumption, it now seems likely that the chip is etched using a 5 nanometer process. (We assumed that Google was trying to keep the power envelope constant, and it clearly wanted to lower it.) We also guessed that it doubled the number of cores, from two cores in TPUv3 to four cores in TPUv4, which Google has neither confirmed nor denied.

Doubling performance by doubling the cores would get TPUv4 to 246 teraflops per chip, and going from 16 nanometers to 7 nanometers would allow that doubling within roughly the same power envelope at roughly the same clock speed. Going to 5 nanometers allows the chip to be smaller and run a little faster while also reducing power consumption, and a smaller chip potentially has a higher yield as the 5 nanometer process matures. The average power consumption went down by 22.7%, and that is against an 11.8% increase in clock speed, across the two process node jumps from TPUv3 to TPUv4.
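
To make the arithmetic in that paragraph concrete, here is a minimal back-of-the-envelope sketch in Python. It uses only the percentages quoted above plus our unconfirmed guess of two cores going to four, and it assumes that per-core, per-clock throughput did not change between generations, which Google has not stated. The variable names are ours.

```python
# Back-of-the-envelope estimate; inputs are the figures quoted in the article.
core_ratio = 4 / 2        # cores per chip, TPUv4 vs TPUv3 (our guess, unconfirmed by Google)
clock_ratio = 1 + 0.118   # 11.8% clock speed increase
power_ratio = 1 - 0.227   # 22.7% lower average power consumption

# Assumption: per-core, per-clock throughput is unchanged between generations.
flops_ratio = core_ratio * clock_ratio
perf_per_watt_ratio = flops_ratio / power_ratio

print(f"Estimated FLOPs per chip ratio: {flops_ratio:.2f}x")
print(f"Estimated performance per watt ratio: {perf_per_watt_ratio:.2f}x")
```

Under those assumptions the estimate lands a little above 2.2x for per-chip throughput, which lines up with the roughly 2.2x figure in Google's blog post, and at nearly 2.9x for performance per watt.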

There are some interesting things in this table and in the statements that Google makes in its blog post.

Aside from the 2X increase in cores and the slight increase in clock speed enabled by the chip making process for TPUv4, it is worth noting that Google kept the memory capacity at 32GB and did not move to the HBM3 memory that Nvidia uses in its “Hopper” GH100 GPU accelerators. Nvidia is obsessed with memory bandwidth on its devices and, by extension with NVLink and NVSwitch, memory bandwidth within nodes and now across nodes, with a maximum of 256 devices in a single image.

Google is less concerned (as far as we know) about memory atomics on the proprietary TPU interconnect, device memory bandwidth, or device memory capacity. The TPUv4 has the same 32GB capacity as the TPUv3, it uses the same HBM2 memory, and it has only a 33% increase in memory bandwidth, to just under 1.2TB/sec. What Google is interested in is the bandwidth of the TPU pod interconnect, which is moving to a 3D torus design that tightly couples 64 TPUv4 chips with “wraparound connections,” something that is not possible with the 2D torus interconnect used in TPUv3 pods. Increasing the dimension of the torus interconnect allows more TPUs to be pulled into a tighter subnetwork for collective operations. (Which begs the question: why not a 4D, 5D, or 6D torus, then?)
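
To illustrate what wraparound connections buy in a 3D torus, here is a small Python sketch. It assumes the 64 chips are arranged as a 4 x 4 x 4 cube, which is our assumption for illustration and not something stated above, and the function names are hypothetical. It does not model Google's actual routing or link speeds.

```python
from itertools import product

DIMS = (4, 4, 4)  # assumed arrangement of the 64 chips; not confirmed by the article


def torus_neighbors(coord, dims=DIMS):
    """Return the six directly linked neighbors of a chip in a 3D torus."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            nxt = list(coord)
            # The modulo is the wraparound link: the edge of a ring connects back to its start.
            nxt[axis] = (coord[axis] + step) % size
            neighbors.append(tuple(nxt))
    return neighbors


def torus_hops(a, b, dims=DIMS):
    """Minimum hop count between two chips, going the shorter way around each ring."""
    return sum(min(abs(x - y), size - abs(x - y)) for x, y, size in zip(a, b, dims))


def mesh_hops(a, b):
    """Hop count on the same grid without wraparound links, for comparison."""
    return sum(abs(x - y) for x, y in zip(a, b))


chips = list(product(*(range(size) for size in DIMS)))
corner = (0, 0, 0)

print(len(chips), "chips")
print("neighbors of corner:", sorted(torus_neighbors(corner)))
print("worst case hops with wraparound:", max(torus_hops(corner, c) for c in chips))
print("worst case hops without wraparound:", max(mesh_hops(corner, c) for c in chips))
```

In this toy model every chip, including one sitting on a corner, has six neighbors, and the worst case distance from a corner drops from 9 hops to 6 once the wraparound links are present. Adding a dimension shortens those worst case paths further for the same chip count, which is the intuition behind the question about higher-dimensional tori.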

The TPUv4 pod has 4 times the TPU chips, at 4,096, and twice the TPU cores per chip, which we estimate puts the pod at 16,384 cores. We believe that Google kept the number of MXU matrix math units at two per core, but that’s just a hunch. Google could have kept the same number of TPU cores and doubled the MXU units to get the same raw performance; the difference is the amount of front-end scalar/vector processing that needs to be done across those MXUs. Anyway, in the 16-bit BrainFloat (BF16) floating-point format created by the Google Brain unit, the TPUv4 pod delivers 1.1 exaflops, compared to just 126 petaflops at BF16 for the TPUv3 pod. That is a factor of 8.7x more raw compute, balanced against a 3.3x increase in all-reduce bandwidth across the pod and a 3.75x increase in bisection bandwidth across the TPUv4 interconnect in the pod.
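
The pod-level numbers can be cross-checked with a few lines of Python. This is only a sanity check on the figures quoted in this article (1.1 exaflops and 126 petaflops at BF16, 4,096 chips, four times the chip count, eight pods); the variable names are ours, and the per-chip values are implied by division rather than taken from a Google spec sheet.

```python
# Figures quoted in the article; everything else is derived from them.
tpuv4_pod_flops = 1.1e18                 # 1.1 exaflops at BF16 per TPUv4 pod
tpuv3_pod_flops = 126e15                 # 126 petaflops at BF16 per TPUv3 pod
tpuv4_pod_chips = 4096
tpuv3_pod_chips = tpuv4_pod_chips // 4   # the TPUv4 pod has 4 times the chips
pods_in_hub = 8                          # pods in the Mayes County hub

pod_ratio = tpuv4_pod_flops / tpuv3_pod_flops
v4_chip_tflops = tpuv4_pod_flops / tpuv4_pod_chips / 1e12
v3_chip_tflops = tpuv3_pod_flops / tpuv3_pod_chips / 1e12
chip_ratio = v4_chip_tflops / v3_chip_tflops
hub_exaflops = pods_in_hub * tpuv4_pod_flops / 1e18

print(f"Pod compute ratio: {pod_ratio:.1f}x")
print(f"Implied TPUv3 chip: {v3_chip_tflops:.0f} teraflops")
print(f"Implied TPUv4 chip: {v4_chip_tflops:.0f} teraflops")
print(f"Implied per-chip ratio: {chip_ratio:.2f}x")
print(f"Eight pods: {hub_exaflops:.1f} exaflops")
```

The division reproduces the 8.7x pod-level factor, implies about 123 teraflops per TPUv3 chip (half of the 246 teraflops doubling scenario), and puts eight pods at 8.8 exaflops, which squares with the "nearly 9 exaflops" claim.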

This statement in the blog post intrigued us: “Each Cloud TPU v4 chip has ~2.2x higher FLOPs than Cloud TPU v3, for ~1.4x higher FLOPs per dollar.” If you do the math on that statement, it means the price of renting a TPU on Google Cloud has gone up by roughly 60% with TPUv4, but it does 2.2x the work. These price and performance increases are fully in line with the kind of price/performance improvement Google expects from the switch ASICs it buys for its data centers, which typically offer 2x the bandwidth for 1.3x to 1.5x the cost. TPUv4 is a bit more expensive, but it has a better network for running larger models, and that is worth something as well.
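
For anyone who wants to redo that math, here is a short Python sketch. It uses only the ratios quoted above and the identity that FLOPs per dollar is FLOPs divided by price; it says nothing about actual Google Cloud list prices, and the variable names are ours.

```python
# Ratios quoted from Google's blog post (TPUv4 relative to TPUv3).
flops_ratio = 2.2             # FLOPs per chip
flops_per_dollar_ratio = 1.4  # FLOPs per dollar

# flops_per_dollar = flops / price, so price_ratio = flops_ratio / flops_per_dollar_ratio
price_ratio = flops_ratio / flops_per_dollar_ratio
price_increase_pct = (price_ratio - 1) * 100

# Switch ASIC comparison from the article: 2x the bandwidth for 1.3x to 1.5x the cost.
switch_gain_low = 2 / 1.5     # bandwidth per dollar at the high end of the cost range
switch_gain_high = 2 / 1.3    # bandwidth per dollar at the low end of the cost range

print(f"Implied price ratio: {price_ratio:.2f}x")
print(f"Implied price increase: {price_increase_pct:.0f}%")
print(f"Switch ASIC gain per dollar: {switch_gain_low:.2f}x to {switch_gain_high:.2f}x")
```

The implied price ratio comes out at about 1.57x, so "roughly 60%" is a rounding of something closer to 57%, and the 1.4x improvement in FLOPs per dollar sits inside the 1.33x to 1.54x range implied by the switch ASIC comparison.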

TPUv4 pods can be carved into virtual machines on Google Cloud that range from as few as four chips up to “thousands of chips,” and we assume that means an entire pod.
