Google Announces Cloud TPU Virtual Machines for AI Workloads, Designed to Run Machine Learning Models with AI Services in the Google Cloud

“Early last year, TPU VMs in Google Cloud were introduced to make the Tensor Processing Unit (TPU) easier to use by providing direct access to TPU host machines. Today, we are pleased to announce the general availability (GA) of TPU VMs,” said Google. With Cloud TPU VMs generally available, users no longer have to access Cloud TPUs remotely.

Machine learning has enabled successes in business and research, from network security to medical diagnostics. Google created the TPU to allow anyone to make similar advances. TPU is the custom machine learning ASIC that powers Google products such as Translate, Photos, Search, Assistant, and Gmail.

Cloud TPU is designed to run cutting-edge machine learning models with Google Cloud AI services. Its custom high-speed network delivers more than 100 petaflops of performance in a single pod, enough computing power to transform a business or produce the next research breakthrough.

“Direct access to TPU virtual machines has completely changed what we do with TPUs and dramatically improved the developer experience and model performance,” said Aidan Gomez, co-founder and CEO of Cohere.

With Cloud TPU VMs, it is possible to work interactively on the same hosts to which the physical TPU hardware is attached. Our growing TPU user community has enthusiastically embraced this access mechanism, as it not only allows for a better debugging experience, but also enables certain training configurations, such as distributed reinforcement learning, which cannot be achieved with the TPU Node (network-accessed) architecture.

What’s new in the GA version

Cloud TPUs are now optimized for large recommendation workloads.

  • TPUs can offer faster training speeds and significantly lower training costs than CPUs for recommender system models;
  • TensorFlow for Cloud TPUs provides a powerful API for handling large embedding tables and fast lookups;
  • With a TPU v3-32 slice, Snap was able to achieve ~3x higher throughput (-67.3% throughput on the A100) at 52.1% lower cost compared to an equivalent A100 configuration (~4.65x perf/TCO).
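The two throughput figures in the Snap example describe the same gap from opposite directions: if the A100 configuration delivers 67.3% less throughput than the TPU slice, the TPU slice delivers roughly three times the A100's throughput. A minimal sketch of that conversion, where the function name is hypothetical and chosen only for illustration:

```python
def speedup_from_relative_drop(drop_fraction: float) -> float:
    """Convert "B has `drop_fraction` less throughput than A" into A's
    throughput expressed as a multiple of B's."""
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    return 1.0 / (1.0 - drop_fraction)


# Figure quoted in the article: A100 throughput is 67.3% lower.
tpu_vs_a100 = speedup_from_relative_drop(0.673)
print(f"{tpu_vs_a100:.2f}x")  # prints "3.06x"
```

This only relates the two throughput numbers to each other. It does not reproduce the quoted perf/TCO figure, which depends on cost inputs the article does not detail.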

Rankings and recommendations

With the GA release of TPU VMs, Google introduced the new TPU Embedding API, which can speed up ML-based ranking and recommendation workloads. Many businesses today are built around ranking and recommendation use cases, such as audio/video recommendations, product recommendations (apps, e-commerce), and ad ranking. These companies rely on ranking and recommendation algorithms to serve their users and achieve their business goals.

Over the past few years, these algorithms have evolved from purely statistical approaches to approaches based on deep neural networks. The modern algorithms offer better scalability and greater accuracy, but they come at a cost: they tend to use large amounts of data and can be difficult and costly to train and deploy on traditional ML infrastructure.

Embedding acceleration with Cloud TPU can solve this problem at a lower cost. The Embedding API can efficiently handle large amounts of data, such as embedding tables, by automatically distributing the data across hundreds of Cloud TPU chips in one pod, all connected to each other via the custom interconnect. To help users get started, Google is publishing TF2 ranking and recommendation APIs as part of the TensorFlow Recommenders library.
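To make the idea of distributing a large table across many chips more concrete, here is a minimal plain-Python sketch of row sharding. It is not the TPU Embedding API and does not reflect how Google implements it. The shard count, the modulo placement rule and all names are assumptions chosen for illustration.

```python
from typing import Dict, List

NUM_SHARDS = 4      # hypothetical stand-in for the number of TPU chips
EMBEDDING_DIM = 3   # hypothetical embedding width

Shard = Dict[int, List[float]]


def build_sharded_table(vocab_size: int) -> List[Shard]:
    """Place row `i` on shard `i % NUM_SHARDS` (assumed placement rule)."""
    shards: List[Shard] = [{} for _ in range(NUM_SHARDS)]
    for row_id in range(vocab_size):
        # Deterministic dummy vector so the example is reproducible.
        shards[row_id % NUM_SHARDS][row_id] = [float(row_id)] * EMBEDDING_DIM
    return shards


def lookup(shards: List[Shard], row_ids: List[int]) -> List[List[float]]:
    """Fetch each requested row from the shard that owns it."""
    return [shards[r % NUM_SHARDS][r] for r in row_ids]


table = build_sharded_table(vocab_size=10)
vectors = lookup(table, [7, 2, 9])
shard_sizes = [len(shard) for shard in table]
print(shard_sizes)  # prints [3, 3, 2, 2]
```

Each shard holds only a fraction of the rows, so no single device needs enough memory for the whole table. In a real system the lookups would also be routed over the interconnect rather than through a local list.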

Framework support

The GA version of TPU VMs supports three main frameworks (TensorFlow, PyTorch and JAX), each now offered through its own optimized environment for easy setup. The GA version has been validated with TensorFlow v2-tf-stable, PyTorch/XLA v1.11 and JAX [0.3.6].

Specific characteristics of TPU VMs

TPU VMs provide some additional capabilities over the TPU Node architecture thanks to their local execution setup, i.e., the TPU hardware is attached to the same host on which users run their training workload(s).

Local execution of the input pipeline

The input data pipeline runs directly on the TPU hosts. This saves the valuable computing resources that were previously used as instance groups for distributed PyTorch/JAX training. In the case of TensorFlow, the distributed training setup requires only one user VM, and the data pipeline runs directly on the TPU hosts. Google's announcement includes a study comparing the cost of training a Transformer (FairSeq; PyTorch/XLA) for 10 epochs on the TPU VM architecture versus the TPU Node (network-attached TPU) architecture.

“Over the past two years, Kakao Brain has developed many groundbreaking AI services and models, including minDALL-E, KoGPT, and more recently, RQ-Transformer. We’ve been using the TPU VM architecture since it launched in Google Cloud, and we’ve seen significant performance improvements compared to the original TPU Node configuration. We are excited about the new features added to the generally available version of the TPU VM, such as the Embeddings API,” said Kim Il-doo, CEO of Kakao Brain.

Distributed Reinforcement Learning using TPU VMs

Running locally on the host that has the accelerator attached also enables use cases such as distributed reinforcement learning. Landmark work in this field, such as seed-RL, IMPALA, and Podracer, was built using Cloud TPUs.

“…, we argue that the computational requirements of large-scale reinforcement learning systems are particularly suited to the use of Cloud TPUs, and more specifically TPU Pods: special setups in a Google data center featuring multiple TPU devices interconnected by low-latency communication channels,” says DeepMind’s Podracer paper.

Support for custom operations for TensorFlow

With direct execution on the TPU VM, users can now build their own custom operations, such as those in TensorFlow Text. With this feature, users are no longer tied to TensorFlow runtime versions.

Source: Google
