Hyperscaler AI Custom Chips (ASIC), CPUs, and Networking: Breaking Down the Supply Chain

May 16, 2024

The AI ASIC Ecosystem

Hyperscalers are increasingly developing their own AI Application-Specific Integrated Circuits (ASICs) to reduce reliance on third-party vendors like NVIDIA and to optimize for their specific workloads. The push is driven by the need for technological independence, enhanced performance, cost reduction, and energy efficiency. Companies like Google, Microsoft, Amazon, Meta, and ByteDance are investing heavily in AI ASIC development to meet their scaling needs and drive future innovations.

Pros of AI ASICs

  • Optimized Performance: Tailored for specific tasks, offering improved efficiency.
  • Cost Reduction: Lower operational costs by avoiding third-party hardware.
  • Energy Efficiency: Custom ASICs consume less power, crucial for large data centers.
  • Enhanced Security: Direct control over hardware enhances data security and privacy.

Cons of AI ASICs

  • High Initial Costs: Significant investment in design and manufacturing.
  • Inflexibility: Hard to reprogram or adapt to new technologies.
  • Long Development Time: Delays from design to deployment can be substantial.
  • Obsolescence Risk: Rapid technological advances can quickly make custom ASICs outdated.
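The core economic trade-off above — large up-front design costs against lower per-unit costs — can be sketched as a simple break-even calculation. All figures below are hypothetical, chosen only to illustrate the shape of the decision, not any vendor's actual pricing:

```python
# Illustrative break-even model for a custom AI ASIC program.
# All dollar figures are hypothetical, not real vendor pricing.

def breakeven_units(nre_cost, asic_unit_cost, merchant_unit_cost):
    """Smallest number of chips at which total ASIC spend (one-time
    NRE plus per-unit cost) drops below buying equivalent merchant
    accelerators."""
    savings_per_unit = merchant_unit_cost - asic_unit_cost
    if savings_per_unit <= 0:
        raise ValueError("ASIC must be cheaper per unit to break even")
    # Ceiling division: units needed to amortize the NRE investment.
    return -(-nre_cost // savings_per_unit)

# Hypothetical: $500M design/NRE cost, $5k per custom ASIC,
# $25k per third-party accelerator.
units = breakeven_units(500_000_000, 5_000, 25_000)
print(units)  # 25000 chips to recoup the up-front investment
```

Only deployments operating at hyperscale clear this kind of volume threshold, which is why custom silicon remains limited to the largest cloud providers.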

Google's TPU (AI ASIC)

Google's Tensor Processing Units (TPUs) are the cornerstone of their AI strategy. These specialized processors are crucial for running and training complex machine learning models, including Google's Gemini model, and are integral to the services offered through Google Cloud Platform (GCP). Google's TPUs have evolved from early designs focused on integer inference to later generations, such as TPU v4, that support floating-point training and deliver exascale-class performance in pod configurations.

Suppliers: Google's TPU AI ASIC processor family has been co-designed and supplied with Broadcom across multiple generations at 7nm, 5nm, and 3nm. For networking, Google also uses Marvell's PAM DSPs and Astera Labs' PCIe retimers, enhancing high-speed data transfer and network reliability.

Microsoft's Maia (AI ASIC) for Azure's AI Infrastructure

Microsoft's Maia AI Accelerator is a cornerstone of Azure's AI infrastructure, designed to enhance cloud-based AI services. Fabricated on TSMC's 5nm process and designed with support from Marvell, Maia 100 is tailored to handle demanding AI tasks efficiently. It features advanced cooling solutions and custom power distribution systems to meet the intensive power and thermal management needs of AI computations.

Suppliers: Microsoft's Maia AI ASIC processor family is designed in partnership with Marvell. Additionally, Microsoft's networking infrastructure benefits from Marvell's PAM DSPs, Broadcom's PCIe switching, and Astera Labs' PCIe retimer solutions.

Amazon's Trainium and Inferentia (AI ASIC)

Amazon began its journey into custom AI ASIC development with the creation of the Trainium and Inferentia series. Trainium chips excel at training deep learning models efficiently and cost-effectively, while Inferentia chips are optimized for inference tasks. However, Amazon faces challenges transitioning AI workloads from NVIDIA's CUDA platform, requiring significant software development and ecosystem support.

Suppliers: Amazon collaborates with Marvell on its Trainium and Inferentia AI ASIC processor families and employs Marvell's PAM DSPs, Broadcom's PCIe switching, and Astera Labs' PCIe retimers for networking needs.

Big 3 Cloud CPUs: ARM Architecture

The Big 3 cloud providers—AWS, Azure, and Google Cloud (GCP)—collectively utilize an estimated 50.4 million CPUs in their infrastructure, with a rapidly growing share based on Arm architecture. This shift is driven by the cost efficiency and power consumption benefits of Arm processors compared to traditional Intel or AMD x86 CPUs. Estimates suggest that Arm-based processors can offer significant cost advantages at fleet scale, making them an attractive option for these cloud giants.
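Even a modest per-core price advantage compounds dramatically across fleets of this size. The sketch below uses purely hypothetical hourly rates (not actual AWS, Azure, or GCP list prices) to show why a roughly 25% per-core discount matters at scale:

```python
# Fleet-scale cost comparison with hypothetical per-core-hour prices.
# The rates below are illustrative only, not real cloud list prices.

def annual_fleet_cost(cores, price_per_core_hour, hours_per_year=8760):
    """Total yearly cost of running a fleet at a flat per-core rate."""
    return cores * price_per_core_hour * hours_per_year

x86_cost = annual_fleet_cost(1_000_000, 0.04)  # hypothetical x86 rate
arm_cost = annual_fleet_cost(1_000_000, 0.03)  # hypothetical Arm rate, ~25% lower

savings = x86_cost - arm_cost
print(f"${savings:,.0f}")  # prints "$87,600,000" in annual savings on 1M cores
```

At tens of millions of cores, the same arithmetic scales into billions of dollars per year, which is the economic logic behind chips like AWS Graviton.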

Meta's MTIA (AI ASIC) for AI Workloads

Meta is enhancing its processing capabilities through custom AI ASIC processors (the Meta Training and Inference Accelerator, or MTIA), utilizing Broadcom's advanced semiconductor technologies. The partnership with Broadcom, underscored by Broadcom's CEO joining Meta's board, aims to advance Meta's technological infrastructure. Additionally, Meta leverages Marvell's PAM DSPs, Broadcom's PCIe switching, and Astera Labs' PCIe retimer technologies for its networking needs.

ByteDance's AI Video/AI Networking (AI ASIC) for Video Processing

ByteDance is developing custom ASICs for AI video and networking using Broadcom's 5nm and 3nm technologies. This initiative aims to enhance platforms like TikTok by improving video processing capabilities, reducing dependency on external suppliers, and giving ByteDance greater control over its hardware ecosystem. By optimizing processing speed and power consumption, ByteDance aims to boost user experience and operational efficiency on its data-heavy platforms.

More Competition Ahead

Hyperscalers are driving innovation across data centers, reducing reliance on third-party vendors, and optimizing their infrastructure for specific workloads. This shift towards custom silicon is shaping the future of AI and cloud computing, with significant implications for performance, efficiency, and cost-effectiveness.

So how can you keep up?

Fuel Your AI Infrastructure Growth with Vertical Data

Vertical Data, a leading independent distributor of data center infrastructure solutions, including NVIDIA GPUs, can help you build and scale competitive data center infrastructure.

Why Choose Vertical Data?

  • Rapid Access to Cutting-Edge Hardware: Source hard-to-find equipment, including the latest NVIDIA GPUs, in days, not months.
  • Streamlined Procurement: Minimize red tape and accelerate your infrastructure deployment.
  • Financial Flexibility: Innovative financing solutions to bridge the compute demand-supply gap.
  • Unparalleled Expertise: Decades of experience and a deep understanding of global markets.
  • World-Class Support: Unrivaled customer service and technical expertise.

Partner with Vertical Data to power your compute infrastructure and unlock the full potential of your data center.