Meta's Next Generation AI Data Center Infrastructure

August 1, 2024

Meta is investing heavily in its AI infrastructure, focusing on custom data centers and processors optimized for AI workloads. This strategy includes custom AI inference chips, new data center designs, and one of the fastest AI supercomputers.

MTIA: Meta's Custom AI Inference Chip

Meta's latest AI inference chip, MTIA v2, has doubled the compute and memory bandwidth of its predecessor. Designed to efficiently handle ranking and recommendation models, MTIA enhances user experience across Meta's platforms. By controlling the entire stack from chip design to software, Meta achieves greater efficiency compared to commercially available GPUs. MTIA is now operational in Meta's data centers, serving models in production.

Next-Gen AI-Optimized Data Centers

Meta's new data center design supports liquid-cooled AI hardware and a high-performance AI network, connecting thousands of AI chips for large-scale training. These centers will be faster and more cost-effective to build, complementing other new hardware like Meta's MSVP ASIC for video workloads.

Research SuperCluster (RSC) AI Supercomputer

Meta's RSC, with 16,000 GPUs, ranks among the world's fastest AI supercomputers. It's built to train large AI models for AR tools, content understanding, translation, and more.

Custom Silicon Roadmap

Meta's custom silicon roadmap includes expanding the scope of MTIA to support generative AI workloads and developing a more sophisticated processor for AI training, comparable in role to Nvidia's H100 GPUs. By the end of 2024, Meta plans to have 350,000 Nvidia H100 GPUs in its data centers to support its AI ambitions.

Meta's Data Center Expansion Plans

Meta is expanding its data center infrastructure to meet the growing demands of its AI initiatives and platforms.

Current Data Center Landscape

Meta operates 21 data center campuses worldwide, with a total investment exceeding $16 billion. These facilities span over 40 million square feet and support services like Facebook, Instagram, and WhatsApp. Meta's data centers focus on efficiency and sustainability, utilizing 100% renewable energy and aiming for water-positive operations and net-zero emissions by 2030.

Future Expansion Plans

Meta plans to double its data center buildings from 80 to 160 by 2028 to accommodate the anticipated fourfold increase in AI compute needs over the next decade. New designs will optimize for AI workloads, implementing liquid cooling systems to manage the heat generated by high-density AI chips like MTIA.

Impact on Local Economies

Meta's data center expansions significantly impact local economies through job creation, economic diversification, and infrastructure development.

Job Creation

The new Montgomery, Alabama data center, for example, will create around 100 permanent operational jobs and over 1,000 construction jobs, benefiting local contractors, laborers, and suppliers.

Economic Diversification

Meta's investments promote a shift towards a knowledge-based economy, attracting further investments in technology and related sectors, enhancing the local economic landscape.

Infrastructure Development

State-of-the-art data centers often lead to local infrastructure improvements, including roads and utilities, benefiting residents and other businesses.

Community Engagement

Meta engages with local communities through funding for schools, nonprofits, and community projects, enhancing community resources and supporting local development initiatives.

Overview of Meta's Data Centers

Meta's data centers are pivotal to its AI initiatives, featuring advanced GPU infrastructure, custom hardware, and innovative cooling technologies.

Building and Size

Meta operates 21 data center campuses globally, planning to double the number of buildings to 160 by 2028. New centers are designed to optimize space for high-density GPU deployments.

GPU Infrastructure

Meta is investing in 350,000 NVIDIA H100 GPUs by the end of 2024, planning to scale up to 600,000 H100 equivalents. Recent clusters include two 24,576-GPU setups, supporting advanced AI models like Llama 3.
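The cluster figures above lend themselves to a quick sizing calculation. The sketch below is illustrative only: the cluster size and count come from the text, while the figure of 8 GPUs per server is an assumption made for the example and is not stated in this article.

```python
# Rough cluster sizing from the figures quoted in the text.
GPUS_PER_CLUSTER = 24_576  # from the article
NUM_CLUSTERS = 2           # from the article
GPUS_PER_SERVER = 8        # ASSUMPTION: hypothetical 8-GPU server chassis


def servers_needed(gpus: int, gpus_per_server: int) -> int:
    """Return the number of servers needed to host `gpus`, rounding up."""
    if gpus_per_server <= 0:
        raise ValueError("gpus_per_server must be positive")
    return -(-gpus // gpus_per_server)  # ceiling division on integers


total_gpus = GPUS_PER_CLUSTER * NUM_CLUSTERS
servers_per_cluster = servers_needed(GPUS_PER_CLUSTER, GPUS_PER_SERVER)

print(f"Total GPUs across both clusters: {total_gpus}")
print(f"Servers per cluster at {GPUS_PER_SERVER} GPUs each: {servers_per_cluster}")
```

Changing `GPUS_PER_SERVER` shows how strongly the server count depends on chassis density, which is one reason the hardware platform and cooling design matter at this scale.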

Custom Hardware

Meta uses Grand Teton, its in-house designed open GPU hardware platform, integrating power, control, compute, and fabric interfaces into a single chassis. The YV3 Sierra Point server platform features high-capacity E1.S SSDs for efficient data handling.

Cooling Technologies

Meta implements a hybrid cooling approach with air and liquid cooling systems, crucial for managing heat from high-density GPU configurations. The new design incorporates direct-to-chip liquid cooling for AI training servers.

Network Infrastructure

Meta's data centers feature advanced networking solutions, including 400 Gbps endpoints and various network fabrics, ensuring low latency and high bandwidth for AI workloads.
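To give a sense of what a 400 Gbps endpoint means in practice, the following sketch estimates the ideal time to move a block of data over one such link. The 1 TB payload is a hypothetical value chosen for illustration, and the calculation ignores protocol overhead, congestion, and storage limits, so real transfers would take longer.

```python
# Idealized transfer-time estimate for a single network endpoint.
LINK_GBPS = 400.0  # endpoint speed quoted in the article (gigabits per second)


def transfer_seconds(size_bytes: float, link_gbps: float) -> float:
    """Return the ideal time in seconds to send `size_bytes` at `link_gbps`."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9)


# ASSUMPTION: a hypothetical 1 TB (10**12 bytes) payload, e.g. a model checkpoint.
payload_bytes = 1e12
seconds = transfer_seconds(payload_bytes, LINK_GBPS)
print(f"Ideal transfer time at {LINK_GBPS:.0f} Gbps: {seconds:.1f} s")
```

At line rate, 400 Gbps corresponds to 50 GB per second, so the bandwidth of each endpoint directly bounds how quickly GPUs can exchange data during distributed training.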

Storage Solutions

Meta employs a Linux Filesystem in Userspace (FUSE) API backed by its Tectonic distributed storage solution, alongside a parallel network file system developed in collaboration with Hammerspace.

Ensuring Reliability and Efficiency in Meta's Large-Scale AI Clusters

Meta's strategy to ensure the reliability and efficiency of its large-scale AI clusters involves custom hardware, optimized networking, advanced cooling technologies, and a commitment to sustainability.

Custom Hardware and Architecture

Meta develops large GPU clusters, such as the recent 24,576-GPU configurations, built on the Grand Teton platform. The AI Research SuperCluster (RSC), featuring 16,000 NVIDIA A100 GPUs, accelerates AI research and development.

Optimized Networking Solutions

Meta employs advanced networking technologies, including RDMA over Converged Ethernet (RoCE) and NVIDIA Quantum-2 InfiniBand fabrics, interconnecting 400 Gbps endpoints to ensure low latency and high throughput.

Software and Frameworks

Meta continuously evolves PyTorch, its foundational AI framework, and develops benchmarking tools like ROCET and PARAM to maintain performance across distributed AI workloads.

Thermal and Energy Management

Meta uses a digital thermal simulator to optimize cooling strategies and achieve a Power Usage Effectiveness (PUE) of 1.09. The company also employs direct evaporative cooling and StatePoint Liquid Cooling (SPLC) systems.
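PUE is defined as total facility power divided by the power delivered to IT equipment, so a value of 1.09 means roughly 9% overhead for cooling, power distribution, and other non-IT loads. The sketch below works through that arithmetic. The 1,000 kW IT load is a hypothetical figure used only for illustration.

```python
# Power Usage Effectiveness (PUE) = total facility power / IT equipment power.


def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Return the PUE for the given facility and IT power draw."""
    if it_equipment_kw <= 0:
        raise ValueError("it_equipment_kw must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_kw(it_equipment_kw: float, pue_value: float) -> float:
    """Return the non-IT power (cooling, distribution, etc.) implied by a PUE."""
    if pue_value < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_equipment_kw * (pue_value - 1.0)


# ASSUMPTION: a hypothetical 1,000 kW IT load; 1.09 is the PUE quoted in the text.
it_load_kw = 1000.0
print(f"Overhead at PUE 1.09: {overhead_kw(it_load_kw, 1.09):.0f} kW")
print(f"PUE for 1,090 kW total: {pue(1090.0, it_load_kw):.2f}")
```

The closer PUE gets to 1.0, the smaller the share of energy spent on anything other than computing, which is why cooling design has such a direct effect on operating cost.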

Sustainability Practices

All of Meta's operational data centers are powered by 100% renewable energy, aiming for net-zero carbon emissions. Meta also implements strategies to reduce water usage by over 50% in some facilities.

Continuous Improvement and Innovation

Meta's infrastructure features a feedback mechanism for continuous monitoring and optimization. The company invests in R&D to explore new technologies and methodologies for future AI advancements.

Meta's comprehensive approach to its AI infrastructure combines cutting-edge hardware, optimized networking, advanced cooling technologies, and sustainability. By continuously innovating, Meta will be able to support the growing demands of its AI research and application development.

Partner with Vertical Data for Cutting-Edge Data Center Solutions

As Meta advances its ambitious AI infrastructure, efficient and scalable data center solutions are critical. Vertical Data offers comprehensive services from hardware procurement and high-density colocation to turn-key GPU as a Service (GPUaaS). Our expertise in global markets and technology infrastructure accelerates customer growth, bridging the AI adoption gap. Our testing lab ensures efficient deployment with full certification, while our role as a supplier of Nvidia and AMD products enables us to source best-in-class servers and hard-to-find components.

Vertical Data’s solutions include seamless GPUaaS deployment for large-scale computing, supported by infrastructure services for "out of the box" readiness. We offer financing and lease-back solutions to operationalize CAPEX, allowing faster growth. Our data center services handle current densities (45-72 kW per cabinet) and prepare for next-gen densities up to 170 kW per cabinet, with expertise in high-density configurations ensuring readiness for future infrastructure needs.

Partnering with Vertical Data provides unmatched service and capabilities in today's competitive market. Our "Service First" philosophy ensures effective solutions across sectors, including enterprise data centers, OEMs, system integrators, military, aerospace, and communications. With our resources, you can accelerate growth, optimize AI infrastructure, and increase revenue in the compute market. Schedule a call today to explore how Vertical Data can power your compute capabilities and transform your role in the data center ecosystem.