Today, Satya Sahu apologises for the blatant pun in the post title before using CUDA as a case study to examine how inextricably software development ecosystems shape the success of AI accelerators like GPUs.
Also, we are hiring for the role of Research Analyst (Pakistan Studies). If you are keen on this opportunity, apply here!
Cyberpolitik: The Case of CUDA’s Missing Competitors
— Satya Sahu
**This is an excerpt from an upcoming Takshashila Discussion Document where we demystify AI Hardware. As it turns out, AI Hardware cannot be easily divorced from the AI software stack.**
Nvidia’s Software Moat is Comfortably Fortified (credit)
The development and proliferation of specialised hardware have significantly accelerated the evolution of AI as a scientific field. However, their success and adoption in the market are not solely determined by the raw processing power of the accelerators. Much like how consumer PCs and smartphones are dependent on their operating systems, the performance of AI processing units can only be realised through the software ecosystems that support their deployment. Software frameworks, libraries, and programming languages harness the processing capabilities of accelerators and provide high-level APIs (Application Programming Interfaces) that simplify the development process for models and applications that run on them.
As such, whether an AI accelerator achieves widespread success and adoption in the industry depends heavily on the maturity and ease of use of compatible software ecosystems. The AI software ecosystem can be broadly broken down into AI frameworks, programming languages, and programming platforms.
The term “AI frameworks” generally refers to the pre-made tools and libraries that developers use to create, train, and test AI models. Frameworks relieve developers of the need to manage the hardware’s low-level operations (such as memory management) and are usually hardware-agnostic, meaning they can run on CPUs and GPUs as well as other specialised accelerators.
That said, many frameworks optimise for specific chip architectures. Prominent examples of AI frameworks include TensorFlow, PyTorch, and MXNet. TensorFlow is heavily optimised for Google’s TPUs, allowing it to excel at matrix multiplication algorithms that are fundamental to many AI workloads. AMD/Xilinx’s Vitis AI is designed to simplify the deployment of AI inference workloads on Xilinx FPGAs.
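To make hardware-agnosticism concrete, here is a minimal sketch in PyTorch (our own illustration, assuming a standard PyTorch install): the same model code runs on a CPU or an Nvidia GPU, and only the device selection changes.

```python
import torch

# Pick whichever accelerator the framework can see; fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)  # a toy single-layer model
x = torch.randn(8, 4, device=device)      # a batch of 8 random inputs
y = model(x)                              # the same call on any backend
print(y.shape, y.device)
```

The framework, not the developer, decides how the underlying operations are scheduled on the chosen device.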
To illustrate this further, an AI research project might prefer PyTorch due to its popularity in the academic community and support for a wide range of Nvidia’s GPUs (that have been a mainstay in the scientific computing community); however, a production-focused project might choose TensorFlow for its robust integration with Google’s cloud services.
The framework chosen for an AI workload, therefore, influences the choice of cloud service platforms as well as the underlying choice of accelerators being offered by the cloud platform.
Programming languages (such as Python and Julia) used in AI development serve as the interface between developers and AI frameworks. Python, in particular, has become the de facto standard for AI developers: it is easy to learn and has a mature, extensive ecosystem of libraries for scientific computing.
As mentioned earlier, high-level languages and frameworks aim to be hardware-agnostic, but developers often rely on lower-level, accelerator-specific features to achieve optimal performance. This is where a bespoke and comprehensive software development platform can come into play. Parallel computing platforms and programming models like Nvidia’s proprietary CUDA (Compute Unified Device Architecture) are a prime example.
CUDA encapsulates a suite of software tools, libraries, and APIs specifically designed for Nvidia GPUs. It provides a familiar programming interface to developers using common languages like C, C++, and Fortran, and allows them to write code that can directly access the massive parallelism afforded by the GPU to greatly speed up general-purpose computing tasks.
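For a flavour of the programming model, here is a minimal sketch of a CUDA kernel written with Numba’s Python bindings for CUDA rather than in C or C++ (the structure is analogous). It assumes a machine with an Nvidia GPU and the CUDA toolkit installed; each GPU thread computes one element of the output in parallel.

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(a, b, out):
    i = cuda.grid(1)              # this thread's global index
    if i < out.shape[0]:          # guard threads past the end of the data
        out[i] = a[i] + b[i]      # each thread handles one element

n = 1_000_000
a = np.arange(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)

d_a = cuda.to_device(a)           # copy inputs to GPU memory
d_b = cuda.to_device(b)
d_out = cuda.device_array_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](d_a, d_b, d_out)  # launch on the GPU

result = d_out.copy_to_host()     # copy the result back to the CPU
```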
That said, AI developers typically interact with CUDA indirectly, through frameworks like TensorFlow or PyTorch. Since most popular AI frameworks are open-source, this creates an interesting situation: developers who adopt them for the benefits of open source (community contributions, transparency, and rapid innovation) must still rely on a proprietary hardware and software platform to achieve peak performance.
source: Nvidia
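This indirect reliance is easy to see in practice. In the minimal sketch below (assuming an Nvidia GPU and a CUDA build of PyTorch), a single matrix multiplication dispatches to Nvidia’s cuBLAS kernels without the developer writing any CUDA code.

```python
import torch

if torch.cuda.is_available():
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    c = a @ b                     # executed by Nvidia's cuBLAS on the GPU
    torch.cuda.synchronize()      # wait for the asynchronous GPU work

    print(torch.version.cuda)              # CUDA version PyTorch built against
    print(torch.backends.cudnn.version())  # cuDNN version in use
```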
The Curious Case of CUDA and Absent Competition
CUDA was developed to address the challenges of programming GPUs for general-purpose computing tasks. GPUs could, in principle, accelerate any heavily parallel workload (graphics rendering being the original example), but before CUDA, programming them required low-level coding skills and a deep understanding of the underlying chip architecture.
The Nvidia GeForce 8800 GTX: the first GPU with CUDA cores (source)
Nvidia tackled this problem in two ways. First, it introduced a GPU architecture composed of smaller programmable units, generally termed “shader units” in the industry. Second, it created the CUDA software development platform, which allowed coders to write programs for these units (now referred to as “CUDA cores”) on its GPUs.
The platform was designed to attract developers by advertising the massive parallel computing power of GPUs with few learning barriers, highlighting its similarities to common programming languages. In a nutshell, CUDA, as a software platform, is inextricably integrated with the silicon-level hardware architecture. This closed CUDA-GPU integration means that potential competitors cannot leverage the CUDA platform, as Nvidia’s hardware architecture IP remains proprietary.
CUDA itself is free to use, and Nvidia invested in optimising sub-platforms of CUDA for specific use cases in industry and research, such as robotics, machine learning, and data centres. The commonality afforded by the platform ensured that applications across a wide range of domains would be compatible with all Nvidia GPUs. Nvidia also invested heavily in training courses and outreach, ensuring that both academia and industry adopted its GPUs for their needs.
The CUDA ecosystem has therefore created two-sided network effects, with both developers (supply) and industry (demand) utilising the same GPUs and software platform. CUDA and Nvidia GPUs were adopted gradually by the scientific computing community at first, but they are now essential tools in the expanding fields of AI and deep learning. As the ecosystem has matured, it has created a virtuous feedback loop of innovation and adoption: more developers use CUDA, which leads to more optimised libraries and tools, which further reinforces Nvidia’s market position.
CUDA's exclusivity has been a key factor in Nvidia’s dominance in the AI hardware market. The closed integration of the software development ecosystem with the hardware has enabled Nvidia to charge supra-competitive prices for its GPUs across both gaming and enterprise sectors.
source: Statista
Whither Competition?
Several alternative software ecosystems to CUDA exist; however, these have struggled to match CUDA’s maturity and performance, stemming from Nvidia’s first-mover advantage and the network effects created by its large user base. The most prominent competitor to CUDA is AMD’s Radeon Open Compute (ROCm).
ROCm is a platform designed for use with AMD’s GPUs. Similar to CUDA, it provides developers with a suite of software tools and libraries. While relatively new and less comprehensively supported, ROCm has two key benefits: first, it includes an abstraction layer, HIP (Heterogeneous-Compute Interface for Portability), that allows developers to port CUDA applications into source code that runs on AMD GPUs with comparatively little effort (illustrated in the sketch below).
Second, its open-source nature potentially allows for long-term developer buy-in and crowdsourced additions to its range of libraries. These two factors offer a major value proposition for developers and organisations concerned about vendor lock-in. As of now, AMD’s GPUs, such as the Instinct MI300X, cost substantially less than Nvidia’s flagship offerings, such as the H100.
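To illustrate the portability HIP affords, here is a minimal sketch (the specific PyTorch detail is our example, not a claim from the excerpt): PyTorch’s ROCm builds are generated by translating its CUDA backend through HIP, so code written against the torch.cuda API runs unchanged on AMD GPUs.

```python
import torch

# On a ROCm build of PyTorch, this returns True even on an AMD GPU:
# HIP lets the CUDA-flavoured API run on AMD hardware.
print(torch.cuda.is_available())

# torch.version.hip is set on ROCm builds and None on CUDA builds,
# which is one way to tell the two apart at runtime.
print("HIP runtime:", getattr(torch.version, "hip", None))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(4, 4, device=device)  # identical code path on either vendor
```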
However, market trends suggest that the uptake of AMD GPUs has been driven primarily by general global demand for AI compute (which Nvidia’s production runs cannot fulfil) rather than by an objectively superior software development ecosystem. This remains the case despite evidence that AMD GPUs can offer better performance-per-watt on certain AI workloads.
Translation layers
Translation layers are software that allow code written for a particular hardware architecture to run on a different architecture. They essentially “translate” this code between disparate systems and, therefore, enable application compatibility across GPUs from different vendors.
As mentioned earlier, AMD’s HIP can be considered a translation layer, but it requires developers to do the work of porting CUDA applications and generating equivalent source code for AMD GPUs. A true translation layer, by contrast, allows a CUDA application to interface with a non-Nvidia GPU on the fly, as if it were an Nvidia one. The most prominent example is ZLUDA, which allowed first Intel and subsequently AMD GPU users to run CUDA applications natively, without developer intervention or source code generation.
Despite not providing 100% compatibility or performance, ZLUDA attracted developer interest, and AMD funded the open-source project until recently. The withdrawal of support has been linked to Nvidia’s reiteration of CUDA’s licensing terms, which prohibit its use for the development of ZLUDA-like translation layers. While Nvidia has taken no overt legal action, it is clear that the development of CUDA translation layers threatens its market position and lowers the value proposition of its GPUs on price-to-performance metrics.
[In an interesting turn of events, an hour after this post went out, AMD announced a strategic decision to unify its consumer (gaming) and enterprise (AI/server/HPC) GPU design architectures. The new architecture, termed “UDNA”, should enable AMD to do what CUDA did for Nvidia all those years ago: provide a single hardware architecture (UDNA) that a common programming platform (ROCm) can target across multiple domains. This should make it easier for developers across the consumer and enterprise sectors to program applications for AMD’s entire range of GPUs.]
CUDA has undoubtedly accelerated the adoption and innovation of AI. However, from a policy perspective, it is a case study that highlights the unsavoury implications of proprietary software ecosystems in the AI hardware market. Besides market concentration risks, vendor lock-in, and other competition barriers, nation-states seeking to build sovereign AI infrastructure using GPUs will have to contend with the strategic dependency associated with being reliant on a single provider like Nvidia.
Solving for this dependency should be a policy priority. India has substantial human capital across both the software and chip design sectors. This comparative advantage could be leveraged through state-backed initiatives that promote the development of open-source standards, interfaces, or ZLUDA-like translation layers, enabling application portability across different AI accelerator platforms.
If you like the newsletter, you will love to read our in-depth research and analysis at https://takshashila.org.in/high-tech-geopolitics.
Also, we are hiring! If you are passionate about working on emerging areas of contention at the intersection of technology and international relations, check out the Staff Research Analyst position with Takshashila’s High-Tech Geopolitics programme here.
What We're Reading (or Listening to)
[Takshashila Blog] Chinese Military Aircraft Violates Japanese Airspace: Potential Explanations, by Vanshika Saraf and Anushka Saxena
[Twitter/X Thread] A one-stop thread on India's semiconductor policy ahead of SemiconIndia, by Pranay Kotasthane
[News Article] Leading Tech Giants Join Hands For Bharat Space Collective, by the NDTV News Desk