Nvidia GPUs from scratch: the most complete article to gain proficiency.

Nvidia sounds familiar to many people because of gaming. They know that Nvidia is the company that designs the graphics cards in their computers, and that thanks to these devices they get an extraordinary experience when playing videogames. Actually, there is a universal rule for this experience: the more expensive the graphics card, the better the experience. However, understanding what happens behind the scenes, or in other words, why graphics cards achieve effects so realistic that you cannot tell whether you are watching a videogame or the latest Tom Cruise sci-fi movie, is not an easy task. In the following paragraphs I will explain the architectural evolution from the old Nvidia cards to the recent releases. Thus, if you have some basic computer architecture and programming notions, and you want to find out how a GPU works, please continue reading.

My first suggestion to start this 10-minute talk is: please, do not use the term graphics card anymore. We are talking about Graphics Processing Units, aka GPUs, a special architecture originally created to help the CPU accelerate the computation of graphics. This is why you may also hear them called accelerators. In the beginning, these graphics units were very simple, but they demonstrated huge computational power when computing graphics. They had a parallel design that did much more work per second than the processors of that time. Crystal clear. Why not use GPUs, then, to also compute algorithms and floating-point operations rather than only graphics? Eureka! From the moment someone had that idea, Nvidia clearly started to specialize its architectures for two objectives: one design line for (high-performance) computing and another line for rendering graphics. So if you take a look at the TOP500 list of the most powerful supercomputers in the world, you will see that the fastest supercomputers are built with GPUs. Otherwise, it would be impossible to reach the TFLOPS rates we achieve nowadays. Cool, isn't it? Let's start diving into the deeper details needed to understand the hardware architecture.

As you can guess from the title of the article, I am going to focus on the computing line, which is my research field, but these concepts are also essential if you want to really understand the rendering side too.

What is CUDA?

You have likely heard about CUDA. Let me introduce the reasons why CUDA exists: GPUs are parallel processors designed to accelerate portions of a program, not to replace CPU computing. The main program is executed on the CPU, but some code fragments, called kernels, are executed on the GPU. How can we divide this execution? A wild CUDA appeared here! CUDA allows programming Nvidia GPUs using an extension of the C language.

In the case of graphics, we would talk about OpenGL or DirectX instead. Please do not confuse CUDA with OpenCL, which implements the “same” thing as CUDA but as an open standard. Note the quotes ;)

What does this execution model look like?

From the CPU, the user specifies how the code to be accelerated is mapped to the logical thread hierarchy of the GPU. Internally, CUDA handles the execution of the program by scheduling and processing the logical threads over the GPU physical cores. A typical processing flow in a CUDA program is as follows:

1. Allocate space in the device memory.
2. Copy data from CPU memory to GPU memory.
3. Invoke kernels from the CPU to perform the computation on the GPU.
4. Copy the results back from GPU memory to CPU memory.
5. Release the memory space in the GPU.
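These five steps can be sketched as a minimal CUDA C program. This is a toy vector addition under my own naming; error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Kernel: each logical thread adds one element (more on this below).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[1024], h_b[1024], h_c[1024];
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2 * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);                              // 1. allocate device memory
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // 2. copy host -> device
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   // 3. invoke the kernel
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // 4. copy device -> host
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);          // 5. release device memory

    printf("c[10] = %f\n", h_c[10]);                      // 10 + 20 = 30
    return 0;
}
```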

As can be observed, the CPU and the GPU have separate memories. One of the most important features of CUDA is its memory hierarchy: the device has different memory types, depending on the purpose.

When a kernel is invoked from the host, the execution is performed on the device, where a large number of logical threads are created and organized following the user's indications. As in other programming languages, each logical thread is in charge of one task. Specifically, all threads will execute the same instructions of the code, but the idea is that each thread will load a different portion of the data to feed those instructions. For example, imagine the dummy algorithm to be parallelized is something like x[i] = array1[i] + array2[i], where the arrays have size 5. If you send this instruction to 5 CUDA threads, each CUDA thread will compute the operation on its corresponding data in parallel. This is important because, ideally, any O(N) loop that is a candidate for parallelization may be reduced to O(1).

These threads follow a two-level thread hierarchy abstraction: blocks of threads and grids of blocks. In other words, instead of asking for thread 130, CUDA will ask for thread 2 of block 1, where each block is composed of 128 threads. This is very important, since it determines how the GPU resources are distributed and how the GPU memory system is accessed, which has a bearing on performance.
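That mapping is exactly the arithmetic every kernel performs with the built-in variables to recover its global index. A hypothetical kernel, shown only for the index math:

```cuda
__global__ void whoAmI(int *globalIds) {
    // Two-level hierarchy: which block am I in, and which thread within it?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // With blockDim.x == 128, global thread 130 is thread 2 of block 1,
    // since 130 = 1 * 128 + 2.
    globalIds[i] = i;
}
// Launch example: whoAmI<<<4, 128>>>(d_ids);  // grid of 4 blocks of 128 threads
```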

Tell me something about the hardware architecture.

Here we go.

The GPU architecture is built as an array of Streaming Multiprocessors (SMs). Parallelism is achieved by the replication of these SMs. Each SM is mainly composed of many CUDA cores for single and double precision, also known as Streaming Processors (SPs), a shared memory, an L1 cache, a register file, load/store units, special function units (SFUs), warp schedulers and memory controllers.

Each SM is designed to execute hundreds of threads concurrently. When a kernel is invoked, the blocks of the grid are distributed among the available SMs for execution, and an SM can hold several blocks concurrently. Once dispatched, their threads execute concurrently on that assigned SM only. Each block groups its threads into warps, sets of 32 threads that execute instructions in lockstep; i.e., all threads in a warp execute the same instruction at the same time. Thus, each SM partitions its blocks into warps, and these warps are scheduled for execution on the available SM resources.

There are two types of instruction latency: arithmetic instruction latency (around 10–20 cycles) and memory instruction latency (between 400 and 800 cycles for global memory accesses). Switching among active warps allows the SM to hide this latency: while one warp waits for its data, another warp executes.

Extra bonus for proficiency. It should be observed that the term thread can be confusing: while all threads in a block run logically in parallel, not all of them can execute physically at the same time. The GPU programming model executes Single Instruction Multiple Thread (SIMT) operations by mapping a number of logical threads onto each physical core. The SIMT architecture is similar to the Single Instruction, Multiple Data (SIMD) architecture: both broadcast the same instruction to multiple execution units. However, SIMD requires that all elements in a vector execute together in a synchronous group, while SIMT allows the threads in a warp to execute independently. This allows different threads in the same warp to take different instruction paths (branch divergence). In case of divergence, CUDA disables some of the threads in the warp (using a mask) and executes the instructions of one path; then it disables the other threads of the warp and executes the instructions of the other path. If we ensure that all the threads of a warp follow the same path, we save cycles!
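A sketch of what that means in practice, with two hypothetical kernels of my own naming: the first condition splits every warp in half, so each warp pays for both branches; the second condition is uniform within each 32-thread warp, so no warp diverges:

```cuda
// Divergent: even and odd lanes of the SAME warp take different paths,
// so the warp executes both branches serially with lanes masked off.
__global__ void divergent(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) out[i] = 1.0f;   // even lanes active, odd lanes masked
    else            out[i] = 2.0f;   // odd lanes active, even lanes masked
}

// Warp-aligned: the condition is constant inside each 32-thread warp,
// so every warp takes a single path and no cycles are wasted on masks.
__global__ void warpAligned(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) out[i] = 1.0f;  // whole warp takes this path
    else                   out[i] = 2.0f;  // whole warp takes the other path
}
```

Both kernels produce a valid result; the second simply avoids paying twice for the branch.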

What do you say about the memory hierarchy?

Memory management and accesses are an important part of GPUs, with a particularly large impact on performance. CUDA exposes a low-latency but lower-capacity memory subsystem to optimize performance, composed of multiple levels of memory with different latencies, bandwidths and capacities.

Global memory is the largest, but highest-latency, memory on a GPU. It can be accessed by any thread, even after the kernel execution finishes. Global memory resides in the device memory, an off-chip, on-board DRAM. Registers are the fastest memory space. Each thread has its own set of private registers, and any variable declared in a kernel is generally stored in a register. Once a kernel completes its execution, a register value can no longer be accessed.

Furthermore, shared memory is a programmable on-chip memory with much higher bandwidth and much lower latency than global memory. Shared memory shares the lifetime of its thread block and serves as an inter-thread communication mechanism inside a block; thus, only the threads within a block can access this memory space. When a block finishes, its allocation of shared memory is released. Constant memory and texture memory are other memory spaces, optimized for broadcast reads and spatial locality respectively.
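As an illustration of block-scoped shared memory, here is a hypothetical kernel that reverses the elements handled by each block, staging them on chip; it assumes the data size is a multiple of 128:

```cuda
#define BLOCK 128

__global__ void reverseInBlock(float *data) {
    __shared__ float tile[BLOCK];            // visible to this block only
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];             // each thread loads one element
    __syncthreads();                         // wait until the whole tile is loaded
    data[i] = tile[BLOCK - 1 - threadIdx.x]; // read an element loaded by another
                                             // thread of the SAME block
}
```

The `__syncthreads()` barrier is what makes the inter-thread communication safe: without it, a thread could read a tile slot before its owner has written it.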

Extra bonus for proficiency. Currently, most HPC workloads are bound by memory bandwidth. Especially on GPUs, most applications tend to be limited by the global memory bandwidth, and certain conditions need to be met to achieve maximum performance when reading and writing data in this memory. The allocated CPU memory is pageable; i.e., the operating system can move the data allocated in this memory to different physical locations (virtual memory system). This enables us to use more memory than is physically available. If the GPU has to transfer data from/to this pageable host memory, a page-locked or pinned host buffer needs to be created to move the data safely: data are first moved from host memory to the pinned buffer, and then to the device memory. Pinned host memory can also be allocated directly, avoiding the initial staging transfer and achieving a notable speed-up. Zero-copy memory is pinned host memory that is mapped into the device address space, making it possible to access this memory from both host and device, performing data transfers across PCI-e on demand. To improve on the zero-copy behavior, CUDA 6.0 introduces Unified Memory to simplify memory management. Zero-copy memory is allocated in the host, so the kernel suffers the latency of the PCI-e transfers. Unified Memory, however, decouples the memory spaces from the host and the device, so data are transparently migrated on demand, improving locality and performance. This is possible thanks to the Unified Virtual Addressing (UVA) support, which provides a single virtual address space for all CPU and GPU memories, although UVA itself does not automatically migrate data; only Unified Memory does.
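The four host-allocation flavors described above map to four CUDA runtime calls; a minimal sketch (no kernel, just the allocations and their matching releases):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    size_t bytes = 1 << 20;

    // Pageable host memory: the driver stages transfers through an
    // internal pinned buffer.
    float *pageable = (float *)malloc(bytes);

    // Pinned host memory: page-locked at allocation, no staging copy.
    float *pinned;
    cudaMallocHost(&pinned, bytes);

    // Zero-copy: pinned AND mapped into the device address space;
    // kernels dereference it across PCI-e on demand.
    float *zerocopy;
    cudaHostAlloc(&zerocopy, bytes, cudaHostAllocMapped);

    // Unified Memory (CUDA 6.0+): a single pointer valid on host and
    // device; pages migrate transparently on demand thanks to UVA.
    float *managed;
    cudaMallocManaged(&managed, bytes);

    free(pageable);
    cudaFreeHost(pinned);
    cudaFreeHost(zerocopy);
    cudaFree(managed);
    return 0;
}
```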

Extra bonus for proficiency. Shared memory is faster than global memory because it is a low-latency on-chip memory. Shared memory is smaller and only reachable by the threads within the same block, but it offers much higher bandwidth. Shared memory is divided into 32 equally-sized modules called memory banks. These modules can be accessed simultaneously, and their width depends on the architecture. For example, if the 32 threads of a warp access 32 different banks simultaneously, the operations are serviced by one memory transaction. There are two different bank widths depending on the architecture: 4 bytes or 8 bytes. In the first case, successive 4-byte words are mapped to consecutive banks, and each bank has a bandwidth of 4 bytes per two clock cycles. In the second case, there are two address modes: 8-byte and 4-byte modes. In the 8-byte mode, successive 8-byte words are assigned to consecutive banks, and each bank has a bandwidth of 8 bytes per clock cycle.
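The classic way to see bank conflicts is the shared-memory matrix transpose. A sketch, assuming 4-byte banks, a square n×n matrix with n a multiple of 32, and 32×32 thread blocks:

```cuda
#define TILE 32

// During the transposed read, consecutive threads of a warp read
// consecutive ROWS of the tile. Without padding, those addresses are
// 32 floats apart and all hit the same 4-byte bank: a 32-way conflict.
// The extra padding column shifts each row into a different bank,
// making the column-wise read conflict-free.
__global__ void transposeTile(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];            // +1 avoids conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // transposed read
}
```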

GPUs are awesome! Is there something with more performance than one GPU?

Yes, two GPUs.

The success of HPC lies in being able to combine several nodes with low-latency interconnection networks to provide more and more performance. Let's see how to connect several GPUs thanks to Nvidia technology.

The efficiency of the execution depends on how the inter-GPU communication is designed. It is possible to distinguish two types of environments: a Multi-GPU environment is a single computing node composed of several GPUs, whereas a system consisting of several of these nodes connected through a low-latency network is called a Multi-Node environment. When the GPUs are spread across several nodes, Multi-Node communication is required.

CUDA presents a number of features to facilitate Multi-GPU programming. Kernels executed under 64-bit applications on modern devices can directly access the global memory of any GPU connected to the same PCI-e network using the CUDA peer-to-peer (P2P) API, avoiding communication via the host. This is possible because they share a common memory address space (UVA). Hence, data are copied between these devices asynchronously along the shortest PCI-e path, enabling communication-computation overlapping. Specifically, peer-to-peer accesses enable direct load and store operations within a kernel across GPUs. If the GPUs are not connected to the same PCI-e bus, peer-to-peer transactions are still possible, but staged through host memory rather than transferred directly across the PCI-e bus. Synchronization between devices can be performed by assigning a CUDA stream to each GPU.
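Enabling P2P access is a two-call check-then-enable pattern; a minimal sketch assuming a node with at least two GPUs (devices 0 and 1):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int canAccess = 0;
    // Can device 0 address device 1's global memory directly?
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // the flags argument must be 0
        // From now on, kernels on device 0 can dereference pointers
        // allocated on device 1 (loads/stores travel over PCI-e), and
        // cudaMemcpyPeerAsync copies along the shortest PCI-e path.
        printf("P2P enabled between devices 0 and 1\n");
    } else {
        printf("No direct path: transfers are staged through host memory\n");
    }
    return 0;
}
```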

In the case of Multi-Node programming, the communication is performed across a cluster composed of several computing nodes. In this case, the Message Passing Interface (MPI), a well-known standard and portable API, is employed. Using MPI, the contents of host memory can be transmitted directly by MPI functions. Instead of copying data from the device memory to host buffers and then calling the MPI API, MPI and CUDA can be combined, sending data directly from the GPU buffers. This CUDA support is called CUDA-aware MPI, enabling direct MPI communication between GPU global memories. Moreover, the GPUDirect RDMA technology enables low-latency transfers over an InfiniBand connection between GPUs in different nodes without host processor involvement, reducing CPU overhead and communication latency.
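With a CUDA-aware MPI build (for example, Open MPI compiled with CUDA support), the device pointer is passed straight to the MPI calls with no host staging buffers. A two-rank sketch under that assumption:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;                           // GPU global memory
    cudaMalloc(&d_buf, n * sizeof(float));

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   // device pointer!
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           // device pointer!
    // With GPUDirect RDMA, the NIC reads/writes device memory directly
    // over InfiniBand, without the host CPU touching the data.

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Without CUDA-aware support, the same program would need an explicit cudaMemcpy to a host buffer around each MPI call.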

If you have reached this line, you are undoubtedly interested in GPUs. The best way of practicing is coding and executing some samples on a GPU, but after reading this article you meet all the requirements for understanding why your C codes run faster with CUDA. I strongly recommend that you review the different features of each Nvidia architecture and continue researching this awesome world that accelerates blockchain technology, deep learning and compute-intensive algorithms.


CUDA Toolkit Guide

Nvidia A100 chip review

Tuning CUDA code for Ampere

Parallel algorithms on GPUs



PhD in Parallel algorithms, Data distribution and GPUs. Researcher at Berkeley Lab, California

Adrian PD