Comprehensive Guide to Getting Started with CUDA Programming
Introduction
Understanding the Basics of CUDA
GPU programming involves multiple components such as the CPU, GPU, memory, and video memory. It's crucial to define these concepts clearly:
- Host: Refers to the CPU and system memory.
- Device: Refers to the GPU and video memory.
Let’s start with a simple "Hello World" example in CUDA:
#include <stdio.h>

// An empty kernel: __global__ marks it as code that runs on the device.
__global__ void kernel(void) {
}

int main() {
    kernel<<<1, 1>>>();          // launch the kernel on the GPU
    printf("Hello world\n");     // ordinary host-side output
    return 0;
}
This example showcases a few important differences from a traditional C program:
- The __global__ qualifier before the function kernel tells the compiler that this function executes on the GPU (the device), even though it is called from the CPU (the host).
- The <<<1, 1>>> notation is the launch configuration: the first value is the number of blocks and the second is the number of threads per block, so here a single thread in a single block runs the kernel.
In this "Hello World" example, the GPU function does not accept any parameters or return values. In a more meaningful scenario, we usually expect input and output. Consider the following example that adds two integers:
#include <stdio.h>

__global__ void add(int a, int b, int *c) {
    *c = a + b;                  // runs on the device; c points to device memory
}

int main() {
    int c;
    int *dev_c;
    cudaMalloc((void**)&dev_c, sizeof(int));                       // allocate device memory
    add<<<1, 1>>>(2, 7, dev_c);                                    // compute the sum on the GPU
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);    // copy the result back
    printf("2+7=%d\n", c);
    cudaFree(dev_c);                                               // release device memory
    return 0;
}
In this example, the GPU performs the addition of two integers. Here are some key points to note:
- Memory allocation: cudaMalloc allocates device memory. Unlike the traditional malloc, it does not return the pointer itself; its return value is an error code indicating success or failure, so we pass the address of our pointer to receive the allocated device address (a minimal error-checking sketch follows this list).
- Function invocation: the line add<<<1, 1>>>(2, 7, dev_c) launches the GPU function to compute the sum.
- Data transfer: cudaMemcpy copies data between the host and the device; the last argument specifies the direction of the copy.
- Freeing memory: after use, the allocated device memory is released with cudaFree.
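Because these runtime calls report errors through their return value, each one can be checked directly. A minimal sketch (error handling is omitted from the other examples for brevity):

cudaError_t err = cudaMalloc((void**)&dev_c, sizeof(int));
if (err != cudaSuccess) {
    // cudaGetErrorString turns the error code into a readable message
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}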
Retrieving Device Information
To check the number of GPUs available on the current machine, use the following function:
int count;
cudaGetDeviceCount(&count);
You can also query the properties of each GPU:
cudaDeviceProp prop;
for (int i = 0; i < count; i++) {
    cudaGetDeviceProperties(&prop, i);
}
The cudaDeviceProp structure contains various attributes such as the device name, total global memory, compute capability, and other supported features.
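For example, continuing from the count obtained above, a short sketch that prints a few commonly used fields of cudaDeviceProp:

for (int i = 0; i < count; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device %d: %s\n", i, prop.name);
    printf("  Total global memory: %zu bytes\n", prop.totalGlobalMem);
    printf("  Compute capability:  %d.%d\n", prop.major, prop.minor);
    printf("  Multiprocessors:     %d\n", prop.multiProcessorCount);
}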
When multiple GPU devices are present (for instance, both integrated and dedicated graphics), certain operations may require a minimum compute capability. CUDA provides two APIs, cudaChooseDevice and cudaSetDevice, that make it easy to pick a suitable device:
cudaDeviceProp prop;
memset(&prop, 0, sizeof(cudaDeviceProp));
prop.major = 1;
prop.minor = 3;
int dev;
cudaChooseDevice(&dev, &prop);
printf("ID of CUDA device closest to revision 1.3: %d\n", dev);
cudaSetDevice(dev);
Parallel Programming in CUDA C
Example: Vector Addition
In the previous section, we introduced the basic structure of a CUDA program. Now, let's explore how to implement parallel computation in CUDA.
Here’s a simple example of adding two vectors:
#define N 10

int a[N], b[N], c[N];
// Initialize a and b, then add them element by element on the CPU
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}
The GPU version of this operation can be implemented as follows:
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;           // each block handles one element
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}

int main() {
    // Suppose `int a[N], b[N], c[N]` are declared and a, b are initialized
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&dev_c, N * sizeof(int));
    add<<<N, 1>>>(dev_a, dev_b, dev_c);     // N blocks, one thread each
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) {
        printf("%d\n", c[i]);
    }
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}
Julia Set Example
The vector addition example is relatively straightforward. Now let's take a look at a more complex example: generating the Julia Set. The Julia Set is a type of fractal, where each pixel is determined by an iterative process based on its position.
Here’s the CPU implementation:
// DIM and the CPUBitmap helper are assumed to come from the accompanying utility code.
int julia(int x, int y);
void kernel(unsigned char *ptr);

int main() {
    CPUBitmap bitmap(DIM, DIM);
    unsigned char *ptr = bitmap.get_ptr();
    kernel(ptr);
    bitmap.display_and_exit();
}

void kernel(unsigned char *ptr) {
    for (int y = 0; y < DIM; y++) {
        for (int x = 0; x < DIM; x++) {
            int offset = x + y * DIM;
            int juliaValue = julia(x, y);
            // Pixels inside the set are drawn red, everything else black
            ptr[offset * 4 + 0] = 255 * juliaValue;
            ptr[offset * 4 + 1] = ptr[offset * 4 + 2] = 0;
            ptr[offset * 4 + 3] = 255;
        }
    }
}

int julia(int x, int y) {
    const float scale = 1.5;
    float jx = scale * (float)(DIM / 2 - x) / (DIM / 2);
    float jy = scale * (float)(DIM / 2 - y) / (DIM / 2);
    cuComplex c(-0.8, -0.156);
    cuComplex a(jx, jy);
    for (int i = 0; i < 200; i++) {
        a = a * a + c;
        if (a.magnitude2() > 1000) {
            return 0;
        }
    }
    return 1;
}
Now, let's look at the GPU version of this implementation:
int main() {
    CPUBitmap bitmap(DIM, DIM);
    unsigned char *dev_bitmap;
    cudaMalloc((void**)&dev_bitmap, bitmap.image_size());

    dim3 grid(DIM, DIM);                 // one block per pixel, arranged as a 2D grid
    kernel<<<grid, 1>>>(dev_bitmap);

    cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost);
    cudaFree(dev_bitmap);                // the device copy is no longer needed
    bitmap.display_and_exit();
}
Understanding Kernel Function Implementation
The kernel function implementation for the GPU is as follows:
__global__ void kernel(unsigned char *ptr) {
    // Map this block's position in the grid to a pixel coordinate
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;

    int juliaValue = julia(x, y);
    ptr[offset * 4 + 0] = 255 * juliaValue;
    ptr[offset * 4 + 1] = ptr[offset * 4 + 2] = 0;
    ptr[offset * 4 + 3] = 255;
}
Key points to note in this implementation:
- The __global__ keyword indicates that the function runs on the GPU but is called from the host.
- blockIdx.x and blockIdx.y identify the block within the 2D grid, and here they directly determine which pixel to compute.
- gridDim is a built-in constant, identical for every block, that holds the grid dimensions specified at launch; it is used to convert the (x, y) coordinate into a linear offset into the bitmap.

The kernel also calls julia(x, y); on the GPU, this helper and the cuComplex type it uses must be compiled as device code, as sketched below.
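A minimal sketch of those device-side helpers, mirroring the calls made in the CPU version above (the CPU version would be identical without the __device__ qualifiers):

struct cuComplex {
    float r, i;
    __device__ cuComplex(float a, float b) : r(a), i(b) {}
    __device__ float magnitude2(void) { return r * r + i * i; }
    __device__ cuComplex operator*(const cuComplex &a) {
        return cuComplex(r * a.r - i * a.i, i * a.r + r * a.i);
    }
    __device__ cuComplex operator+(const cuComplex &a) {
        return cuComplex(r + a.r, i + a.i);
    }
};

__device__ int julia(int x, int y) {
    const float scale = 1.5;
    float jx = scale * (float)(DIM / 2 - x) / (DIM / 2);
    float jy = scale * (float)(DIM / 2 - y) / (DIM / 2);
    cuComplex c(-0.8, -0.156);
    cuComplex a(jx, jy);
    for (int i = 0; i < 200; i++) {
        a = a * a + c;
        if (a.magnitude2() > 1000) {
            return 0;          // diverged: the point is outside the set
        }
    }
    return 1;                  // stayed bounded: the point is inside the set
}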
Thread Cooperation
In the previous chapter, we introduced block-level parallelism. However, blocks are isolated and cannot directly communicate with each other. In this chapter, we will discuss thread-level cooperation, where threads within the same block can communicate.
Splitting Blocks into Threads
CUDA allows each block to contain multiple threads, specified by the second parameter of the launch configuration <<<N, M>>>. Here's a simple example demonstrating this concept:
__global__ void add(int *a, int *b, int *c) {
    int tid = threadIdx.x;          // index of this thread within its block
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}
In the main function:
int main() {
    int a[N], b[N], c[N];
    // Initialize a[N] and b[N]
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&dev_c, N * sizeof(int));
    add<<<1, N>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    // Display c
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}
In this case, the launch configuration <<<1, N>>> specifies one block containing N threads: the first argument is the number of blocks and the second is the number of threads per block.
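This only works while N stays within the per-block thread limit (typically 1024 on current GPUs). A common pattern, and the one the dot-product example below relies on, is to combine the block and thread indices. A minimal sketch, reusing the device pointers from the example above:

__global__ void add(int *a, int *b, int *c) {
    // Global index: position within the block plus the block's offset in the grid
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}

// Launch enough 128-thread blocks to cover all N elements
// (128 is an arbitrary illustrative choice).
add<<<(N + 127) / 128, 128>>>(dev_a, dev_b, dev_c);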
Synchronization and Shared Memory
The introduction of thread-level parallelism allows the threads within a block to share memory, declared with the __shared__ keyword. Here's an example of calculating the dot product using shared memory:
const int N = 33 * 1024;
const int threadsPerBlock = 256;

__global__ void dot(float *a, float *b, float *c) {
    __shared__ float cache[threadsPerBlock];   // one slot per thread in the block
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    // Each thread accumulates its share of the pairwise products,
    // striding by the total number of launched threads.
    float temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[cacheIndex] = temp;

    __syncthreads();
    // Reduction logic follows (see the sketch in the next section)
}
The synchronization function __syncthreads() ensures that every thread in the block has written its partial sum into cache before any thread proceeds to read from it.
Reduction Techniques
To combine the per-thread partial sums stored in the shared cache, the usual approach is a parallel (tree) reduction: at each step, half of the active threads add a partner's value into their own slot, halving the number of partial sums until a single per-block result remains in cache[0]. Each block then writes its result to global memory, and the handful of per-block values is summed on the host.
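A minimal sketch of how the dot kernel above might finish (this assumes threadsPerBlock is a power of two, as with the constant defined earlier):

    // Parallel tree reduction over the shared cache: each step halves the
    // number of active threads and adds pairs of partial sums together.
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i) {
            cache[cacheIndex] += cache[cacheIndex + i];
        }
        __syncthreads();    // every step must complete before the next begins
        i /= 2;
    }

    // Thread 0 writes this block's partial dot product; the per-block
    // results in c[] are summed afterwards on the host.
    if (cacheIndex == 0) {
        c[blockIdx.x] = cache[0];
    }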
Using the cudaMemcpyAsync function allows for asynchronous memory transfers, enabling better utilization of available resources by overlapping computation with data transfer.
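A hypothetical sketch of the usual pattern (the scale kernel and buffer names are placeholders): cudaMemcpyAsync requires page-locked host memory and a stream, and work queued in the same stream executes in order while the host is free to continue.

// Placeholder kernel used only to illustrate the launch
__global__ void scale(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *host_buf, *dev_buf;
    // cudaMemcpyAsync requires page-locked (pinned) host memory
    cudaHostAlloc((void**)&host_buf, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&dev_buf, n * sizeof(float));

    // Both calls return immediately; the copy and the kernel execute
    // in order within the stream while the host continues running.
    cudaMemcpyAsync(dev_buf, host_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev_buf, n);

    cudaStreamSynchronize(stream);    // wait for the queued work to finish
    cudaStreamDestroy(stream);
    cudaFreeHost(host_buf);
    cudaFree(dev_buf);
    return 0;
}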
Conclusion and Next Steps
This introduction to CUDA programming provides a foundation for leveraging GPU resources for parallel computation, from the basic host/device model through block- and thread-level parallelism, shared memory, and reduction. In future sections, we will delve deeper into advanced topics such as constant memory, texture memory, and multi-GPU setups; exploring those features and optimizing existing code is the natural next step for further performance gains.