In 2007, NVIDIA introduced a revolutionary parallel programming model that could leverage the computing power of their graphics processing units (GPUs). They called this CUDA (Compute Unified Device Architecture). As you will discover in the tutorials, the basic idea is to execute several parallel threads within your code on the GPU, each thread doing the exact same computation, but on different data.
Before starting your adventures in CUDA land, you will need to download the CUDA toolkit from the NVIDIA website and also download the latest drivers for your graphics card.
If you already know C or C++ and have some parallel programming experience, CUDA is very easy to pick up. But even if you have no parallel programming experience whatsoever, this set of tutorials will get you up and running with CUDA in no time at all.
Anytime you encounter a new platform on which you plan to develop CUDA applications, you will have some questions about the hardware: How many GPUs do I have? What are their specifications? The code below answers these questions.
In this first bit of code, we will find out how many CUDA-capable devices are supported on whatever platform you are running this on and print some information about each such device.
Create a file called deviceInfo.cu and copy the following code into this file. Note the extension - .cu for CUDA source code.
// get device (GPU) information and specifications
#include <iostream>

int main(void)
{
    cudaDeviceProp prop;
    int count;

    // how many CUDA-capable devices are present?
    cudaGetDeviceCount( &count );

    // print a short summary of each device
    for (int i = 0; i < count; i++)
    {
        std::cout << "---------------------------------------------------------------" << std::endl;
        cudaGetDeviceProperties(&prop, i);
        std::cout << "Name " << prop.name << std::endl;
        std::cout << "GPU clock rate " << (double)prop.clockRate / 1024 << " MHz" << std::endl;
        std::cout << "Compute capability " << prop.major << "." << prop.minor << std::endl;
        std::cout << "Total global memory " << (double)prop.totalGlobalMem / (1024*1024) << " MB" << std::endl;
        std::cout << "Total constant memory " << (double)prop.totalConstMem / (1024) << " KB" << std::endl;
        std::cout << "Shared memory per block " << (double)prop.sharedMemPerBlock / (1024) << " KB" << std::endl;
        std::cout << "Maximum threads per block " << prop.maxThreadsPerBlock << std::endl;
        std::cout << "Maximum threads along X " << prop.maxThreadsDim[0] << std::endl;
        std::cout << "                      Y " << prop.maxThreadsDim[1] << std::endl;
        std::cout << "                      Z " << prop.maxThreadsDim[2] << std::endl;
        std::cout << "Maximum grid size along X " << prop.maxGridSize[0] << std::endl;
        std::cout << "                        Y " << prop.maxGridSize[1] << std::endl;
        std::cout << "                        Z " << prop.maxGridSize[2] << std::endl;
        std::cout << "Warp size " << prop.warpSize << std::endl;
        std::cout << "Multiprocessor count " << prop.multiProcessorCount << std::endl;
        std::cout << std::endl;
    }
    return 0;
}
Use the following Makefile to compile the above code:
deviceInfo.x:
nvcc deviceInfo.cu -o deviceInfo.x
On my 2013 Macbook Pro, I get the following result:
bash-3.2$ ./deviceInfo.x
------------------------------------------------
Name GeForce GT 750M
GPU clock rate 903.809 MHz
Compute capability 3.0
Total global memory 2047.56 MB
Total constant memory 64 KB
Shared memory per block 48 KB
Maximum threads per block 1024
Maximum threads along X 1024
                      Y 1024
                      Z 64
Maximum grid size along X 2147483647
                        Y 65535
                        Z 65535
Warp size 32
Multiprocessor count 2
The above result tells us that there is one CUDA-capable GPU on this machine, and the very first line gives its name: GeForce GT 750M. A couple of other points worth noting:
This notebook GPU has about 2 GB of global memory; some really high-end GPUs from NVIDIA have up to 12 GB! For their latest offerings, check out the list of NVIDIA GPUs.
The desktop versions of GPUs are usually much more powerful than the notebook or laptop versions.
Now that you have a CUDA-capable device, let's move on to writing our first CUDA kernel. Create a file called hello.cu and copy the following code into it. Notes on the header files, the device code, and the host code follow the listing.
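Here is a minimal sketch of what hello.cu can look like. It uses the kernel name printMessage and the <<< 1, 32 >>> launch configuration that the rest of this tutorial assumes; your own version may differ in small details.

// print a message from every CUDA thread
#include <cstdio>

// device code: each of the 32 threads runs one instance of this kernel
__global__ void printMessage(void)
{
    // identify this thread within its block
    int id = threadIdx.x;
    printf("Hello from thread %d\n", id);
}

// host code: runs on the CPU and launches the kernel on the GPU
int main(void)
{
    // launch the kernel using 1 block and 32 threads in that block
    printMessage<<< 1, 32 >>>();

    // wait for the kernel to finish before the program exits
    cudaDeviceSynchronize();

    return 0;
}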
Header files: the only header needed is <cstdio>, for the printf call inside the kernel.
Device code: in general, each instance of this kernel corresponds to a unique CUDA thread. Each thread can operate on a unique piece of data in parallel.
Host code: CUDA kernels are invoked using the triple chevron notation, <<< blocks, threads >>>. In this case, we launch the kernel using 1 block and 32 threads in that block.
Kernel launches are asynchronous. Program execution on the host resumes immediately after the kernel is launched. If we do not put in the cudaDeviceSynchronize() call, the main program can terminate before the printf inside the kernel is executed.
To compile this code, use the Makefile given below, with appropriately inserted tabs. Make sure you use the -arch=sm_30 flag, otherwise you will see complaints from the compiler about the printf statement inside the kernel.
all:
nvcc -arch=sm_30 hello.cu -o hello.x
clean:
rm *.x
In CUDA-speak, the main program above executes on the "host" (CPU) and the kernel executes on the "device" (GPU).
In this case, the kernel is executed by 32 threads at once, in parallel. The number of threads is specified in the host code between the triple chevrons.
To be more precise, we launch 32 CUDA threads per CUDA block and launch just 1 CUDA block. A collection of 32 threads is called a warp. Inside the kernel, each thread can be uniquely identified by its "id". We can use this "id" to decide what piece of data the kernel operates on. Here, we simply print the "id" to the screen.
The output of the code is copied below:
bash-3.2$ ./hello.x
Hello from thread 0
Hello from thread 1
Hello from thread 2
Hello from thread 3
Hello from thread 4
Hello from thread 5
Hello from thread 6
Hello from thread 7
Hello from thread 8
Hello from thread 9
Hello from thread 10
Hello from thread 11
Hello from thread 12
Hello from thread 13
Hello from thread 14
Hello from thread 15
Hello from thread 16
Hello from thread 17
Hello from thread 18
Hello from thread 19
Hello from thread 20
Hello from thread 21
Hello from thread 22
Hello from thread 23
Hello from thread 24
Hello from thread 25
Hello from thread 26
Hello from thread 27
Hello from thread 28
Hello from thread 29
Hello from thread 30
Hello from thread 31
That the threads above appear to execute in order is not chance or luck. It is simply because we chose to launch all 32 threads in a single block, like so:
[Schematic: a single block containing threads 0 through 31.]
Try changing the code by launching 64 and then 96 threads in 1 block, by changing the second launch parameter inside the triple chevrons. Go ahead and do the experiment. I'll wait for you here. Once you are done, look at the results below.
Shown below is a schematic of 64 threads in 1 block, launched using
printMessage<<< 1, 64 >>>();
[Schematic: a single block of 64 threads, drawn as two consecutive warps of 32 threads each.]
In the above schematic, I have purposely shown the 64 threads as a set of two warps (1 warp = 32 threads). This is because when the above kernel runs on the GPU, CUDA actually runs the individual threads in units of warps. Let us figure out what this might mean. Shown below are three different runs using 64 threads in one block:
Run 1:
Hello from thread 0
Hello from thread 1
...
Hello from thread 31
Hello from thread 32
Hello from thread 33
...
Hello from thread 63

Run 2:
Hello from thread 0
Hello from thread 1
...
Hello from thread 31
Hello from thread 32
Hello from thread 33
...
Hello from thread 63

Run 3:
Hello from thread 32
Hello from thread 33
...
Hello from thread 63
Hello from thread 0
Hello from thread 1
...
Hello from thread 31
We can make two deductions from the above results: threads within a warp appear to execute (and print) in order, but the two warps themselves are not guaranteed to execute in any particular order; in the third run, the second warp ran before the first.
Let us see what happens when we launch 96 threads using
printMessage<<< 1, 96 >>>();
Like before, I have organized the 96 threads as a set of three (3) consecutive warps within the block:
[Schematic: a single block of 96 threads, drawn as three consecutive warps of 32 threads each.]
Here are three different runs with 96 threads in 1 block:
Run 1:
Hello from thread 64
Hello from thread 65
...
Hello from thread 95
Hello from thread 0
Hello from thread 1
...
Hello from thread 31
Hello from thread 32
...
Hello from thread 63

Run 2:
Hello from thread 0
Hello from thread 1
...
Hello from thread 31
Hello from thread 32
...
Hello from thread 63
Hello from thread 64
...
Hello from thread 95

Run 3:
Hello from thread 0
Hello from thread 1
...
Hello from thread 31
Hello from thread 32
...
Hello from thread 63
Hello from thread 64
...
Hello from thread 95
Again, threads within a warp execute in order, but the three warps are not guaranteed to execute in order.
In general, it is useful to remember that threads within a block are executed in units of warps. No matter how many threads you launch, they will be split into warps and each warp will execute as a unit.
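To see the warp grouping explicitly, here is a small illustrative program (not part of the tutorial code; the kernel name whichWarp is just an example) that prints which warp and lane each thread belongs to. warpSize is a built-in variable available inside device code.

// report the warp and lane of every thread in a block
#include <cstdio>

__global__ void whichWarp(void)
{
    // warpSize is a built-in device variable (32 on current NVIDIA GPUs)
    int warp_id = threadIdx.x / warpSize; // which warp within this block
    int lane_id = threadIdx.x % warpSize; // position within that warp
    printf("Thread %d is lane %d of warp %d\n", threadIdx.x, lane_id, warp_id);
}

int main(void)
{
    // 96 threads in one block = 3 warps
    whichWarp<<< 1, 96 >>>();
    cudaDeviceSynchronize();
    return 0;
}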
The maximum number of threads you can launch in this case is 1024, as we found earlier in tutorial 001, when we printed out system information.
So what happens if we want to use more than 1024 threads? We use more than 1 block!
Let's see what happens if we tweak the code in tutorial 002 by distributing the threads over 32 blocks, with a single thread in each block. To implement this in the above code, you need to make two small changes:
printMessage<<<32,1>>>();
// get global thread index
int t = blockIdx.x;
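Putting these two changes together, the tweaked hello.cu might look like the following sketch (again using the printMessage name):

#include <cstdio>

__global__ void printMessage(void)
{
    // with 1 thread per block, the block index alone identifies the thread
    int t = blockIdx.x;
    printf("Hello from thread %d\n", t);
}

int main(void)
{
    // 32 blocks, 1 thread per block
    printMessage<<< 32, 1 >>>();
    cudaDeviceSynchronize();
    return 0;
}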
The thread layout would now look something like this:
[Schematic: 32 blocks, numbered 0 through 31, each containing a single thread.]
In such a case, there is no guarantee that the 32 blocks will execute in some pre-defined order. This is what I get after executing the tweaked code two times in succession:
bash-3.2$ ./hello.x
Hello from thread 1
Hello from thread 6
Hello from thread 3
Hello from thread 0
Hello from thread 5
Hello from thread 2
Hello from thread 7
Hello from thread 4
Hello from thread 9
Hello from thread 14
Hello from thread 11
Hello from thread 8
Hello from thread 13
Hello from thread 10
Hello from thread 15
Hello from thread 12
Hello from thread 17
Hello from thread 22
Hello from thread 19
Hello from thread 16
Hello from thread 21
Hello from thread 18
Hello from thread 23
Hello from thread 20
Hello from thread 25
Hello from thread 30
Hello from thread 27
Hello from thread 24
Hello from thread 29
Hello from thread 26
Hello from thread 31
Hello from thread 28

bash-3.2$ ./hello.x
Hello from thread 3
Hello from thread 0
Hello from thread 5
Hello from thread 2
Hello from thread 7
Hello from thread 4
Hello from thread 1
Hello from thread 6
Hello from thread 11
Hello from thread 8
Hello from thread 13
Hello from thread 10
Hello from thread 15
Hello from thread 12
Hello from thread 9
Hello from thread 14
Hello from thread 19
Hello from thread 16
Hello from thread 21
Hello from thread 18
Hello from thread 23
Hello from thread 20
Hello from thread 17
Hello from thread 22
Hello from thread 27
Hello from thread 24
Hello from thread 29
Hello from thread 26
Hello from thread 31
Hello from thread 28
Hello from thread 25
Hello from thread 30
You will observe from the above runs that CUDA blocks are not executed in any guaranteed order. You should design your parallel application such that each block can execute completely independently of the other blocks.
So what is the best way to distribute threads for a parallel calculation? The answer is: it depends on the problem. At this stage, it is sufficient to remember that threads within a block execute in units of warps (32 threads), and that blocks should be written so that they can execute independently of one another, in any order.
Our next application will use the GPU to add two input vectors (or 1D arrays) of the same size and store the result in a new vector of the same size.
Although simple, this application illustrates the modus operandi of almost all CUDA applications that you will write: allocate memory on the host and on the device, copy the input data from the host to the device, launch a kernel that performs the computation in parallel, and copy the result back from the device to the host.
As you can imagine, this application is ideal for parallel implementation. Adding elements of arrays can be done completely independently of each other.
Create a file called vecAdd.cu and copy the following code there.
#include <iostream>

// CUDA kernel to add two vectors
__global__ void add_on_GPU(const int *a,   // first input vector
                           const int *b,   // second input vector
                           int *c,         // result vector
                           const int SIZE) // size of the arrays
{
    // identify the index of the data this thread should work on
    int id = blockIdx.x;

    // make sure we stay within bounds while doing the additions
    if (id < SIZE)
    {
        c[id] = a[id] + b[id]; // the actual addition
    }
}

// the main program runs on the CPU and calls the kernel code
int main(void)
{
    // parameter describing the size of the 1D arrays
    const int N = 100;

    // allocate space on the host (CPU) for storing the input arrays (a and b)
    // and the output array (c)
    int *a = new int[N];
    int *b = new int[N];
    int *c = new int[N];

    // fill arrays "a" and "b" on the host
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2*i;
    }

    // define device pointers for the same arrays
    int *dev_a, *dev_b, *dev_c;

    // allocate space on the device (GPU)
    cudaMalloc((void **) &dev_a, N*sizeof(int));
    cudaMalloc((void **) &dev_b, N*sizeof(int));
    cudaMalloc((void **) &dev_c, N*sizeof(int));

    // copy array contents from the host (CPU) to the device (GPU)
    cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);

    // launch the GPU kernel for parallel addition of the input
    // vectors "dev_a" and "dev_b" and store the result in "dev_c"
    // launch N blocks and use 1 thread per block
    add_on_GPU<<<N, 1>>>(dev_a, dev_b, dev_c, N);

    // copy the answer back to the host (CPU) from the device (GPU)
    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);

    // print the answer to the screen
    for (int i = 0; i < N; i++) {
        std::cout << a[i] << " + " << b[i] << " = " << c[i] << std::endl;
    }

    // free device and host memory
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    delete[] a;
    delete[] b;
    delete[] c;

    return 0;
}
Several instances of the device code or CUDA kernel run simultaneously (in parallel) on the GPU. Each instance has a unique value for id and operates on a different piece of data.
In this example, each id value corresponds to the index of the elements in the vector or 1D array.
The kernel accesses data stored in "global memory" on the GPU. Data is copied to GPU memory from the host.
The result (values for the vector c) is also stored (in parallel) in GPU memory and is later copied back to the host.
The host code (the main program) runs on the CPU. It initializes the data for the input vectors (a and b) and allocates space to store the result (vector c) on both the host and the device.
The input arrays are then copied to the GPU memory.
Next, we launch the CUDA kernel using N threads (N blocks, 1 thread per block to be precise). This passes control to the GPU, which then executes the kernel to calculate the result.
After kernel execution is complete, the result (stored in dev_c, in GPU memory) is copied back to the host.
The code can be compiled using the Makefile below:
vecAdd.x:
nvcc vecAdd.cu -o vecAdd.x
Here is the result of running the code:
bash-3.2$ ./vecAdd.x
0 + 0 = 0
1 + 2 = 3
2 + 4 = 6
3 + 6 = 9
4 + 8 = 12
5 + 10 = 15
6 + 12 = 18
7 + 14 = 21
. . . .
91 + 182 = 273
92 + 184 = 276
93 + 186 = 279
94 + 188 = 282
95 + 190 = 285
96 + 192 = 288
97 + 194 = 291
98 + 196 = 294
99 + 198 = 297
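A natural next experiment (not part of vecAdd.cu as written above) is to use many threads per block instead of one thread per block. A common pattern, shown in the sketch below, is to compute a global index from the block index, the block size and the thread index; the names add_on_GPU_v2 and threadsPerBlock are just examples. To try it, replace the kernel in vecAdd.cu with:

// kernel variant: many threads per block
__global__ void add_on_GPU_v2(const int *a, const int *b, int *c, const int SIZE)
{
    // global index = (block index) * (threads per block) + (thread index within the block)
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // the last block may contain more threads than remaining elements
    if (id < SIZE)
    {
        c[id] = a[id] + b[id];
    }
}

and replace the launch line in main() with:

// enough blocks of 256 threads each to cover all N elements
const int threadsPerBlock = 256;
const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
add_on_GPU_v2<<< blocks, threadsPerBlock >>>(dev_a, dev_b, dev_c, N);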
As seen in the first few tutorials, the basic unit of parallel execution on the GPU is a thread. In tutorial 002, we looked at two examples of how threads can be organized in CUDA.
We first used a 1D collection of 32 threads inside a single block:
[Schematic: 32 threads, shown as a single row of dots, inside one block.]
We can use the full 3D notation for blocks and threads, using the dim3 type:
dim3 blocks(1, 1, 1);
dim3 threads(32, 1, 1);
kernel <<< blocks, threads >>> (kernel function parameters);
Inside a kernel launched with the above parameters, we can use the following CUDA expressions:
int thread_ID = threadIdx.x; // returns an integer in the range [0, 31]
int block_ID = blockIdx.x; // returns 0 (we only have 1 block and C counts from 0)
int block_size_x = blockDim.x; // returns 32 (number of threads in this block along x)
int block_size_y = blockDim.y; // returns 1 (number of threads in this block along y)
int block_size_z = blockDim.z; // returns 1 (number of threads in this block along z)
And here is a 2D collection of 8 x 8 (= 64) threads in a block:
[Schematic: an 8 x 8 arrangement of threads, shown as dots, inside a single block.]
This can be represented in CUDA using:
dim3 blocks(1, 1, 1);
dim3 threads(8, 8, 1);
We can also use a 3D arrangement of threads within a block by specifying a value greater than 1 for the third parameter.
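As a concrete illustration of the 8 x 8 arrangement above, the sketch below (the kernel name print2D is just an example) shows how a kernel can read its 2D position within the block and flatten it into a single index:

#include <cstdio>

__global__ void print2D(void)
{
    // 2D position of this thread within its block
    int tx = threadIdx.x; // 0 .. blockDim.x - 1
    int ty = threadIdx.y; // 0 .. blockDim.y - 1

    // flatten the 2D position into a single index, row by row
    int id = ty * blockDim.x + tx;
    printf("Thread (%d, %d) has flattened index %d\n", tx, ty, id);
}

int main(void)
{
    dim3 blocks(1, 1, 1);
    dim3 threads(8, 8, 1); // 8 x 8 = 64 threads in one block
    print2D<<< blocks, threads >>>();
    cudaDeviceSynchronize();
    return 0;
}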
CUDA blocks can be arranged in 1D, 2D or 3D fashion, just like threads. The entire set of blocks is referred to as a CUDA grid.
A grid with 4 x 4 x 1 blocks, using 2 x 2 x 1 threads in each block is shown below:
[Schematic: a 4 x 4 grid of blocks, each block containing 2 x 2 threads.]
The above grid can be specified using:
dim3 threads(2, 2, 1);
dim3 blocks(4, 4, 1);
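Inside a kernel launched with this grid, each thread can combine its block index and thread index to obtain a unique global (x, y) position. The sketch below (the kernel name printGlobalIndex is just an example) prints these coordinates, which range from (0, 0) to (7, 7) for a 4 x 4 grid of 2 x 2 blocks:

#include <cstdio>

__global__ void printGlobalIndex(void)
{
    // global coordinates of this thread across the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    printf("Block (%d, %d), thread (%d, %d) -> global position (%d, %d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, x, y);
}

int main(void)
{
    dim3 threads(2, 2, 1);
    dim3 blocks(4, 4, 1);
    printGlobalIndex<<< blocks, threads >>>();
    cudaDeviceSynchronize();
    return 0;
}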