In 2007, NVIDIA introduced a revolutionary parallel programming model that could leverage the computing power of their graphics processing units (GPUs). They called this CUDA (Compute Unified Device Architecture). As you will discover in the tutorials, the basic idea is to execute several parallel threads within your code on the GPU, each thread doing the exact same computation, but on different data.
Before starting your adventures in CUDA land, you will need to download the CUDA toolkit from the NVIDIA website and also download the latest drivers for your graphics card.
If you already know C or C++ and have some parallel programming experience, CUDA is very easy to pick up. But even if you have no parallel programming experience whatsoever, this set of tutorials will get you up and running with CUDA in no time at all.
Anytime you encounter a new platform on which you plan to develop CUDA applications, you will have some questions about the hardware: How many GPUs do I have? What are their specifications? The code below answers these questions.
In this first bit of code, we will find out how many CUDA-capable devices are supported on whatever platform you are running this on and print some information about each such device.
Create a file called deviceInfo.cu and copy the following code into this file. Note the extension - .cu for CUDA source code.
// get device (GPU) information and specifications
#include <iostream>

int main(void)
{
    cudaDeviceProp prop;
    int count;

    // how many CUDA-capable devices are present?
    cudaGetDeviceCount( &count );

    // print a short summary of each device
    for (int i = 0; i < count; i++)
    {
        std::cout << "---------------------------------------------------------------" << std::endl;
        cudaGetDeviceProperties(&prop, i);
        std::cout << "Name " << prop.name << std::endl;
        std::cout << "GPU clock rate " << (double)prop.clockRate / 1024 << " MHz" << std::endl;
        std::cout << "Compute capability " << prop.major << "." << prop.minor << std::endl;
        std::cout << "Total global memory " << (double)prop.totalGlobalMem / (1024*1024) << " MB" << std::endl;
        std::cout << "Total constant memory " << (double)prop.totalConstMem / (1024) << " KB" << std::endl;
        std::cout << "Shared memory per block " << (double)prop.sharedMemPerBlock / (1024) << " KB" << std::endl;
        std::cout << "Maximum threads per block " << prop.maxThreadsPerBlock << std::endl;
        std::cout << "Maximum threads along X " << prop.maxThreadsDim[0] << std::endl;
        std::cout << "                      Y " << prop.maxThreadsDim[1] << std::endl;
        std::cout << "                      Z " << prop.maxThreadsDim[2] << std::endl;
        std::cout << "Maximum grid size along X " << prop.maxGridSize[0] << std::endl;
        std::cout << "                        Y " << prop.maxGridSize[1] << std::endl;
        std::cout << "                        Z " << prop.maxGridSize[2] << std::endl;
        std::cout << "Warp size " << prop.warpSize << std::endl;
        std::cout << "Multiprocessor count " << prop.multiProcessorCount << std::endl;
        std::cout << std::endl;
    }
    return 0;
}
Use the following Makefile to compile the above code:
deviceInfo.x:
nvcc deviceInfo.cu -o deviceInfo.x
On my 2013 Macbook Pro, I get the following result:
bash-3.2$ ./deviceInfo.x
------------------------------------------------
Name GeForce GT 750M
GPU clock rate 903.809 MHz
Compute capability 3.0
Total global memory 2047.56 MB
Total constant memory 64 KB
Shared memory per block 48 KB
Maximum threads per block 1024
Maximum threads along X 1024
                      Y 1024
                      Z 64
Maximum grid size along X 2147483647
                        Y 65535
                        Z 65535
Warp size 32
Multiprocessor count 2
The above result tells us that there is one CUDA-capable GPU on this machine, and the very first line gives its name: GeForce GT 750M. A couple of other points worth noting:
This notebook GPU has about 2 GB of global memory; some really high-end GPUs from NVIDIA have up to 12 GB! For their latest offerings, check out the list of NVIDIA GPUs.
The desktop versions of GPUs are usually much more powerful than the notebook or laptop versions.
Now that you have a CUDA-capable device, let's move on to writing our first CUDA kernel. Create a file called hello.cu and copy the following code into it. Notes on the header files, the device code, and the host code follow the listing.
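Here is a minimal sketch of what hello.cu can look like. It uses the kernel name printMessage and the <<< 1, 32 >>> launch configuration that the rest of this tutorial assumes; your own version may differ in small details.

// print a message from every CUDA thread
#include <cstdio>

// device code: each of the 32 threads runs one instance of this kernel
__global__ void printMessage(void)
{
    // identify this thread within its block
    int id = threadIdx.x;
    printf("Hello from thread %d\n", id);
}

// host code: runs on the CPU and launches the kernel on the GPU
int main(void)
{
    // launch the kernel using 1 block and 32 threads in that block
    printMessage<<< 1, 32 >>>();

    // wait for the kernel to finish before the program exits
    cudaDeviceSynchronize();

    return 0;
}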
Header files: the only header needed is <cstdio>, for the printf call inside the kernel.
Device code: in general, each instance of this kernel corresponds to a unique CUDA thread. Each thread can operate on a unique piece of data in parallel.
Host code: CUDA kernels are invoked using the triple chevron notation, <<< blocks, threads >>>. In this case, we launch the kernel using 1 block and 32 threads in that block.
Kernel launches are asynchronous. Program execution on the host resumes immediately after the kernel is launched. If we do not put in the cudaDeviceSynchronize() call, the main program can terminate before the printf inside the kernel is executed.
To compile this code, use the Makefile given below, with appropriately inserted tabs. Make sure you use the -arch=sm_30 flag, otherwise you will see complaints from the compiler about the printf statement inside the kernel.
all:
nvcc -arch=sm_30 hello.cu -o hello.x
clean:
rm *.x
In CUDA-speak, the main program above executes on the "host" (CPU) and the kernel executes on the "device" (GPU).
In this case, the kernel is executed by 32 threads at once, in parallel. The number of threads is specified in the host code between the triple chevrons.
To be more precise, we launch 32 CUDA threads per CUDA block and launch just 1 CUDA block. A collection of 32 threads is called a warp. Inside the kernel, each thread can be uniquely identified by its "id". We can use this "id" to decide what piece of data the kernel operates on. Here, we simply print the "id" to the screen.
The output of the code is copied below:
bash-3.2$ ./hello.x
Hello from thread 0
Hello from thread 1
Hello from thread 2
Hello from thread 3
Hello from thread 4
Hello from thread 5
Hello from thread 6
Hello from thread 7
Hello from thread 8
Hello from thread 9
Hello from thread 10
Hello from thread 11
Hello from thread 12
Hello from thread 13
Hello from thread 14
Hello from thread 15
Hello from thread 16
Hello from thread 17
Hello from thread 18
Hello from thread 19
Hello from thread 20
Hello from thread 21
Hello from thread 22
Hello from thread 23
Hello from thread 24
Hello from thread 25
Hello from thread 26
Hello from thread 27
Hello from thread 28
Hello from thread 29
Hello from thread 30
Hello from thread 31
That the threads above appear to execute in order is not chance or luck. It is simply because we chose to launch all 32 threads in a single block, like so:
[Schematic: a single block containing threads 0 through 31.]
Try changing the code by launching 64 and then 96 threads in 1 block, by changing the second launch parameter inside the triple chevrons. Go ahead and do the experiment. I'll wait for you here. Once you are done, look at the results below.
Shown below is a schematic of 64 threads in 1 block, launched using
printMessage<<< 1, 64 >>>();
[Schematic: a single block of 64 threads, drawn as two consecutive warps of 32 threads each.]
In the above schematic, I have purposely shown the 64 threads as a set of two warps (1 warp = 32 threads). This is because when the above kernel runs on the GPU, CUDA actually runs the individual threads in units of warps. Let us figure out what this might mean. Shown below are three different runs using 64 threads in one block:
Run 1:
Hello from thread 0
Hello from thread 1
...
Hello from thread 31
Hello from thread 32
Hello from thread 33
...
Hello from thread 63

Run 2:
Hello from thread 0
Hello from thread 1
...
Hello from thread 31
Hello from thread 32
Hello from thread 33
...
Hello from thread 63

Run 3:
Hello from thread 32
Hello from thread 33
...
Hello from thread 63
Hello from thread 0
Hello from thread 1
...
Hello from thread 31
We can make two deductions from the above results: threads within a warp appear to execute (and print) in order, but the two warps themselves are not guaranteed to execute in any particular order; in the third run, the second warp ran before the first.
Let us see what happens when we launch 96 threads using
printMessage<<< 1, 96 >>>();
Like before, I have organized the 96 threads as a set of three (3) consecutive warps within the block:
[Schematic: a single block of 96 threads, drawn as three consecutive warps of 32 threads each.]
Here are three different runs with 96 threads in 1 block:
Run 1:
Hello from thread 64
Hello from thread 65
...
Hello from thread 95
Hello from thread 0
Hello from thread 1
...
Hello from thread 31
Hello from thread 32
...
Hello from thread 63

Run 2:
Hello from thread 0
Hello from thread 1
...
Hello from thread 31
Hello from thread 32
...
Hello from thread 63
Hello from thread 64
...
Hello from thread 95

Run 3:
Hello from thread 0
Hello from thread 1
...
Hello from thread 31
Hello from thread 32
...
Hello from thread 63
Hello from thread 64
...
Hello from thread 95
Again, threads within a warp execute in order, but the three warps are not guaranteed to execute in order.
In general, it is useful to remember that threads within a block are executed in units of warps. No matter how many threads you launch, they will be split into warps and each warp will execute as a unit.
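To see the warp grouping explicitly, here is a small illustrative program (not part of the tutorial code; the kernel name whichWarp is just an example) that prints which warp and lane each thread belongs to. warpSize is a built-in variable available inside device code.

// report the warp and lane of every thread in a block
#include <cstdio>

__global__ void whichWarp(void)
{
    // warpSize is a built-in device variable (32 on current NVIDIA GPUs)
    int warp_id = threadIdx.x / warpSize; // which warp within this block
    int lane_id = threadIdx.x % warpSize; // position within that warp
    printf("Thread %d is lane %d of warp %d\n", threadIdx.x, lane_id, warp_id);
}

int main(void)
{
    // 96 threads in one block = 3 warps
    whichWarp<<< 1, 96 >>>();
    cudaDeviceSynchronize();
    return 0;
}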
The maximum number of threads you can launch in this case is 1024, as we found earlier in tutorial 001, when we printed out system information.
So what happens if we want to use more than 1024 threads? We use more than 1 block!
Let's see what happens if we tweak the code in tutorial 002 by distributing the threads over 32 blocks, with a single thread in each block. To implement this in the above code, you need to make two small changes:
printMessage<<<32,1>>>();
// get global thread index
int t = blockIdx.x;
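Putting these two changes together, the tweaked hello.cu might look like the following sketch (again using the printMessage name):

#include <cstdio>

__global__ void printMessage(void)
{
    // with 1 thread per block, the block index alone identifies the thread
    int t = blockIdx.x;
    printf("Hello from thread %d\n", t);
}

int main(void)
{
    // 32 blocks, 1 thread per block
    printMessage<<< 32, 1 >>>();
    cudaDeviceSynchronize();
    return 0;
}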
The thread layout would now look something like this:
[Schematic: 32 blocks, numbered 0 through 31, each containing a single thread.]
In such a case, there is no guarantee that the 32 blocks will execute in some pre-defined order. This is what I get after executing the tweaked code two times in succession:
bash-3.2$ ./hello.x
Hello from thread 1
Hello from thread 6
Hello from thread 3
Hello from thread 0
Hello from thread 5
Hello from thread 2
Hello from thread 7
Hello from thread 4
Hello from thread 9
Hello from thread 14
Hello from thread 11
Hello from thread 8
Hello from thread 13
Hello from thread 10
Hello from thread 15
Hello from thread 12
Hello from thread 17
Hello from thread 22
Hello from thread 19
Hello from thread 16
Hello from thread 21
Hello from thread 18
Hello from thread 23
Hello from thread 20
Hello from thread 25
Hello from thread 30
Hello from thread 27
Hello from thread 24
Hello from thread 29
Hello from thread 26
Hello from thread 31
Hello from thread 28

bash-3.2$ ./hello.x
Hello from thread 3
Hello from thread 0
Hello from thread 5
Hello from thread 2
Hello from thread 7
Hello from thread 4
Hello from thread 1
Hello from thread 6
Hello from thread 11
Hello from thread 8
Hello from thread 13
Hello from thread 10
Hello from thread 15
Hello from thread 12
Hello from thread 9
Hello from thread 14
Hello from thread 19
Hello from thread 16
Hello from thread 21
Hello from thread 18
Hello from thread 23
Hello from thread 20
Hello from thread 17
Hello from thread 22
Hello from thread 27
Hello from thread 24
Hello from thread 29
Hello from thread 26
Hello from thread 31
Hello from thread 28
Hello from thread 25
Hello from thread 30
You will observe from the above runs that CUDA blocks are not executed in any guaranteed order. You should design your parallel application such that each block can execute completely independently of the other blocks.
So what is the best way to distribute threads for a parallel calculation? The answer is: it depends on the problem. At this stage, it is sufficient to remember that threads within a block execute in units of warps (32 threads), and that blocks should be written so that they can execute independently of one another, in any order.
Our next application will use the GPU to add two input vectors (or 1D arrays) of the same size and store the result in a new vector of the same size.
Although simple, this application illustrates the modus operandi of almost all CUDA applications that you will write: allocate memory on the host and on the device, copy the input data from the host to the device, launch a kernel that performs the computation in parallel, and copy the result back from the device to the host.
As you can imagine, this application is ideal for parallel implementation. Adding elements of arrays can be done completely independently of each other.
Create a file called vecAdd.cu and copy the following code there.
#include <iostream>

// CUDA kernel to add two vectors
__global__ void add_on_GPU(const int *a,   // first input vector
                           const int *b,   // second input vector
                           int *c,         // result vector
                           const int SIZE) // size of the arrays
{
    // identify the index of the data this thread should work on
    int id = blockIdx.x;

    // make sure we stay within bounds while doing the additions
    if (id < SIZE)
    {
        c[id] = a[id] + b[id]; // the actual addition
    }
}

// the main program runs on the CPU and calls the kernel code
int main(void)
{
    // parameter describing the size of the 1D arrays
    const int N = 100;

    // allocate space on the host (CPU) for storing the input arrays (a and b)
    // and the output array (c)
    int *a = new int[N];
    int *b = new int[N];
    int *c = new int[N];

    // fill arrays "a" and "b" on the host
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2*i;
    }

    // define device pointers for the same arrays
    int *dev_a, *dev_b, *dev_c;

    // allocate space on the device (GPU)
    cudaMalloc((void **) &dev_a, N*sizeof(int));
    cudaMalloc((void **) &dev_b, N*sizeof(int));
    cudaMalloc((void **) &dev_c, N*sizeof(int));

    // copy array contents from the host (CPU) to the device (GPU)
    cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);

    // launch the GPU kernel for parallel addition of the input
    // vectors "dev_a" and "dev_b" and store the result in "dev_c"
    // launch N blocks and use 1 thread per block
    add_on_GPU<<<N, 1>>>(dev_a, dev_b, dev_c, N);

    // copy the answer back to the host (CPU) from the device (GPU)
    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);

    // print the answer to the screen
    for (int i = 0; i < N; i++) {
        std::cout << a[i] << " + " << b[i] << " = " << c[i] << std::endl;
    }

    // free device and host memory
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    delete[] a;
    delete[] b;
    delete[] c;

    return 0;
}
Several instances of the device code or CUDA kernel run simultaneously (in parallel) on the GPU. Each instance has a unique value for id and operates on a different piece of data.
In this example, each id value corresponds to the index of the elements in the vector or 1D array.
The kernel accesses data stored in "global memory" on the GPU. Data is copied to GPU memory from the host.
The result (values for the vector c) is also stored (in parallel) in GPU memory and is later copied back to the host.
The host code (the main program) runs on the CPU. It initializes the data for the input vectors (a and b) and allocates space to store the result (vector c) on both the host and the device.
The input arrays are then copied to the GPU memory.
Next, we launch the CUDA kernel using N threads (N blocks, 1 thread per block to be precise). This passes control to the GPU, which then executes the kernel to calculate the result.
After kernel execution is complete, the result (stored in dev_c, in GPU memory) is copied back to the host.
The code can be compiled using the Makefile below:
vecAdd.x:
nvcc vecAdd.cu -o vecAdd.x
Here is the result of running the code:
bash-3.2$ ./vecAdd.x
0 + 0 = 0
1 + 2 = 3
2 + 4 = 6
3 + 6 = 9
4 + 8 = 12
5 + 10 = 15
6 + 12 = 18
7 + 14 = 21
. . . .
91 + 182 = 273
92 + 184 = 276
93 + 186 = 279
94 + 188 = 282
95 + 190 = 285
96 + 192 = 288
97 + 194 = 291
98 + 196 = 294
99 + 198 = 297
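A natural next experiment (not part of vecAdd.cu as written above) is to use many threads per block instead of one thread per block. A common pattern, shown in the sketch below, is to compute a global index from the block index, the block size and the thread index; the names add_on_GPU_v2 and threadsPerBlock are just examples. To try it, replace the kernel in vecAdd.cu with:

// kernel variant: many threads per block
__global__ void add_on_GPU_v2(const int *a, const int *b, int *c, const int SIZE)
{
    // global index = (block index) * (threads per block) + (thread index within the block)
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // the last block may contain more threads than remaining elements
    if (id < SIZE)
    {
        c[id] = a[id] + b[id];
    }
}

and replace the launch line in main() with:

// enough blocks of 256 threads each to cover all N elements
const int threadsPerBlock = 256;
const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
add_on_GPU_v2<<< blocks, threadsPerBlock >>>(dev_a, dev_b, dev_c, N);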
As seen in the first few tutorials, the basic unit of parallel execution on the GPU is a thread. In tutorial 002, we looked at two examples of how threads can be organized in CUDA.
We first used a 1D collection of 32 threads inside a single block:
[Schematic: 32 threads, shown as a single row of dots, inside one block.]
We can use the full 3D notation for blocks and threads, using the dim3 type:
dim3 blocks(1, 1, 1);
dim3 threads(32, 1, 1);
kernel <<< blocks, threads >>> (kernel function parameters);
Inside a kernel launched with the above parameters, we can use the following CUDA expressions:
int thread_ID = threadIdx.x; // returns an integer in the range [0, 31]
int block_ID = blockIdx.x; // returns 0 (we only have 1 block and C counts from 0)
int block_size_x = blockDim.x; // returns 32 (number of threads in this block along x)
int block_size_y = blockDim.y; // returns 1 (number of threads in this block along y)
int block_size_z = blockDim.z; // returns 1 (number of threads in this block along z)
And here is a 2D collection of 8 x 8 (= 64) threads in a block:
[Schematic: an 8 x 8 arrangement of threads, shown as dots, inside a single block.]
This can be represented in CUDA using:
dim3 blocks(1, 1, 1);
dim3 threads(8, 8, 1);
We can also use a 3D arrangement of threads within a block by specifying a value greater than 1 for the third parameter.
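As a concrete illustration of the 8 x 8 arrangement above, the sketch below (the kernel name print2D is just an example) shows how a kernel can read its 2D position within the block and flatten it into a single index:

#include <cstdio>

__global__ void print2D(void)
{
    // 2D position of this thread within its block
    int tx = threadIdx.x; // 0 .. blockDim.x - 1
    int ty = threadIdx.y; // 0 .. blockDim.y - 1

    // flatten the 2D position into a single index, row by row
    int id = ty * blockDim.x + tx;
    printf("Thread (%d, %d) has flattened index %d\n", tx, ty, id);
}

int main(void)
{
    dim3 blocks(1, 1, 1);
    dim3 threads(8, 8, 1); // 8 x 8 = 64 threads in one block
    print2D<<< blocks, threads >>>();
    cudaDeviceSynchronize();
    return 0;
}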
CUDA blocks can be arranged in 1D, 2D or 3D fashion, just like threads. The entire set of blocks is referred to as a CUDA grid.
A grid with 4 x 4 x 1 blocks, using 2 x 2 x 1 threads in each block is shown below:
[Schematic: a 4 x 4 grid of blocks, each block containing 2 x 2 threads.]
The above grid can be specified using:
dim3 threads(2, 2, 1);
dim3 blocks(4, 4, 1);
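Inside a kernel launched with this grid, each thread can combine its block index and thread index to obtain a unique global (x, y) position. The sketch below (the kernel name printGlobalIndex is just an example) prints these coordinates, which range from (0, 0) to (7, 7) for a 4 x 4 grid of 2 x 2 blocks:

#include <cstdio>

__global__ void printGlobalIndex(void)
{
    // global coordinates of this thread across the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    printf("Block (%d, %d), thread (%d, %d) -> global position (%d, %d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, x, y);
}

int main(void)
{
    dim3 threads(2, 2, 1);
    dim3 blocks(4, 4, 1);
    printGlobalIndex<<< blocks, threads >>>();
    cudaDeviceSynchronize();
    return 0;
}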