CUDA学习笔记——入门例子

虽然配置CUDA的目的主要是支持PyTorch的机器学习，但是如果需要直接手写CUDA代码，并在Windows上使用nvcc顺利编译和执行纯CUDA程序，还需要Visual Studio相关工具的支持。（MINGW似乎不太行，或者说鼓捣起来太麻烦，Windows中的nvcc默认是配合MSVC进行编译的）鉴于我在电脑上安装了VS2022，并且CUDA自动给VS2022加了插件适配，因此尝试学一点CUDA，并写几个简单的HelloWorld程序。

基本概念

CUDA编程模型是一个异构模型，需要CPU和GPU协同工作，通常CPU负责处理逻辑复杂的串行程序，而GPU重点处理数据密集型的并行计算程序，从而发挥最大功效。在CUDA中，用host指代CPU及其内存，用device指代GPU及显存。CUDA程序中既包含host程序，又包含device程序，它们分别在CPU和GPU上运行。 host与device之间可以进行通信，在它们之间通信需要进行数据拷贝。一个典型的CUDA程序的计算流程如下：

分配host内存，并进行数据初始化；
分配device显存，并从host将数据拷贝到device上；
调用CUDA的核函数在device上完成指定的运算；
将device上的运算结果拷贝到host上；
释放device和host上分配的内存。

在CUDA编程中需要区分在host和device上执行的代码，对于函数需要通过特殊限定词来区分，主要的三个函数类型限定词如下：

__global__：函数在device执行，仅在host上调用，这种标记的函数又称为核函数（kernel）
__device__：函数在device执行，仅在device上调用
__host__：函数在host执行，仅在host上调用

注意：

标记为__host__的函数会被nvcc交给系统默认编译器进行编译，标记为__device__的函数则由nvcc负责处理，nvcc最后处理两者的交互部分。
__host__是不含任何限定词时的默认情形，此时调用和执行都在host进行，相当于常规的C语言函数，因此很少单独使用这个标记
__host__可以和__device__组合使用，此时代码将在host和device各自编译一份（还可以通过特殊的宏进行条件编译）

对于__global__和__device__，由于需要在device执行或调用，因此相比于普通的C语言函数由更多的要求：

不支持递归
不支持在函数体内声明静态变量
不支持可变长参数

最重要的概念是__global__标记的核函数，对核函数还有更多的要求：

__global__限定词不能与其它限定词组合使用
返回类型必须是void，因此获取返回值必须通过函数参数指针实现
不能作为类的成员函数
...

核函数的定义和调用示例如下：

核函数由host调用，但是由device上线程负责并行执行，一个最简单的核函数如下，这里不仅返回值是void，甚至没有函数参数，只是打印一个字符串。
1
2
3
4
__global__ void hello_world_kernel()
{
printf("Hello GPU!\n");
}
在调用核函数时，需要加上<<<grid, block>>>来指定执行核函数的线程数量，<<<grid, block>>> 表示启动一个grid个线程块，每个线程块中包含block个线程的CUDA网格，其中所有的线程一起执行核函数，示例如下，这里cudaDeviceReset()用于重置当前设备上的所有资源状态，建议在程序的最后调用这个函数，以确保释放所有由CUDA运行时分配的资源
1
2
3
4
5
int main()
{
hello_world_kernel <<<1,8>>>();
cudaDeviceReset();
}
每一个线程在各自执行核函数时，会分配一个唯一的线程号thread ID，这个ID值可以通过核函数的内置变量threadIdx来获得，从而可以让不同的线程根据拓扑结构或线程号相互合作，例如
1
2
3
4
5
__global__ void addKernel(int *c, const int *a, const int *b)
{
int i = threadIdx.x;
c[i] = a[i] + b[i];
}

注意：核函数都是异步执行，即当核函数在GPU中唤醒后，不论函数是否执行完毕，控制权都会立刻返回给host。

例子

例一

创建一个名为hello_world_gpu.cu的文件，写入如下内容，可以看出它在语法上基本上是C语言加上了一些定制扩展，因此nvcc需要MSVC的支持

#include <stdio.h>

__global__ void hello_world_kernel()
{
    printf("Hello GPU!\n");
}

int main()
{
    hello_world_kernel <<<1,8>>>();
    cudaDeviceReset();
}

编译执行CUDA程序需要特殊的指令，基于VS的GUI的操作略，对于命令行操作，在Developer PowerShell for Visual Studio 2022中执行如下语句

1	nvcc hello_world_gpu.cu -o hello_world_gpu.exe

可以编译得到 hello_world_gpu.exe，运行结果如下

Hello GPU!
Hello GPU!
Hello GPU!
Hello GPU!
Hello GPU!
Hello GPU!
Hello GPU!
Hello GPU!

这里编译必须在Visual Studio配置好的命令行环境中，否则会报错找不到cl.exe等，但是编译之后的程序运行并不需要特殊的命令行环境。

除了直接使用Visual Studio，还可以通过cmake进行构建编译（但是仍然需要选择MSVC进行编译），参考脚本如下

cmake_minimum_required(VERSION 3.18)
project(CudaExample LANGUAGES CXX CUDA)

# 设置C++标准
set(CMAKE_CXX_STANDARD 14)

# 添加可执行文件
add_executable(cuda_example kernel.cu)

# 链接CUDA库
target_link_libraries(cuda_example PRIVATE ${CUDA_LIBRARIES})

例二

第一个程序过于简单，没有返回值，下面的例子是VS2022中自带的项目，只有一个kernel.cu文件，完整代码如下

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for three vectors (two input, one output)    .
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);

    return cudaStatus;
}

下面来消化理解一下这个例子。

首先关注核函数的部分，这个程序的核心功能就是让GPU来计算一个向量加法，把每一个元素的加法分配给一个线程来执行，这里的 c 是返回值向量，a 和 b 是输入向量。

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

然后关注main函数部分，首先定义了向量a和向量b，以及存储结果的向量c，然后调用addWithCuda函数执行，输出计算结果c，最后释放相应资源。

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

剩下的部分是addWithCuda函数，它就是在给GPU分配这个向量加法任务并处理调用GPU计算前后的细节：

cudaMalloc申请GPU显存，大小对应a，b，c的尺寸
cudaMemcpy将a,b,c从CPU内存拷贝到GPU显存中，得到副本dev_a,dev_b,dev_c
执行核函数 addKernel<<<1, size>>>(dev_c, dev_a, dev_b);
将计算结果的c从GPU显存拷贝到CPU内存中，并返回

这是CUDA官方提供的例子，代码的健壮性很高，因此包含了完善的错误处理，主要内容还是比较简单清晰的。

这里需要注意的是：CPU的内存和GPU的内存（显存）是隔离的，对应的地址空间是完全不同的。一个指针地址如果指向的是内存中的地址空间，在CPU中可以正常访问，如果直接传递给GPU访问，那么GPU访问的是显存中对应的地址空间，这是完全不同的两个地址！