.. _libc_gpu_usage: =================== Using libc for GPUs =================== .. contents:: Table of Contents :depth: 4 :local: Using the GPU C library ======================= Once you have finished :ref:`building` the GPU C library it can be used to run libc or libm functions directly on the GPU. Currently, not all C standard functions are supported on the GPU. Consult the :ref:`list of supported functions` for a comprehensive list. The GPU C library supports two main usage modes. The first is as a supplementary library for offloading languages such as OpenMP, CUDA, or HIP. These aim to provide standard system utilities similarly to existing vendor libraries. The second method treats the GPU as a hosted target by compiling C or C++ for it directly. This is more similar to targeting OpenCL and is primarily used for exported functions on the GPU and testing. Offloading usage ---------------- Offloading languages like CUDA, HIP, or OpenMP work by compiling a single source file for both the host target and a list of offloading devices. In order to support standard compilation flows, the ``clang`` driver uses fat binaries, described in the `clang documentation `_. This linking mode is used by the OpenMP toolchain, but is currently opt-in for the CUDA and HIP toolchains through the ``--offload-new-driver``` and ``-fgpu-rdc`` flags. The installation should contain a static library called ``libcgpu-amdgpu.a`` or ``libcgpu-nvptx.a`` depending on which GPU architectures your build targeted. These contain fat binaries compatible with the offloading toolchain such that they can be used directly. .. code-block:: sh $> clang opnemp.c -fopenmp --offload-arch=gfx90a -lcgpu-amdgpu $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -lcgpu-nvptx $> clang hip.hip --offload-arch=gfx940 --offload-new-driver -fgpu-rdc -lcgpu-amdgpu This will automatically link in the needed function definitions if they were required by the user's application. Normally using the ``-fgpu-rdc`` option results in sub-par performance due to ABA linking. However, the offloading toolchain supports the ``--foffload-lto`` option to support LTO on the target device. Offloading languages require that functions present on the device be declared as such. This is done with the ``__device__`` keyword in CUDA and HIP or the ``declare target`` pragma in OpenMP. This requires that the LLVM C library exposes its implemented functions to the compiler when it is used to build. We support this by providing wrapper headers in the compiler's resource directory. These are located in ``/include/llvm-libc-wrappers`` in your installation. The support for HIP and CUDA is more experimental, requiring manual intervention to link and use the facilities. An example of this is shown in the :ref:`CUDA server example`. The OpenMP Offloading toolchain is completely integrated with the LLVM C library however. It will automatically handle including the necessary libraries, define device-side interfaces, and run the RPC server. OpenMP Offloading example ^^^^^^^^^^^^^^^^^^^^^^^^^ This section provides a simple example of compiling an OpenMP program with the GPU C library. .. code-block:: c++ #include int main() { FILE *file = stderr; #pragma omp target teams num_teams(2) thread_limit(2) #pragma omp parallel num_threads(2) { fputs("Hello from OpenMP!\n", file); } } This can simply be compiled like any other OpenMP application to print from two threads and two blocks. .. code-block:: sh $> clang openmp.c -fopenmp --offload-arch=gfx90a $> ./a.out Hello from OpenMP! Hello from OpenMP! Hello from OpenMP! Hello from OpenMP! Including the wrapper headers, linking the C library, and running the :ref:`RPC server` are all handled automatically by the compiler and runtime. Binary format ^^^^^^^^^^^^^ The ``libcgpu.a`` static archive is a fat-binary containing LLVM-IR for each supported target device. The supported architectures can be seen using LLVM's ``llvm-objdump`` with the ``--offloading`` flag: .. code-block:: sh $> llvm-objdump --offloading libcgpu-amdgpu.a libcgpu-amdgpu.a(strcmp.cpp.o): file format elf64-x86-64 OFFLOADING IMAGE [0]: kind llvm ir arch generic triple amdgcn-amd-amdhsa producer none ... Because the device code is stored inside a fat binary, it can be difficult to inspect the resulting code. This can be done using the following utilities: .. code-block:: sh $> llvm-ar x libcgpu.a strcmp.cpp.o $> clang-offload-packager strcmp.cpp.o --image=arch=generic,file=strcmp.bc $> opt -S out.bc ... Please note that this fat binary format is provided for compatibility with existing offloading toolchains. The implementation in ``libc`` does not depend on any existing offloading languages and is completely freestanding. Direct compilation ------------------ Instead of using standard offloading languages, we can also target the CPU directly using C and C++ to create a GPU executable similarly to OpenCL. This is done by targeting the GPU architecture using `clang's cross compilation support `_. This is the method that the GPU C library uses both to build the library and to run tests. This allows us to easily define GPU specific libraries and programs that fit well into existing tools. In order to target the GPU effectively we rely heavily on the compiler's intrinsic and built-in functions. For example, the following function gets the thread identifier in the 'x' dimension on both GPUs supported GPUs. .. code-block:: c++ uint32_t get_thread_id_x() { #if defined(__AMDGPU__) return __builtin_amdgcn_workitem_id_x(); #elif defined(__NVPTX__) return __nvvm_read_ptx_sreg_tid_x(); #else #error "Unsupported platform" #endif } We can then compile this for both NVPTX and AMDGPU into LLVM-IR using the following commands. This will yield valid LLVM-IR for the given target just like if we were using CUDA, OpenCL, or OpenMP. .. code-block:: sh $> clang id.c --target=amdgcn-amd-amdhsa -mcpu=native -nogpulib -flto -c $> clang id.c --target=nvptx64-nvidia-cuda -march=native -nogpulib -flto -c We can also use this support to treat the GPU as a hosted environment by providing a C library and startup object just like a standard C library running on the host machine. Then, in order to execute these programs, we provide a loader utility to launch the executable on the GPU similar to a cross-compiling emulator. This is how we run :ref:`unit tests ` targeting the GPU. This is clearly not the most efficient way to use a GPU, but it provides a simple method to test execution on a GPU for debugging or development. Building for AMDGPU targets ^^^^^^^^^^^^^^^^^^^^^^^^^^^ The AMDGPU target supports several features natively by virtue of using ``lld`` as its linker. The installation will include the ``include/amdgcn-amd-amdhsa`` and ``lib/amdgcn-amd-amdha`` directories that contain the necessary code to use the library. We can directly link against ``libc.a`` and use LTO to generate the final executable. .. code-block:: c++ #include int main() { fputs("Hello from AMDGPU!\n", stdout); } This program can then be compiled using the ``clang`` compiler. Note that ``-flto`` and ``-mcpu=`` should be defined. This is because the GPU sub-architectures do not have strict backwards compatibility. Use ``-mcpu=help`` for accepted arguments or ``-mcpu=native`` to target the system's installed GPUs if present. Additionally, the AMDGPU target always uses ``-flto`` because we currently do not fully support ELF linking in ``lld``. Once built, we use the ``amdhsa-loader`` utility to launch execution on the GPU. This will be built if the ``hsa_runtime64`` library was found during build time. .. code-block:: sh $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc /lib/amdgcn-amd-amdhsa/crt1.o $> amdhsa-loader --threads 2 --blocks 2 a.out Hello from AMDGPU! Hello from AMDGPU! Hello from AMDGPU! Hello from AMDGPU! This will include the ``stdio.h`` header, which is found in the ``include/amdgcn-amd-amdhsa`` directory. We define out ``main`` function like a standard application. The startup utility in ``lib/amdgcn-amd-amdhsa/crt1.o`` will handle the necessary steps to execute the ``main`` function along with global initializers and command line arguments. Finally, we link in the ``libc.a`` library stored in ``lib/amdgcn-amd-amdhsa`` to define the standard C functions. The search paths for the include directories and libraries are automatically handled by the compiler. We use this support internally to run unit tests on the GPU directly. See :ref:`libc_gpu_testing` for more information. The installation also provides ``libc.bc`` which is a single LLVM-IR bitcode blob that can be used instead of the static library. Building for NVPTX targets ^^^^^^^^^^^^^^^^^^^^^^^^^^ The infrastructure is the same as the AMDGPU example. However, the NVPTX binary utilities are very limited and must be targeted directly. There is no linker support for static libraries so we need to link in the ``libc.bc`` bitcode and inform the compiler driver of the file's contents. .. code-block:: c++ #include int main(int argc, char **argv, char **envp) { fputs("Hello from NVPTX!\n", stdout); } Additionally, the NVPTX ABI requires that every function signature matches. This requires us to pass the full prototype from ``main``. The installation will contain the ``nvptx-loader`` utility if the CUDA driver was found during compilation. .. code-block:: sh $> clang hello.c --target=nvptx64-nvidia-cuda -march=native \ -x ir /lib/nvptx64-nvidia-cuda/libc.bc \ -x ir /lib/nvptx64-nvidia-cuda/crt1.o $> nvptx-loader --threads 2 --blocks 2 a.out Hello from NVPTX! Hello from NVPTX! Hello from NVPTX! Hello from NVPTX!