This guide provides a detailed discussion of the CUDA programming model and programming interface. It then describes the hardware implementation and provides guidance on how to achieve maximum performance. The appendixes include a list of all CUDA-enabled devices, a detailed description of all extensions to the C language, listings of supported mathematical functions, an overview of the C++ features supported in host and device code, details on texture fetching, and technical specifications of various devices; they conclude by introducing the low-level driver API.
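For example, the programming model the guide opens with can be summarized by the canonical vector-addition kernel below (the names vecAdd, d_a, d_b, d_c, and n are illustrative, not taken from the guide):

    // Illustrative sketch: a kernel runs n threads in parallel, one per element.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard against overrun
            c[i] = a[i] + b[i];
    }

    // Host side: launch enough blocks of 256 threads to cover n elements.
    // vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);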
This guide presents established parallelization and optimization techniques and explains coding metaphors and idioms that
can greatly simplify programming for CUDA-capable GPU architectures. The intent is to provide guidelines for obtaining the
best performance from NVIDIA GPUs using the CUDA Toolkit.
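One such idiom, shown here as an illustrative sketch rather than an excerpt from the guide, is the grid-stride loop, which decouples the launch configuration from the problem size:

    // Grid-stride loop: each thread strides over the array by the total
    // number of threads in the grid, so the launch size need not match n.
    __global__ void scale(float *data, float factor, int n)
    {
        int stride = blockDim.x * gridDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            data[i] *= factor;
    }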
This application note is intended to help developers ensure that their NVIDIA CUDA applications run effectively on GPUs based on the NVIDIA Kepler architecture, and it provides guidance for keeping software compatible with Kepler.
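In practice, much of this comes down to how the application is built. For example, the following assumed nvcc invocation produces a fat binary containing Fermi code, Kepler code, and PTX for forward compatibility:

    nvcc mykernel.cu -o myapp \
         -gencode arch=compute_20,code=sm_20 \
         -gencode arch=compute_35,code=sm_35 \
         -gencode arch=compute_35,code=compute_35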
Kepler is NVIDIA's next-generation architecture for CUDA compute applications. Applications that follow the best practices
for the Fermi architecture should typically see speedups on the Kepler architecture without any code changes. This guide summarizes
the ways that an application can be fine-tuned to gain additional speedups by leveraging Kepler architectural features.
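As one illustration (an assumed sketch, not an excerpt from the guide), Kepler's warp shuffle instructions let threads in a warp exchange values without a round trip through shared memory:

    // Warp-level sum reduction using Kepler's __shfl_down (sm_30 and later).
    __device__ float warpReduceSum(float val)
    {
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down(val, offset);  // read value from lane (laneId + offset)
        return val;  // lane 0 ends up holding the warp's sum
    }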
This document provides guidance on how to design and develop software that takes advantage of the new Dynamic Parallelism
capabilities introduced with CUDA 5.0.
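A minimal sketch of the idea, using illustrative kernel names: with dynamic parallelism, device code can launch further kernels directly:

    __global__ void childKernel(int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    // Parent kernel launches a child grid from device code (requires sm_35
    // and compilation with -rdc=true, linking against cudadevrt).
    __global__ void parentKernel(int *data, int n)
    {
        if (threadIdx.x == 0)
            childKernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();  // device-side wait for the child grid
    }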
This guide provides detailed instructions on the use of PTX, a low-level parallel thread execution virtual machine and instruction
set architecture (ISA). PTX exposes the GPU as a data-parallel computing device.
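To give a flavor of the instruction set, the following hand-written fragment (not taken from the guide) uses inline PTX from CUDA C to emit a single add instruction:

    // Inline PTX: emit one add.s32 instruction from CUDA C code.
    __device__ int addViaPtx(int a, int b)
    {
        int result;
        asm("add.s32 %0, %1, %2;" : "=r"(result) : "r"(a), "r"(b));
        return result;
    }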
The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime. It allows the user to access the computational resources of an NVIDIA graphics processing unit (GPU), but it does not automatically parallelize across multiple GPUs.
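A minimal sketch of the calling pattern, using the v2 API header and assumed buffer names (error checking omitted for brevity):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Compute y = alpha*x + y on the GPU with CUBLAS SAXPY.
    void saxpy_gpu(int n, float alpha, const float *h_x, float *h_y)
    {
        float *d_x, *d_y;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = alpha*x + y
        cublasDestroy(handle);

        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x);
        cudaFree(d_y);
    }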
This document contains a complete listing of the code samples that are included with the NVIDIA CUDA Toolkit. It describes
each code sample, lists the minimum GPU specification, and provides links to the source code and white papers if available.
This document provides instructions for installing the NVIDIA CUDA Toolkit and the NVIDIA CUDA Samples, as well as guidelines for creating your own CUDA projects. It includes chapters on known issues as well as an FAQ, and it covers Windows, Linux, and Mac OS.
This document describes a set of samples that can be run as an introduction to CUDA. Most of these samples use the CUDA runtime API; those that use the CUDA driver API instead are explicitly noted.
This document is a reference guide on the use of the CUDA compiler driver nvcc. Rather than exposing a CUDA-specific command-line interface, nvcc mimics the behavior of the GNU compiler gcc, accepting a range of conventional compiler options, such as those for defining macros, specifying include and library paths, and steering the compilation process.
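For example, the following assumed invocation uses only standard nvcc options, each with a familiar gcc-style spelling:

    nvcc -DUSE_FAST_MATH -Iinclude -Llib -lcublas -O2 -o myapp main.cu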
CUDA-GDB is the NVIDIA tool for debugging CUDA applications on Linux and Mac, providing developers with a mechanism for debugging code as it runs on actual hardware. It is an extension to the x86-64 port of GDB, the GNU Project debugger.
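A typical session, with an assumed application and kernel name, looks much like ordinary GDB:

    cuda-gdb ./myapp
    (cuda-gdb) break myKernel      # set a breakpoint in device code
    (cuda-gdb) run
    (cuda-gdb) info cuda threads   # inspect device threads at the stop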
CUDA-MEMCHECK is a suite of runtime tools capable of precisely detecting out-of-bounds and misaligned memory access errors, checking for device memory allocation leaks, reporting hardware errors, and identifying shared-memory data access hazards.
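The individual tools are selected on the command line; for example, with an assumed application name:

    cuda-memcheck ./myapp                    # default memcheck tool
    cuda-memcheck --tool racecheck ./myapp   # shared-memory hazard detection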
RDMA for GPUDirect is a technology available on Kepler-class GPUs with CUDA 5.0 that enables a direct path for communication between the GPU and a peer device on the PCI Express bus, using standard PCI Express features, when the devices share the same upstream root complex. This document introduces the technology and describes the steps necessary to enable an RDMA for GPUDirect connection to NVIDIA GPUs within the Linux device driver model.
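At the driver level this centers on pinning GPU memory so the peer device can address it. The following is a rough sketch of that step in a Linux kernel module, assuming the nv-p2p.h interface the document describes; the helper names and omitted error handling are illustrative:

    #include <nv-p2p.h>

    // Sketch only: pin a range of GPU virtual memory so a peer PCIe device
    // can DMA into it. The tokens are obtained by the user-space side via
    // cuPointerGetAttribute() and passed down to the kernel module.
    static struct nvidia_p2p_page_table *page_table;

    static void free_cb(void *data)
    {
        // Called by the NVIDIA driver if the mapping is revoked.
        nvidia_p2p_free_page_table(page_table);
    }

    static int pin_gpu_memory(uint64_t token, uint32_t va_space,
                              uint64_t va, uint64_t len)
    {
        return nvidia_p2p_get_pages(token, va_space, va, len,
                                    &page_table, free_cb, NULL);
    }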