NVIDIA Advanced CUDA Programming Course Plan

1. From GPU to GPGPU
- Performance and parallelism
- GPU evolution
- Parallel systems: multicore and clustering
2. CUDA programming model
- Key principles
- Threads and blocks
- Language extensions
- Attributes
- Built-in types and variables
- Kernel invocation operator
- CUDA runtime API
- Asynchronous execution
- Handling runtime errors in CUDA
- Querying GPU capabilities
3. Memory hierarchy
- Global memory
- Example: matrix multiplication
- Optimizing global memory usage
- Block-shared memory
- Example: matrix multiplication
- Shared memory access patterns
- Constant memory
- Texture memory
- Unified virtual address space (UVA)
4. Implementing basic data processing
- Parallel reduction
- Prefix sum (scan)
- CUDA implementation
- CUDPP implementation
5. CUDA Libraries
- CUBLAS
- CUSPARSE
- CUFFT
- CURAND
6. CUDA Fortran Overview
7. Using multiple GPUs
- CUDA context
- fork
- MPI
- POSIX threads
- OpenMP
- Boost.Thread
8. CUDA Streams
- Example: concurrent kernel execution
- Example: matrix multiplication
- Example: multi-GPU asynchronous copy
9. Debugging
- Principles and terminology
- gdb
- cuda-gdb
- Nsight
- CUDA (Visual) Profiler
- cuda-memcheck
10. OpenCL Overview
- Simple example
- OpenCL host API
- Developing and deploying OpenCL kernels
- Comparison with CUDA
11. Optimization Techniques
Hands-on exercises
- Parallel sine function computation
- Matrix-matrix multiply with shared memory