CUDA 7 option: default-stream per-thread
The cuda/8.0 module has been available on the HPC system since October 2016.
This version of the CUDA toolkit includes features that improve HPC integration when CUDA C, OpenMP, and MPI are used together. One example is the option, introduced with CUDA 7, to give every host thread its own independent default stream, which avoids the serialization imposed by the legacy default stream. With this option OpenMP and CUDA interact more cleanly: GPU kernels and memory transfers issued from different OpenMP threads can run concurrently using only standard OpenMP pragmas.
The nvcc option is: --default-stream per-thread. With this option every OpenMP thread uses its own stream, even if the code only launches kernels and memory transfers on the default stream 0.
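The following is a minimal sketch of how this can look in practice, not a reference implementation. The file name omp_streams.cu, the kernel scale, the array size, and the thread count are all hypothetical choices made for illustration. Each OpenMP thread issues its copies and its kernel to stream 0; when the file is compiled with the option above, that handle refers to the calling thread's own default stream, so work from different threads may overlap on the GPU.

```cuda
// Assumed build line (GCC host compiler):
//   nvcc --default-stream per-thread -Xcompiler -fopenmp omp_streams.cu -o omp_streams
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical kernel: multiply every element of x by a.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 1 << 20;        // elements per thread (illustrative)
    const int nthreads = 4;       // number of OpenMP threads (illustrative)
    const size_t bytes = n * sizeof(float);

    #pragma omp parallel num_threads(nthreads)
    {
        float *h = nullptr, *d = nullptr;
        cudaMallocHost(&h, bytes);   // pinned host memory, needed for truly asynchronous copies
        cudaMalloc(&d, bytes);
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        // Everything below targets stream 0. With --default-stream per-thread
        // this is the calling thread's own default stream.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, 0);
        scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, 0);

        // Wait only for this thread's stream, not for the whole device.
        cudaStreamSynchronize(0);

        printf("thread %d: h[0] = %.1f\n", omp_get_thread_num(), h[0]);

        cudaFree(d);
        cudaFreeHost(h);
    }
    return 0;
}
```

Compiled without the option, the same source still runs, but all threads share the legacy default stream and their GPU work is serialized. Note that calls such as cudaMalloc and cudaFree can synchronize the device regardless of the option, so allocations are best kept out of the performance-critical part of the parallel region.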
More details and examples can be found here: GPU Pro Tip.