Initial CUDA Performance Surprises
I am somehow very late to learning CUDA. I didn’t even know until recently that CUDA is just C++ with a small amount of extra stuff. If I had known that there is so little friction to learning it, I would have checked it out much earlier. But if you come in with C++ habits, you’ll write suboptimal code, so here are some lessons I had to learn to get things to run fast.
Memory Coalescing
If you have multiple threads operating on an array in C++, you probably want to iterate like this:
std::vector<T> vec = ...;
size_t per_thread = vec.size() / num_threads;
T * my_slice = vec.data() + per_thread * my_thread_i;
for (size_t i = 0; i < per_thread; ++i) {
    do_something(my_slice[i]);
}
That is, each thread iterates over its own contiguous chunk of memory. In CUDA this is going to be slow, because you want neighboring threads to load neighboring memory at the same time. So if thread 0 loads bytes 0 to 15, you want thread 1 to load bytes 16 to 31, thread 2 to load bytes 32 to 47, and so on. The loop instead has to look like this:
T * data = ...;
size_t num_elements = ...;
// size_t index: avoids a signed/unsigned comparison and overflow on large arrays
for (size_t i = my_thread_i; i < num_elements; i += num_threads) {
    do_something(data[i]);
}
This is called “memory coalescing”: adjacent threads use adjacent memory, so the hardware can combine their loads into fewer memory transactions. On a loop with a small body (a dot product) this is 3x faster.
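To make the difference between the two loops concrete, here is a small host-side C++ sketch (no GPU needed) that lists which array indices a group of threads touches on a given loop iteration. The function names chunked_indices, strided_indices and touched_range are made up for this illustration, and it only models the index arithmetic, not real hardware behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Indices touched on loop iteration `step` when each thread owns a
// contiguous chunk (the CPU-style loop). Hypothetical helper.
std::vector<std::size_t> chunked_indices(std::size_t num_elements,
                                         std::size_t num_threads,
                                         std::size_t step) {
    assert(num_threads > 0);
    std::size_t per_thread = num_elements / num_threads;
    std::vector<std::size_t> out;
    for (std::size_t t = 0; t < num_threads; ++t) {
        out.push_back(per_thread * t + step);  // my_slice[step] for thread t
    }
    return out;
}

// Indices touched on loop iteration `step` by the strided loop
// (i = my_thread_i; i += num_threads). Hypothetical helper.
std::vector<std::size_t> strided_indices(std::size_t num_threads,
                                         std::size_t step) {
    std::vector<std::size_t> out;
    for (std::size_t t = 0; t < num_threads; ++t) {
        out.push_back(t + step * num_threads);  // data[i] for thread t
    }
    return out;
}

// Number of elements between the lowest and highest index touched,
// inclusive. A small value means the accesses sit close together.
std::size_t touched_range(const std::vector<std::size_t>& idx) {
    assert(!idx.empty());
    auto mm = std::minmax_element(idx.begin(), idx.end());
    return *mm.second - *mm.first + 1;
}
```

With 1024 elements and 32 threads, the chunked loop touches indices 0, 32, 64, ... on its first iteration, spread over 993 elements, while the strided loop touches indices 0 through 31, a range of exactly 32 elements. The second case is the pattern that can be coalesced.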