What happens when you run a CUDA kernel?

(fergusfinn.com)

134 points | by mezark 5 hours ago

5 comments

  • mschuetz 2 hours ago

    That was an interesting read. I also enjoyed reading about the semaphores in the default stream. It's great that CUDA implicitly handles syncing of commands for users and makes parallel commands opt-in via streams, unlike Vulkan, which offloads the full complexity of syncing to users right from the start.

    • fooblaster 4 hours ago

      The hardware has some open documentation. You don't actually need to read the kernel source to find some of the method documentation or QMD formats. See https://github.com/NVIDIA/open-gpu-doc/blob/master/classes/c...

      • orliesaurus 3 hours ago

        There are companies whose whole job right now is to optimize kernels so that things run faster. I wonder if those companies are going to be dethroned by some sort of open source library that can do that really well (I bet Nvidia could release it any day), or if they're going to thrive and be acquired by the big providers as a moat to speed up their inference.

        • spmurrayzzz 2 hours ago

          Near-term acquihires are certainly a likely bet I think. But given model progress on related benchmarks like kernelbench [1], I do think a set of more commoditized solutions is also inevitable.

          The caveat, though, is that each new generation of hardware often comes with brand new constraints/features that a given generation of models hasn't seen before (e.g. tcgen05 in Blackwell was OOD at one point). As the models start to generalize better, this might not be a showstopper, but it is still an issue, at least currently.

          [1] https://kernelbench.com/

          • connicpu 59 minutes ago

            When you run CUDA at scale, dealing with NVIDIA driver and library bugs takes up a disgustingly large percentage of engineer time. I don't know a lot of people who would be looking forward to relying on more NVIDIA libraries.

            • orliesaurus 10 minutes ago

              fair point, but are there alternatives that aren't CUDA locked?

            • einpoklum 49 minutes ago

              Probably not, because the specifics of the workload - exact parameters, representation of data in memory, value ranges etc - lead you to highly divergent optimization strategies.

              • orliesaurus 9 minutes ago

                Shouldn't it be possible to run this as a mlautoresearch project? I.e. orchestrate 10 strategies to speed it up, run them in parallel, pick the winner and go from there?
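
The "try several strategies, keep the winner" loop described above can be sketched in a few lines. This is only an illustration in plain Python: the names `benchmark` and `pick_fastest` are hypothetical, the candidates are ordinary functions standing in for kernel variants, and a real harness would time GPU kernels (ideally one candidate per device, since timing candidates side by side on shared hardware adds noise). Note that the winner is chosen for one specific workload, which is the point made in the parent comment about workload-specific optimization.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def benchmark(candidate, workload, repeats=5):
    """Return the best wall-clock time of candidate(workload) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        candidate(workload)
        best = min(best, time.perf_counter() - start)
    return best


def pick_fastest(candidates, workload, reference, max_workers=4):
    """Time every candidate that matches the reference result; return (winner, timings)."""
    expected = reference(workload)
    # A fast but wrong strategy must never win, so check correctness first.
    correct = {name: fn for name, fn in candidates.items()
               if fn(workload) == expected}
    if not correct:
        raise ValueError("no candidate reproduced the reference result")
    # Run the benchmarks concurrently (assumption: in a real setup each
    # candidate would get its own GPU so the timings do not interfere).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        times = pool.map(lambda fn: benchmark(fn, workload), correct.values())
        timings = dict(zip(correct, times))
    winner = min(timings, key=timings.get)
    return winner, timings


def manual_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total


# Hypothetical "strategies": two correct ones and one that is subtly wrong.
candidates = {
    "builtin_sum": sum,
    "manual_loop": manual_loop,
    "off_by_one": lambda xs: sum(xs) + 1,
}

workload = list(range(10_000))
winner, timings = pick_fastest(candidates, workload, reference=sum)
print(winner, sorted(timings))
```

The search itself is the easy part. The hard part is generating candidates that are both correct and meaningfully different, which is where the model has to do the work.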

            • kinow 1 hour ago

              I just finished a master's in HPC where I had to take some classes on CUDA, MPI+CUDA, and OpenCL. Reading an article like this before the classes would have been very helpful! Especially the part just before and after "What does it mean for a warp to be eligible?".

              • einpoklum 4 hours ago

                First - nice writeup which goes into a lot of nooks and crannies.

                That said, a lot of the user-space "voodoo" is gone if you don't go through CUDA's "runtime API". If you use the driver API, take your kernel source as a string and compile it with NVIDIA's run-time compiler, you'll have better visibility into a lot (not all) of what's going on. For the "raw" version of this, look at:

                https://github.com/NVIDIA/cuda-samples/tree/master/cpp/0_Int...

                but for a much more readable, and still fully transparent modern-C++ API version of the same, try this:

                https://github.com/eyalroz/cuda-api-wrappers/blob/master/exa...

                that's a sample program for my CUDA API wrappers (header-only) library.

                • mschuetz 3 hours ago

                  I like the driver API because it allows treating CUDA kernels like hot-reloadable shaders. It's fun to develop while being able to change the code at runtime.

                  • einpoklum 51 minutes ago

                    > I like the driver API because it allows treating CUDA kernels like hot-reloadable shaders.

                    It is also much more friendly for library authors; and easier to wrap; and actually exposes a bunch of features the "runtime API" doesn't.

                    The difficulty with it is that there are just so many API calls; dozens of calls just for copying, for example. That was part of my motivation for writing my wrappers: making the supposedly "lower-level" API more accessible and intuitive than the supposedly "higher-level" API, and better integrated with the other libraries: NVTX, NVRTC, PTX compiler, fatbin library etc.

                    > It's fun to develop while being able to change the code at runtime.

                    It's also _the_ way to debug your kernels: If you don't load them dynamically, you have to recompile your application or kernel test harness every time you make a change to the kernel.
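
The hot-reload workflow discussed in this subthread boils down to "recompile the kernel source when the file changes, and keep the old kernel if the new one fails to compile". Below is a minimal sketch of just that control loop in plain Python. The class name `HotReloader` and the `compile_fn` callable are hypothetical; in a real program `compile_fn` would presumably wrap run-time compilation (e.g. NVRTC's `nvrtcCompileProgram`) and driver-API module loading (e.g. `cuModuleLoadData` and `cuModuleGetFunction`), none of which is shown here.

```python
import hashlib
from pathlib import Path


class HotReloader:
    """Recompile a kernel source file only when its contents change."""

    def __init__(self, path, compile_fn):
        self.path = Path(path)
        # Hypothetical callable: takes the source string, returns a kernel handle.
        self.compile_fn = compile_fn
        self._digest = None   # hash of the last source we tried to compile
        self.kernel = None    # last kernel that compiled successfully

    def refresh(self):
        """Reload if the source changed; return True when a new kernel was loaded."""
        source = self.path.read_text()
        digest = hashlib.sha256(source.encode()).hexdigest()
        if digest == self._digest:
            return False  # unchanged since the last attempt
        # Remember this source even if it fails, so a broken file is not
        # recompiled on every call.
        self._digest = digest
        try:
            new_kernel = self.compile_fn(source)
        except Exception as err:
            print(f"compile failed, keeping previous kernel: {err}")
            return False
        self.kernel = new_kernel
        return True
```

An application would call `refresh()` once per frame or per iteration and launch whatever `kernel` currently holds, so a typo in the kernel source never takes down the running process.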