Background

nvprof is a profiling tool in the CUDA Toolkit for processes that use CUDA. With it, you can view “kernel execution, memory transfers, memory set and CUDA API calls and events or metrics for CUDA kernels” (source).

Normally, I would use the visual profiler (nvvp) to view the results in a timeline, but often I just want to collect a few specific metrics (e.g. the number of kernels executed, or the total running time of the session), and opening the output file in nvvp feels cumbersome for that.

Our Sample Program

We test this with the following CUDA program. It is a simple program that adds two arrays, doubling the array size on every run, so that we can look at the time CUDA takes to execute the kernel at each size.

#include <iostream>
#include <algorithm>
#include <chrono>
#include <cuda.h>
#include <cuda_profiler_api.h>

__global__
void add(int n, int *a, int *b, int *c) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

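void test(int N) {
    // Allocate arrays A, B, C each with size N.
    // Assumption: unified memory (cudaMallocManaged) is used here only to keep the sketch short.
    int *A, *B, *C;
    cudaMallocManaged(&A, N * sizeof(int));
    cudaMallocManaged(&B, N * sizeof(int));
    cudaMallocManaged(&C, N * sizeof(int));
    // Populate A and B with numbers
    std::fill(A, A + N, 1);
    std::fill(B, B + N, 2);
    add<<<1, 1>>>(N, A, B, C);
    cudaDeviceSynchronize();  // wait for the kernel to finish before freeing
    // Free arrays A, B, and C
    cudaFree(A); cudaFree(B); cudaFree(C);
}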

int main() {
    cudaProfilerStart();
    for (int i = 15; i <= 25; ++i) test(1 << i);
    cudaProfilerStop();
}

Profiling

After compiling our test program with nvcc, we can profile it with the nvprof command:

nvprof --export-profile profile.nvvp ./a.out

Opening the profile.nvvp file in nvvp, we see something like the following:

nvprof1.png

Note that, at the bottom, we can see the progressively longer calls to add().

Taking a Closer Look

Now, if we run binwalk on the profiler output, it gives us some clues about what the .nvvp file contains.

[/tmp]$ binwalk profile.nvvp

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             SQLite 3.x database,
352012        0x55F0C         Ubiquiti firmware header, third party, ~CRC32: 0x0, version: "MP^CREATE TABLE CUPTI_ACTIVITY_KIND_OPENMP(_id_ INTEGER PRIMARY KEY AUTOINCREMENT, eventKind INT NOT NULL, version INT NOT NULL,"
352052        0x55F34         Ubiquiti firmware header, third party, ~CRC32: 0x0, version: "MP(_id_ INTEGER PRIMARY KEY AUTOINCREMENT, eventKind INT NOT NULL, version INT NOT NULL, threadId INT NOT NULL, start INT NOT NU"
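
The first binwalk hit is easy to double-check by hand, since every SQLite 3 database file begins with the 16-byte string "SQLite format 3" followed by a NUL byte. Here is a minimal sketch in Python; the function name is_sqlite3 is my own, not part of any tool. The two "Ubiquiti firmware header" hits appear to be false positives triggered by the CREATE TABLE text stored inside the database.

```python
# 16-byte header found at offset 0 of every SQLite 3 database file.
SQLITE_MAGIC = b"SQLite format 3\x00"


def is_sqlite3(path):
    """Return True if the file at `path` starts with the SQLite 3 magic string.

    `is_sqlite3` is a hypothetical helper name used for illustration only.
    """
    with open(path, "rb") as f:
        return f.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC
```

Calling is_sqlite3("profile.nvvp") on the file exported above should return True if the file really is a SQLite database.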

Hmm, interesting. What would happen if we try to open this with a SQLite viewer? We get tables!

nvprof2.png
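
The same table list can be obtained without a GUI viewer. The sketch below uses Python's built-in sqlite3 module and the standard sqlite_master catalog; list_tables is a hypothetical helper name, and the path passed to it is assumed to be the profile.nvvp file from above.

```python
import sqlite3


def list_tables(path):
    """Return the names of all tables in the SQLite file at `path`, sorted.

    `list_tables` is a hypothetical helper name used for illustration only.
    """
    con = sqlite3.connect(path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]
```

For example, print("\n".join(list_tables("profile.nvvp"))) would print one table name per line.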

We can see that the executed kernels are in the table CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL. We can then write a query to view the duration of each kernel that ran:

SELECT _id_, start, end, (end-start) AS duration FROM CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL;

We get the following table. Here we can see that the duration roughly doubles from one kernel to the next, matching the doubling array size:

nvprof3.png
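
To check the doubling without reading the table by eye, the query can be run from a script. This is a rough sketch that relies only on the table and column names used in the query above; kernel_durations and growth_ratios are hypothetical helper names. The end column is double-quoted because END is also an SQL keyword.

```python
import sqlite3

# Same query as in the text, ordered by start time so rows follow launch order.
QUERY = """
SELECT _id_, start, "end", ("end" - start) AS duration
FROM CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL
ORDER BY start
"""


def kernel_durations(path):
    """Return a list of (_id_, duration) pairs, one per recorded kernel."""
    con = sqlite3.connect(path)
    try:
        return [(row[0], row[3]) for row in con.execute(QUERY)]
    finally:
        con.close()


def growth_ratios(durations):
    """Return each duration divided by the previous one.

    Values close to 2.0 mean the runtime doubles from one kernel to the next.
    """
    return [b / a for a, b in zip(durations, durations[1:])]
```

A typical use would be growth_ratios([d for _, d in kernel_durations("profile.nvvp")]), which for this sample program should give values near 2.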

Looking at the other tables, we can get similar information about memory copies, runtime API calls, driver API calls and so on.

I find this pretty useful since now I can programmatically extract kernel information from nvvp files. However, since nvprof might soon be deprecated, I wonder if I can do similar things with Nsight Compute. Perhaps I will take a look later.
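
As an example of that kind of extraction, the two metrics mentioned in the Background section (number of kernels and total running time) can be pulled with a single aggregate query. This is only a sketch: kernel_summary is a hypothetical helper name, the span between the first kernel start and the last kernel end is used as a stand-in for the session's running time, and all values are in whatever unit the start and end timestamps use (I believe nanoseconds, but that is an assumption).

```python
import sqlite3


def kernel_summary(path):
    """Return the kernel count, summed kernel time, and first-start-to-last-end span.

    `kernel_summary` is a hypothetical helper name. Times are in the same unit
    as the `start` and `end` columns (assumed, not verified, to be nanoseconds).
    """
    con = sqlite3.connect(path)
    try:
        count, busy, span = con.execute(
            'SELECT COUNT(*), SUM("end" - start), MAX("end") - MIN(start) '
            "FROM CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL"
        ).fetchone()
    finally:
        con.close()
    # SUM/MAX/MIN return NULL (None) on an empty table, so fall back to 0.
    return {"kernels": count, "busy_time": busy or 0, "span": span or 0}
```

For the sample program, kernel_summary("profile.nvvp")["kernels"] should be 11, one per call to test().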