nvprof is a profiling tool in the CUDA Toolkit for processes that use CUDA. With it, you can view “kernel execution, memory transfers, memory set and CUDA API calls and events or metrics for CUDA kernels” (source).

Normally, I would use the visual profiler (nvvp) to view the results in a timeline, but often I just want to collect a few specific metrics (e.g. the number of kernels executed, or the total running time of the session), and opening the output file in nvvp feels cumbersome for that.

Our Sample Program

We test this with the following CUDA program. It’s a simple program that adds two arrays, where the size of the arrays doubles every time, and prints the time CUDA takes to execute the kernel.

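#include <iostream>
#include <vector>
#include <chrono>
#include <cuda.h>
#include <cuda_profiler_api.h>

// Deliberately serial kernel: a single thread adds all n elements.
__global__ void add(int n, int *a, int *b, int *c) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

void test(int N) {
    // Allocate arrays A, B, C each with size N (cudaMalloc is an assumption; the original elides this step)
    size_t bytes = N * sizeof(int);
    int *A, *B, *C;
    cudaMalloc(&A, bytes); cudaMalloc(&B, bytes); cudaMalloc(&C, bytes);
    // Populate A and B with numbers
    std::vector<int> hostA(N, 1), hostB(N, 2);
    cudaMemcpy(A, hostA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(B, hostB.data(), bytes, cudaMemcpyHostToDevice);

    // Time the kernel; launches are asynchronous, so synchronize before stopping the clock
    auto t0 = std::chrono::steady_clock::now();
    add<<<1, 1>>>(N, A, B, C);
    cudaDeviceSynchronize();
    std::chrono::duration<double, std::milli> ms = std::chrono::steady_clock::now() - t0;
    std::cout << N << " elements: " << ms.count() << " ms" << std::endl;

    // Free arrays A, B, and C
    cudaFree(A); cudaFree(B); cudaFree(C);
}

int main() {
    for (int i = 15; i <= 25; ++i) test(1 << i);
    cudaProfilerStop();  // flush profile data before the process exits
}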

After compiling our test program with nvcc, we can profile it with the nvprof command:

nvprof --export-profile profile.nvvp ./a.out

Opening the profile.nvvp file in nvvp, we would see something like the following:


Note that, at the bottom, we can see the progressively longer calls to add().

Taking a Closer Look

Now, if we run binwalk on the profiler output, it gives us some clues about what the .nvvp file contains.

[/tmp]$ binwalk profile.nvvp

0             0x0             SQLite 3.x database,
352012        0x55F0C         Ubiquiti firmware header, third party, ~CRC32: 0x0, version: "MP^CREATE TABLE CUPTI_ACTIVITY_KIND_OPENMP(_id_ INTEGER PRIMARY KEY AUTOINCREMENT, eventKind INT NOT NULL, version INT NOT NULL,"
352052        0x55F34         Ubiquiti firmware header, third party, ~CRC32: 0x0, version: "MP(_id_ INTEGER PRIMARY KEY AUTOINCREMENT, eventKind INT NOT NULL, version INT NOT NULL, threadId INT NOT NULL, start INT NOT NU"

Hmm, interesting. What would happen if we try to open this with a SQLite viewer? We get tables!
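The same check works without a GUI viewer. Below is a minimal sketch using Python's standard sqlite3 module; the helper name list_tables is my own, and it only relies on the fact that every SQLite file records its tables in sqlite_master.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in an open SQLite connection, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    # Each row is a 1-tuple holding the table name.
    return [name for (name,) in rows]
```

To try it on the profiler output, open the file with sqlite3.connect("profile.nvvp") and pass the connection to list_tables.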


We can see that the executed kernels are in the table CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL. We can then write a query to view the duration of each kernel that ran:

SELECT _id_, start, end, (end-start) AS duration FROM CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL;
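The query can also be run from a script instead of a viewer. Here is a rough sketch with Python's sqlite3 module; the function name kernel_durations is hypothetical, and I quote "start" and "end" only as a precaution because END is an SQL keyword. As far as I know the timestamps are in nanoseconds, but treat that as an assumption.

```python
import sqlite3

KERNEL_TABLE = "CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL"

def kernel_durations(conn):
    """Return (_id_, start, end, duration) tuples, one per kernel launch."""
    query = (
        'SELECT _id_, "start", "end", ("end" - "start") AS duration '
        f"FROM {KERNEL_TABLE} ORDER BY _id_"
    )
    return conn.execute(query).fetchall()
```

Calling kernel_durations(sqlite3.connect("profile.nvvp")) should return one tuple per add() launch in our test program.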

We get the following table (here we can see that the duration roughly doubles from one kernel to the next):


Looking at the other tables, we can get similar information about memory copies, runtime API calls, driver API calls and so on.
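To get a quick overview of those other tables, something like the following sketch could help. It makes one assumption that I have only confirmed for the kernel table: that activity tables store their timestamps in columns named start and end. Tables without both columns are skipped, and summarize_timed_tables is a name I made up.

```python
import sqlite3

def summarize_timed_tables(conn):
    """Map each non-empty table that has start and end columns to (row_count, total_duration)."""
    names = [
        name for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    summary = {}
    for name in names:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = {row[1] for row in conn.execute(f'PRAGMA table_info("{name}")')}
        if {"start", "end"} <= columns:
            count, total = conn.execute(
                f'SELECT COUNT(*), COALESCE(SUM("end" - "start"), 0) FROM "{name}"'
            ).fetchone()
            if count:  # leave out tables with no recorded activity
                summary[name] = (count, total)
    return summary
```

The result is a dictionary keyed by table name, so the number of kernels executed is the first element of the entry for CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL.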

I find this pretty useful since now I can programmatically extract kernel information from nvvp files. However, since nvprof might soon be deprecated, I wonder if I can do similar things with Nsight Compute. Perhaps I will take a look later.