Questions about CUDA latency hiding mechanism and shared memory

I understand that to make a CUDA program efficient, we need to launch enough threads to hide the latency of expensive operations, such as global memory reads. For example, when a thread needs to...

When to call cudaDeviceSynchronize?

When is calling the cudaDeviceSynchronize function really needed? As far as I understand from the CUDA documentation, CUDA kernel launches are asynchronous, so it seems that we should call...

How to avoid bank conflicts when loading data from global to shared memory

A problem involves strided accesses to an unsigned char array stored in the global memory of a compute capability 1.3 GPU. In order to bypass the coalescing requirements of the global memory, the...

CUDA __syncthreads() compiles fine but is underlined with red

I have been working with CUDA 4.2 for a week now and I have a little problem. When I write the __syncthreads() function it becomes underlined and looks like it is wrong... Then if I put the mouse...

How can I flush GPU memory using CUDA (physical reset is unavailable)

My CUDA program crashed during execution, before memory was flushed. As a result, device memory remained occupied. I'm running on a GTX 580, for which nvidia-smi --gpu-reset is not...

CUDA function pointers

I was trying to make something like this (actually I need to write some integration functions) in CUDA: #include <iostream> using namespace std; float f1(float x) { return x * x; } float...

Does cudaFree after asynchronous call work?

I want to ask whether calling cudaFree after some asynchronous calls is valid? For example: int* dev_a; // prepare dev_a... // launch a kernel to process dev_a...

How do I get CUDA's printf to print to an arbitrary stream?

CUDA's printf() in kernels prints to the standard output stream of my process. Now I want to, at the least, redirect this printout to an arbitrary output stream, from here on. I do mean an...

Can anyone provide sample code demonstrating the use of 16-bit floating point in CUDA?

CUDA 7.5 supports 16-bit floating point variables. Can anyone provide sample code demonstrating their use?

CUDA ERROR: initialization error when using parallel in python

I use CUDA for my code, but it still runs slowly. Therefore I changed it to run in parallel using multiprocessing (pool.map) in Python. But I get CUDA ERROR: initialization error. This is the function: def...

Failed to create CUBLAS handle. Tensorflow interaction with OpenCV

I am trying to use a PlayStation Eye Camera for a deep reinforcement learning project. The network, TensorFlow installation (0.11) and CUDA (8.0) are functional because I have been able to train...

How can I reset the CUDA error to success with Driver API after a trap instruction?

I have a kernel which might call asm("trap;"). But when that happens, the CUDA error code is set to launch failure, and I cannot reset it. In the CUDA Runtime API, we can use...

MemoryError in TensorFlow; and "successful NUMA node read from SysFS had negative value (-1)" with xen

I am using TensorFlow version 0.12.1 and CUDA toolkit version 8. lrwxrwxrwx 1 root root 19 May 28 17:27 cuda -> /usr/local/cuda-8.0 As documented here I have downloaded and installed...

How can I reduce the number of branches/if statements in my kernels?

Is there a smart way to reduce the number of if statements inside a CUDA kernel? I am writing an application that will calculate a many-body Hamiltonian (a simulation of a quantum system)....

ValueError: Dimension (-1) must be in the range [0, 2) in Keras

Suddenly I have this error with Keras with the TensorFlow backend (Python 2.7), the same error with every piece of code. I thought it was a Keras 1 and 2 incompatibility, but it was not: Dimension (-1) must be in the...

__saturatef() intrinsic has no double-precision equivalent

CUDA supports intrinsic functions. Some map to device instructions, like fused multiply-adds, that cannot be expressed in normal syntax. Others are approximations that are supposed to be faster...

How to make Jupyter Notebook run on the GPU?

In Google Colab you can choose whether your notebook runs in a CPU or GPU environment. Now I have a laptop with an NVIDIA CUDA-compatible GPU (1050) and the latest Anaconda. How can I have a similar feature to the...

Conv2D for GPU is not currently supported without cudnn

I'm testing a TensorFlow program which used tf.nn.conv2d, but an error occurs as below: 2018-09-18 01:33:54.908161: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports...

How to use ffmpeg on hardware acceleration with multiple inputs?

I'm trying to speed up the rendering of a video by using the GPU instead of the CPU. This code works, but I don't know if I'm doing it correctly. ffmpeg -hwaccel cuvid -c:v hevc_cuvid \ -i...

How to avoid "CUDA out of memory" in PyTorch

I think it's a pretty common message for PyTorch users with low GPU memory: RuntimeError: CUDA out of memory. Tried to allocate 😊 MiB (GPU 😊; 😊 GiB total capacity; 😊 GiB already...

Cuda: XOR single bitset with array of bitsets

I want to XOR a single bitset with a bunch of other bitsets (~100k) and count the set bits of every xor-result. The size of a single bitset is around 20k bits. The bitsets are already converted to...

How to allocate OpenCV Mat/Image on CUDA pinned memory?

So I'm using OpenCV cv::Mat to read/write files. But since they allocate using normal pageable memory, transferring data to the GPU is slow. Is there any way to make OpenCV use pinned memory (cudaMallocHost...

Dlib ImportError in Windows 10 on line _dlib_pybind11 import *, DLL Load Failed

I am able to successfully install Dlib with CUDA support in Windows 10, but I am getting an error during "import dlib" in the Python code of my computer vision project. Environment: Windows 10, Python 3.7.6...

Unable to install tensorflow using conda with python 3.8

Recently, I upgraded to Anaconda3 2020.07, which uses Python 3.8. In past versions of Anaconda, TensorFlow installed successfully. TensorFlow failed to install successfully in this...

extract a continuous OpenCV cuda::GpuMat?

I'm trying to apply Thrust algorithms to the data in cuda::GpuMats. Unfortunately OpenCV basically never produces continuous GpuMats (which screws up virtually all my algorithms, code,...

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED using pytorch

I am trying to run a simple PyTorch sample code. It works fine using the CPU. But when using the GPU, I get this error message: Traceback (most recent call last): File "<stdin>", line 1, in <module> ...

CUDA_ERROR_NOT_INITIALIZED on A100 after server reset

I'm running on a server with an A100 GPU. When trying to run TensorFlow code after a server reset, TensorFlow does not recognize the GPU. Running tf.config.list_physical_devices('GPU') yields...

Qt Creator display in application output: NVD3DREL: GR-805 : DX9 Overlay is DISABLED

As I am working with my project, I noticed that when I run my app, inside the Application Output area I can see message: NVD3DREL: GR-805 : DX9 Overlay is DISABLED NVD3DREL: GR-805 : DX9 Overlay...

Add packages from requirements.txt to Docker image to minimize cold start time on EC2?

When deploying a machine learning model on EC2 from a Docker image, the cold start time is high because the instance downloads the packages and files from requirements.txt even though the...

Is CPU to GPU data transfer slow in TensorFlow?

I've tested CPU to GPU data transfer throughput with TensorFlow and it seems to be significantly lower than in PyTorch. For large tensors between 2x and 5x slower. In TF, I reach maximum speed for...