Calling a __global__ function inside a CUDA kernel
I'm trying to write a CUDA kernel that performs a matrix
multiplication, like this:
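// Device function: callable from kernels, needs an explicit return type.
__device__ void Matrix_Multi(Matrix A, Matrix B, Matrix C);
__global__ void foo(type para){
....
// Pass the arguments by name only; types do not belong at the call site.
Matrix_Multi(A, B, C);
....
}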
I want to accelerate the matrix multiplication. I see two options:
first, use the cuBLAS library; second, write a separate kernel for the
matrix multiplication and call it inside foo().
Both attempts failed. Can anyone help?
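For reference, here is a minimal sketch of one way the structure above could work: keep Matrix_Multi as a __device__ function and let each thread of foo compute one element of C. The Matrix layout (row-major floats in device memory), the extra row/col parameters, and the host helper run_foo_example are assumptions made for illustration; none of them come from the original post. Error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical layout (an assumption): row-major floats in device memory.
struct Matrix {
    int rows;
    int cols;
    float *data;
};

// Device helper: computes the single element C(row, col) of C = A * B.
// Every thread that calls it handles exactly one output element.
__device__ void Matrix_Multi(Matrix A, Matrix B, Matrix C, int row, int col) {
    float sum = 0.0f;
    for (int k = 0; k < A.cols; ++k) {
        sum += A.data[row * A.cols + k] * B.data[k * B.cols + col];
    }
    C.data[row * C.cols + col] = sum;
}

// Kernel: maps one thread to one element of C and calls the device helper.
__global__ void foo(Matrix A, Matrix B, Matrix C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < C.rows && col < C.cols) {
        Matrix_Multi(A, B, C, row, col);
    }
}

// Illustrative host helper: multiplies two n x n matrices filled with the
// constants a and b, and returns the first element of the product.
float run_foo_example(int n, float a, float b) {
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    std::vector<float> hA(n * n, a), hB(n * n, b), hC(n * n, 0.0f);

    Matrix A{n, n, nullptr}, B{n, n, nullptr}, C{n, n, nullptr};
    cudaMalloc(&A.data, bytes);
    cudaMalloc(&B.data, bytes);
    cudaMalloc(&C.data, bytes);
    cudaMemcpy(A.data, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(B.data, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    foo<<<grid, block>>>(A, B, C);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hC.data(), C.data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(A.data);
    cudaFree(B.data);
    cudaFree(C.data);
    return hC[0];
}
```

Calling run_foo_example(4, 1.0f, 2.0f) from main would multiply two 4 x 4 constant matrices on the GPU. Note that this sketch calls a __device__ function, not a __global__ one. Launching a __global__ kernel from inside another kernel needs dynamic parallelism, which as far as I know requires compute capability 3.5 or higher and relocatable device code (nvcc -rdc=true), so the plain __device__ route is usually the simpler starting point.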