gpu programming - Memory Error in CUDA Program for Fermi GPU


I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.

Just to give some background, I am reading single-byte samples packed in the following manner in a file: real(signal 1), imaginary(signal 1), real(signal 2), imaginary(signal 2). (Each byte is a signed char, taking values between -128 and 127.) I read these into a char4 array and use the kernel given below to copy them to two float2 arrays, one corresponding to each signal. (This is just an isolated part of a larger program.)
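For comparison, the same conversion can be sketched on the host in plain C, which gives a reference result to check GPU output against. This is only a minimal sketch: the `sample4` and `complexf` structs and the function name `copy_data_for_fft_host` are hypothetical stand-ins (for CUDA's `char4` and `float2`, and for the kernel) and are not part of the original program.

```c
#include <stddef.h>

/* Hypothetical host-side stand-ins for CUDA's char4 and float2. */
typedef struct { signed char x, y, z, w; } sample4;
typedef struct { float x, y; } complexf;

/* Split packed samples into one complex array per signal.
 * Layout per sample4: real(1), imaginary(1), real(2), imaginary(2). */
static void copy_data_for_fft_host(const sample4 *in, complexf *sig1,
                                   complexf *sig2, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        sig1[i].x = (float) in[i].x;  /* real(signal 1) */
        sig1[i].y = (float) in[i].y;  /* imaginary(signal 1) */
        sig2[i].x = (float) in[i].z;  /* real(signal 2) */
        sig2[i].y = (float) in[i].w;  /* imaginary(signal 2) */
    }
}
```

Comparing the two float2 arrays copied back from the device against this host result, element by element, would show whether the GPU output is actually corrupted or only the launch is reported as failed.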

When I run the program using cuda-memcheck, I get either an unqualified "unspecified launch failure", or the same message along with "User Stack Overflow or Breakpoint Hit" or "Invalid __global__ write of size 8" at random thread and block indices.

The main kernel and launch-related code is reproduced below. The strange thing is that this code works (and cuda-memcheck throws no error) on a non-Fermi-class GPU that I have access to. Another thing I observed is that the Fermi gives no error for N less than 16384.

    #define N   32768

    int main(int argc, char *argv[])
    {
        char4 *pc4buf_h = NULL;
        char4 *pc4buf_d = NULL;
        float2 *pf2inx_d = NULL;
        float2 *pf2iny_d = NULL;
        dim3 dimbcopy(1, 1, 1);
        dim3 dimgcopy(1, 1);
        ...
        /* i do check for errors in the actual code */
        pc4buf_h = (char4 *) malloc(N * sizeof(char4));
        (void) cudaMalloc((void **) &pc4buf_d, N * sizeof(char4));
        (void) cudaMalloc((void **) &pf2inx_d, N * sizeof(float2));
        (void) cudaMalloc((void **) &pf2iny_d, N * sizeof(float2));
        ...
        dimbcopy.x = 1024;  /* number of threads in a block, for my gpu */
        dimgcopy.x = N / 1024;
        copydataforfft<<<dimgcopy, dimbcopy>>>(pc4buf_d,
                                               pf2inx_d,
                                               pf2iny_d);
        ...
    }

    __global__ void copydataforfft(char4 *pc4data,
                                   float2 *pf2fftinx,
                                   float2 *pf2fftiny)
    {
        /* global index of the sample handled by this thread */
        int i = (blockIdx.x * blockDim.x) + threadIdx.x;

        pf2fftinx[i].x = (float) pc4data[i].x;
        pf2fftinx[i].y = (float) pc4data[i].y;
        pf2fftiny[i].x = (float) pc4data[i].z;
        pf2fftiny[i].y = (float) pc4data[i].w;

        return;
    }
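The launch above divides the sample count by 1024 directly, which covers every element only because the count is an exact multiple of the block size. For other sizes, a common pattern is to round the block count up and have each thread check its index against the count before writing. The helper below is a minimal sketch of the rounding step; the name `blocks_for` is hypothetical, and this is not offered as an explanation of the errors reported here.

```c
#include <stddef.h>

/* Number of blocks needed to cover `count` elements with
 * `threads_per_block` threads each (ceiling division). */
static size_t blocks_for(size_t count, size_t threads_per_block)
{
    return (count + threads_per_block - 1) / threads_per_block;
}
```

With 32768 samples and 1024 threads per block this gives 32 blocks, the same as the plain division in the code above. When the grid is rounded up like this, the last block can contain threads whose index is past the end of the arrays, so the kernel would also need a guard such as `if (i < count)` around its writes.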

One other thing I noticed in the program is that if I comment out either the first two or the last two char-to-float assignment statements in the kernel, there's no memory error. If I comment out one from the first two (pf2fftinx) and another from the second two (pf2fftiny), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and uses 5 or 4 registers with two assignment statements commented out.

I tried the 32-bit toolkit in place of the 64-bit toolkit, 32-bit compilation with the -m32 compiler option, running without X Windows, etc., but the program behaviour is the same.

I use the CUDA 4.0 driver and runtime (I also tried CUDA 3.2) on RHEL 5.6. The GPU compute capability is 2.0.

Please help! I can post the entire code if anyone is interested in running it on their Fermi cards.

Update: Just for the heck of it, I inserted a __syncthreads() between the pf2fftinx and pf2fftiny assignment statements, and the memory errors disappeared for N = 32768. But at N = 65536, I still get errors. <-- That didn't last long. I am still getting errors.

Update: Continuing with the weird behaviour, when I run the program using cuda-memcheck, I get these 16x16 blocks of multi-coloured pixels distributed randomly all over the screen. This does not happen if I run the program directly.

The problem was a bad GPU card (see comments). [I'm adding this answer to remove the question from the unanswered list and make it more useful.]

