c - OpenMP optimizations?


I can't figure out why the performance of this function is so bad. I have a Core 2 Duo machine, and since I am only creating 2 threads, this is not an issue of too many threads. I expected the results to be closer to my pthread results.

These are my compilation flags (I am purposely not using optimization flags):

gcc -fopenmp -lpthread -std=c99 matrixmul.c -o matrixmul

These are my results:

sequential matrix multiply: 2.344972
pthread    matrix multiply: 1.390983
openmp     matrix multiply: 2.655910
cuda       matrix multiply: 0.055871
pthread test passed
openmp  test passed
cuda    test passed

void openmpmultiply(matrix* a, matrix* b, matrix* p)
{
  //int i,j,k;
  memset(*p, 0, sizeof(matrix));
  int tid, nthreads, i, j, k, chunk;
  #pragma omp parallel shared(a,b,p,nthreads,chunk) private(tid,i,j,k)
  {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
          nthreads = omp_get_num_threads();
        }
        chunk = 20;
        //#pragma omp parallel private(i, j, k)
        #pragma omp for schedule(static, chunk)
        for (i = 0; i < height; i++)
        {
          //printf("thread=%d did row=%d\n", tid, i);
                for (j = 0; j < width; j++)
                {
                        //#pragma omp parallel
                        for (k = 0; k < kheight; k++)
                                (*p)[i][j] += (*a)[i][k] * (*b)[k][j];
                }
        }
  }
}

Thanks for the help.

As matrix multiplication is embarrassingly parallel, your speedup should be near 2 on a dual core. Matrix multiplication can even show superlinear speedup (greater than 2 on a dual core) due to reduced cache misses. I don't see any obvious mistakes looking at your code, but something is wrong. Here are my suggestions:

  1. Just double-check the number of worker threads. In your case, 2 threads should be created. Or, try setting it explicitly by calling omp_set_num_threads. Also, see whether both cores are really utilized (i.e., 100% CPU utilization on Windows, 200% on Linux).

  2. Clean up your code by removing the unnecessary nthreads and chunk. These can be prepared outside of the parallel section. But, even if you leave them, they shouldn't hurt the speedup.

  3. Are the matrices square (i.e., height == width == kheight)? If a matrix is not square, there could be a workload imbalance that hurts the speedup. But, given the speedup of pthread (around 1.7, which is also odd to me), I don't think there is a workload imbalance.

  4. Try using the default static scheduling: don't specify a chunk, just write #pragma omp for.

  5. My best guess is that the structure of matrix is problematic. What does matrix look like? In the worst case, false sharing could hurt performance. But, in such a simple matrix multiplication, false sharing shouldn't be a big problem. (If you don't know the details, I can explain further.)

  6. Although it is commented out, never put #pragma omp parallel for at the for-k loop, as it causes a nested parallel loop. In matrix multiplication, it's absolutely wasteful since the outer loop is already parallelizable.

Finally, try running the following simple OpenMP matrix multiplication code, and see the speedup:

double a[n][n], b[n][n], c[n][n];
#pragma omp parallel for
for (int row = 0; row < n; ++row)
  for (int col = 0; col < n; ++col)
    for (int k = 0; k < n; ++k)
      c[row][col] += a[row][k]*b[k][col];
