c - OpenMP optimizations?


I can't figure out why the performance of this function is so bad. I have a Core 2 Duo machine, and since I am only creating 2 threads, this is not an issue of too many threads. I expected the results to be closer to my pthread results.

These are my compilation flags (I am purposely not using optimization flags):

gcc -fopenmp -lpthread -std=c99 matrixmul.c -o matrixmul

These are my results:

sequential matrix multiply: 2.344972
pthread    matrix multiply: 1.390983
openmp     matrix multiply: 2.655910
cuda       matrix multiply: 0.055871
pthread test passed
openmp  test passed
cuda    test passed

void openmpmultiply(matrix* a, matrix* b, matrix* p)
{
  //int i,j,k;
  memset(*p, 0, sizeof(matrix));
  int tid, nthreads, i, j, k, chunk;
  #pragma omp parallel shared(a,b,p,nthreads,chunk) private(tid,i,j,k)
  {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
          nthreads = omp_get_num_threads();
        }
        chunk = 20;
        //#pragma omp parallel private(i, j, k)
        #pragma omp for schedule(static, chunk)
        for (i = 0; i < height; i++)
        {
          //printf("thread=%d did row=%d\n", tid, i);
                for (j = 0; j < width; j++)
                {
                        //#pragma omp parallel
                        for (k = 0; k < kheight; k++)
                                (*p)[i][j] += (*a)[i][k] * (*b)[k][j];
                }
        }
  }
}

Thanks for the help.

As matrix multiplication is embarrassingly parallel, your speedup should be near 2 on a dual core. Matrix multiplication can even show superlinear speedup (greater than 2 on a dual core) due to reduced cache misses. I don't see any obvious mistakes looking at your code, but something is wrong. Here are my suggestions:

  1. Just double-check the number of worker threads. In your case, 2 threads should be created. Or, try setting it explicitly by calling omp_set_num_threads. Also, see whether both cores are really utilized (i.e., 100% CPU utilization on Windows, 200% on Linux).

  2. Clean up your code by removing the unnecessary nthreads and chunk. These can be prepared outside of the parallel section. But, even if you leave them, they shouldn't hurt the speedup.

  3. Are the matrices square (i.e., height == width == kheight)? If a matrix is not square, there could be a workload imbalance that hurts the speedup. But, given the speedup of pthread (around 1.7, which is also odd to me), I don't think there is a workload imbalance.

  4. Try using the default static scheduling: don't specify a chunk, just write #pragma omp for.

  5. My best guess is that the structure of matrix is problematic. What does matrix look like? In the worst case, false sharing could hurt performance. But, in such a simple matrix multiplication, false sharing shouldn't be a big problem. (If you don't know the details, I can explain further.)

  6. Although it is commented out, never put #pragma omp parallel for at the for-k loop, as it causes a nested parallel loop. In matrix multiplication, it's absolutely wasteful since the outer loop is already parallelizable.

Finally, try running the following simple OpenMP matrix multiplication code, and see the speedup:

double a[n][n], b[n][n], c[n][n];
#pragma omp parallel for
for (int row = 0; row < n; ++row)
  for (int col = 0; col < n; ++col)
    for (int k = 0; k < n; ++k)
      c[row][col] += a[row][k]*b[k][col];
