c - OpenMP optimizations?
I can't figure out why the performance of this function is so bad. I have a Core 2 Duo machine, and I know I'm only creating 2 threads, so it's not an issue of too many threads. I expected the results to be much closer to my pthread results.
These are my compilation flags (purposely not using optimization flags):

gcc -fopenmp -lpthread -std=c99 matrixmul.c -o matrixmul
These are my results:
sequential matrix multiply: 2.344972
pthread matrix multiply:    1.390983
openmp matrix multiply:     2.655910
cuda matrix multiply:       0.055871
pthread test passed
openmp test passed
cuda test passed
void openmpmultiply(matrix* a, matrix* b, matrix* p)
{
    //int i,j,k;
    memset(*p, 0, sizeof(matrix));
    int tid, nthreads, i, j, k, chunk;

    #pragma omp parallel shared(a,b,p,nthreads,chunk) private(tid,i,j,k)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
        }
        chunk = 20;

        // #pragma omp parallel private(i, j, k)
        #pragma omp for schedule(static, chunk)
        for (i = 0; i < height; i++)
        {
            //printf("thread=%d did row=%d\n",tid,i);
            for (j = 0; j < width; j++)
            {
                //#pragma omp parallel
                for (k = 0; k < kheight; k++)
                    (*p)[i][j] += (*a)[i][k] * (*b)[k][j];
            }
        }
    }
}
Thanks for the help.
Since matrix multiplication is embarrassingly parallel, your speedup should be near 2 on a dual core. Matrix multiplication even typically shows a superlinear speedup (greater than 2 on a dual core) due to reduced cache misses. I don't see any obvious mistakes looking at your code, but something's wrong. Here are my suggestions:
Just double-check the number of worker threads. In your case, 2 threads should be created. Or, try setting it explicitly by calling omp_set_num_threads. Also, check whether both cores are actually utilized (i.e., 100% CPU utilization on Windows, 200% on Linux). A small check like the sketch below can help.
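As a minimal, stand-alone sketch of that check (not part of your program, just something you can compile with -fopenmp and run):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(2);          /* force 2 worker threads */

    #pragma omp parallel
    {
        /* each thread reports its id; the team size should be 2 */
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}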
Clean up the code by removing the unnecessary nthreads and chunk. These can be prepared outside of the parallel section. But, even if you keep them, it shouldn't hurt the speedup.

Are the matrices square (i.e., height == width == kheight)? If they're not square, there could be a workload imbalance that hurts the speedup. But, given the speedup of pthread (around 1.6, which looks odd to me), I don't think there's a workload imbalance.
Try using the default static scheduling: don't specify chunk, just write #pragma omp for. Combined with the cleanup above, the function then shrinks to something like the sketch below.
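This is only my rewrite, not your original structure; it assumes your matrix typedef and that height, width, and kheight are visible constants, and it needs <string.h> for memset:

void openmpmultiply(matrix* a, matrix* b, matrix* p)
{
    memset(*p, 0, sizeof(matrix));

    /* one combined parallel for with the default static schedule;
       no tid/nthreads/chunk bookkeeping inside the region */
    #pragma omp parallel for
    for (int i = 0; i < height; i++)
        for (int j = 0; j < width; j++)
            for (int k = 0; k < kheight; k++)
                (*p)[i][j] += (*a)[i][k] * (*b)[k][j];
}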
My best guess is that the structure of matrix is problematic. What does matrix actually look like? (I sketch my guess about its layout below.) In the worst case, false sharing could hurt performance. But, in such a simple matrix multiplication, false sharing shouldn't be a big problem. (If you don't know the details, I can explain further.)

Although it is commented out, never put #pragma omp parallel for on the for-k loop, as that causes a nested parallel loop. In matrix multiplication it's absolutely wasteful, since the outer loop is already parallelizable.
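For reference, since your code does memset(*p, 0, sizeof(matrix)) and indexes (*p)[i][j], I'm assuming a typedef along these lines; the element type and dimensions are pure guesses on my part:

/* assumed shape only -- adjust the element type and sizes to your program */
#define height  1000
#define width   1000
#define kheight 1000

typedef float matrix[height][width];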
Finally, try running the following simple OpenMP matrix multiplication code and see the speedup:
double a[n][n], b[n][n], c[n][n];

#pragma omp parallel for
for (int row = 0; row < n; ++row)
    for (int col = 0; col < n; ++col)
        for (int k = 0; k < n; ++k)
            c[row][col] += a[row][k]*b[k][col];
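If it helps, here is one way I'd wrap that kernel to time it; the value of n, the initialization, and the use of omp_get_wtime are my additions, not part of the snippet above:

#include <stdio.h>
#include <omp.h>

#define n 1000

static double a[n][n], b[n][n], c[n][n];   /* static, so c starts zeroed */

int main(void)
{
    /* fill a and b with something non-trivial */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            a[i][j] = i + j;
            b[i][j] = i - j;
        }

    double start = omp_get_wtime();

    #pragma omp parallel for
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col)
            for (int k = 0; k < n; ++k)
                c[row][col] += a[row][k]*b[k][col];

    printf("openmp matrix multiply: %f\n", omp_get_wtime() - start);
    return 0;
}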