OpenMP Part 2: Understanding OpenMP Structure
What is OpenMP?
OpenMP is an API (Application Programming Interface) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran. A shared-memory multiprocessing program is a program that uses more than one thread, with all threads reading from and writing to a common memory address space. OpenMP is a set of compiler directives and runtime library routines for writing parallel application programs.
Core syntax of OpenMP
As mentioned before, most of the constructs in OpenMP are just compiler directives. So what are compiler directives? A directive is a “pragma” statement that gives the compiler information about how it should process the following block of code or input. Remember that in the last module we used “#pragma omp parallel” and then started a code block with curly braces which encapsulated the print statement. That “#pragma” statement is a directive telling the compiler to process the following block of code in parallel threads using OpenMP.
A general syntax for this directive is as follows:
#pragma omp construct [clause [clause] ... ]
To use OpenMP we have to include its header file in our C program: #include <omp.h>. This header contains the function prototypes and types required to write parallel programs. Most OpenMP constructs, such as parallel, apply to a structured block. A structured block is a block of code containing one or more statements, with exactly one point of entry at the top and exactly one point of exit at the bottom.
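For example, a parallel construct combined with a num_threads clause (a standard OpenMP clause that requests a team of a given size; the runtime may still provide fewer threads) applied to a structured block could look like the following sketch:

#include <stdio.h>
#include <omp.h>

int main() {
    // "parallel" is the construct; "num_threads(4)" is a clause that
    // requests a team of 4 threads (the runtime may provide fewer)
    #pragma omp parallel num_threads(4)
    {   // the structured block starts here: exactly one entry at the top...
        printf("Hello from one of the threads\n");
    }   // ...and exactly one exit at the bottom, where the threads join
    return 0;
}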
Problems with Parallel Programming
Just using the directives provided by OpenMP in your code won’t help you much. There is an initial amount of time where the developer of a program needs to decide which parts of the program need to be, and can be, parallelized. Why? There are many reasons for this, but we’ll go through a few of them to understand what we need to do to create an efficient parallel program.
Computation being Parallelized is too small:
If the actual computation that needs to be performed is small, for example the addition of two numbers, then the extra time required to initialize and manage threads becomes an overhead for your application. The time required to initialize the parallel code block may end up outweighing the benefits provided by parallelism. This happens when the computation performed by each thread takes less time than the initialization of the threads. For example, let’s say the initialization time for a parallel code block is 80ms and the actual computation that needs to be done takes just 40ms; making the program parallel would only make it take longer.
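If you want a rough feel for this overhead on your own machine, one way (just a sketch; the exact numbers will vary with your compiler, OpenMP runtime, and hardware) is to time an empty parallel region with omp_get_wtime() and compare it with the time taken by the tiny computation itself:

#include <stdio.h>
#include <omp.h>

int main() {
    double start;

    // Time a trivial serial computation
    start = omp_get_wtime();
    int sum = 10 + 6;
    double compute_time = omp_get_wtime() - start;

    // Time just the creation and joining of a team of threads
    start = omp_get_wtime();
    #pragma omp parallel
    {
        // empty region: any time measured here is mostly thread management overhead
    }
    double fork_join_time = omp_get_wtime() - start;

    printf("computation: %f s, fork/join overhead: %f s, sum: %d\n",
           compute_time, fork_join_time, sum);
    return 0;
}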
Race condition due to shared memory:
OpenMP uses a shared-address model, therefore threads read from and write to a shared memory address space. Unintended sharing of data may result in incorrect and inconsistent results each time you run a program. This happens because threads run in parallel and try to access the same memory address simultaneously for read/write operations.

For example, assume we have 2 threads and a shared address containing the value 0. The program needs to add two numbers, 10 and 6, to the value stored in this shared memory address. As soon as both threads are initialized they try to access the shared value simultaneously. By doing so, both of them end up reading the same value from the shared memory, which is 0. Now one of these threads computes 0 + 10 and stores the result back in the shared memory. Meanwhile the other thread computes 0 + 6 and stores its result back in the shared memory. Here the second thread essentially overwrites the change made by the first thread, because of which the final answer stored in the shared memory is 6 and not 16. Such an undesirable situation is called a race condition.
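A minimal sketch of this scenario could look like the code below (using omp_get_thread_num() to decide which thread adds which number is just for illustration, and num_threads(2) only requests two threads; the runtime may provide fewer):

#include <stdio.h>
#include <omp.h>

int main() {
    int shared_value = 0;   // lives in shared memory, visible to all threads

    #pragma omp parallel num_threads(2)
    {
        // Both threads may read shared_value as 0 at the same time,
        // so one of the two updates can get lost.
        if (omp_get_thread_num() == 0)
            shared_value = shared_value + 10;
        else
            shared_value = shared_value + 6;
    }

    // We expect 16, but a race can leave 10 or 6 here instead
    printf("Final value: %d\n", shared_value);
    return 0;
}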
Fork-Join Model
Before going straight into code we need to understand how parallelism takes place. The program starts with a single thread (often called the main thread). The main thread gets forked into a number of threads that can execute concurrently. The main thread together with the forked threads is then referred to as a team of threads. All these threads run in parallel, and once their computations are completed they merge/join back into the main thread. A forked thread can itself act as a main thread and be further forked into more threads.
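To illustrate that last point, OpenMP allows nested parallel regions, where each thread of the outer team forks its own inner team. Whether the inner regions actually get extra threads depends on your OpenMP runtime and its settings; calling omp_set_nested(1) is one common way to enable it (newer OpenMP versions prefer omp_set_max_active_levels()), so treat the sketch below as illustrative only:

#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_nested(1);   // allow nested parallel regions (runtime-dependent)

    #pragma omp parallel num_threads(2)       // the main thread forks a team of 2
    {
        int outer_id = omp_get_thread_num();

        #pragma omp parallel num_threads(2)   // each forked thread forks again
        {
            printf("Outer thread %d, inner thread %d\n",
                   outer_id, omp_get_thread_num());
        }
    }                                         // inner teams join, then the outer team joins
    return 0;
}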
We create threads in OpenMP by using the parallel construct:
int i = 1;                                  // declared outside the block: shared by all threads
#pragma omp parallel
{
    int Thread_ID = omp_get_thread_num();   // declared inside the block: private to each thread
    printf("Thread ID: %d\t Also value of i: %d\n", Thread_ID, i);
}
In the above snippet, threads are created when the compiler reads the parallel construct. All variables declared outside the code block of the parallel construct are made available to all threads, as they live in memory that is shared by the whole program. Any variables declared inside the code block of the parallel construct exist in the stack space of each thread and aren’t visible to other threads. Hence, in the above example, all threads have access to the variable i (shared memory), whereas each thread holds its own copy of Thread_ID which isn’t visible to other threads.
Let’s Code - Counter
We’ll create a simple program to increment a number
#include <stdio.h>
#include <omp.h>

int main () {
    int counter = 0;                        // shared by all threads

    #pragma omp parallel
    {
        int thread_num = omp_get_thread_num();
        int num_of_threads = omp_get_num_threads();
        counter = counter + 1;              // unsynchronized update of shared memory: race condition
        printf("Thread ID: %d \t Total number of threads %d \t Counter Value: %d\n", thread_num, num_of_threads, counter);
    }

    return 0;
}
- omp_get_thread_num() - returns an int value representing the ID of the calling thread
- omp_get_num_threads() - returns an int value representing the total number of threads in the team created by the parallel construct
On running this program we see that it verifies some of the points discussed in this module. The variable counter exists in the shared memory space and is therefore visible to all threads. Running the program multiple times shows that the threads execute in an unpredictable order; we can’t predict which thread will run when. We also see multiple threads trying to update the shared memory, leading to a race condition. We can verify this because 1 is added to counter once per thread, so the final value of counter should equal the number of threads. But that is often not the case: the final value of counter frequently ends up less than the number of threads because some updates get lost.
In the next module we’ll see how we can control this race condition using synchronization.