
Parallelization of Computational Fluid Dynamics Simulator

The Original Problem

Over the course of a semester, three classmates, our professor, and I worked together to create a two-dimensional atmospheric fluid simulator. As part of a class assignment, I modified my version of the code to simulate atmospheric gravity waves caused by airflow over a mountain. The results looked pretty cool, and I am now much more aware of the real gravity waves visible through cloud formations.
Figure: Simulated Atmospheric Gravity Waves.
Figure: Gravity waves visible through clouds over British Columbia. Original photo: no author listed.
The simulator's main shortcomings are speed and scalability. For example, the graphs above required eight minutes to generate; doubling the number of grid points in the x and y directions to achieve higher fidelity would increase that time to several hours!

I wanted a program that could handle finer grid sizes, smaller time steps, bigger spatial domains, and increased complexity without requiring me to wait hours for the results. One solution is to parallelize the simulator: divide the problem into smaller pieces and divvy those pieces out to multiple processors. The process of parallelizing the program and the metrics of the parallel program's performance are summarized below.

A Summary of the Simulator

The parallelization scheme for any given program depends on that program's structure, so a brief summary of the original simulator is in order.
​
In the simulator, the user specifies four spatial parameters: the height and width of the physical domain and the computational grid spacings in the x and y directions, ΔX and ΔY. At execution, the program generates a map of coordinate points spanning the specified height and width, spaced at intervals of ΔX and ΔY. These points are referred to in the literature as the interior domain and are illustrated in the figure below, most prominently in frames 1 and 2. The program also generates a set of points immediately adjacent to the boundaries of the interior domain, referred to as the boundary domain. These points are used to make sure that the solutions calculated within the interior domain remain physically viable.
Figure: Animated illustration of physical domain generation. In this example, the interior domain covers a space of 14ΔY by 40ΔX. Solutions are only valid in the interior domain.
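To make this concrete, here is a minimal Fortran sketch of the grid setup (a sketch only: the variable names, spacings, and the choice of two ghost layers per side are my illustrative picks, not lifted from the original code):

! Sketch of interior + boundary ("ghost") point generation.
! All names and values here are illustrative, not from the original simulator.
program grid_setup
  implicit none
  real, parameter :: width  = 20.0e3            ! physical x-extent [m]
  real, parameter :: height = 7.0e3             ! physical y-extent [m]
  real, parameter :: dx = 500.0, dy = 500.0     ! grid spacings [m]
  integer :: nx, ny, i, j
  real, allocatable :: x(:), y(:)

  nx = nint(width / dx)                         ! interior points in x (40 here)
  ny = nint(height / dy)                        ! interior points in y (14 here)

  ! Indices 1..nx (1..ny) form the interior domain; the extra layers on
  ! each side (0, -1 and nx+1, nx+2) are the boundary/ghost points.
  allocate(x(-1:nx+2), y(-1:ny+2))
  do i = -1, nx + 2
     x(i) = (i - 0.5) * dx
  end do
  do j = -1, ny + 2
     y(j) = (j - 0.5) * dy
  end do

  print *, 'Interior domain: ', ny, ' dy by ', nx, ' dx'
end program grid_setup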
The program models the atmosphere with Euler's equations. After generating the physical domain, it integrates these equations with a dimensionally-split, 2-step Lax-Wendroff (LxW) method, updating the air density, momentum, and energy in the interior domain at each time step. The boundary-domain points are updated separately at each time step by a set of boundary-condition functions; this is described in more detail in the next section. Integration is iterated until a user-specified stop time is reached.
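For reference, the equations and the update they receive look roughly like the following (standard textbook forms; the exact source terms and discretization in the original code may differ, so treat this as a sketch). The 2-D Euler equations in conservation form are
\[\frac{\partial U}{\partial t}+\frac{\partial F(U)}{\partial x}+\frac{\partial G(U)}{\partial y}=S,\qquad U=\begin{pmatrix}\rho\\ \rho u\\ \rho v\\ E\end{pmatrix},\quad F=\begin{pmatrix}\rho u\\ \rho u^{2}+p\\ \rho u v\\ u(E+p)\end{pmatrix},\quad G=\begin{pmatrix}\rho v\\ \rho u v\\ \rho v^{2}+p\\ v(E+p)\end{pmatrix},\]
where S collects source terms such as gravity. In the x sweep of the dimensional split, the 2-step LxW method takes a half step to the cell interfaces and then a full step using the interface fluxes:
\[U^{n+1/2}_{i+1/2}=\tfrac{1}{2}\left(U^{n}_{i}+U^{n}_{i+1}\right)-\frac{\Delta t}{2\Delta x}\left(F^{n}_{i+1}-F^{n}_{i}\right),\qquad U^{n+1}_{i}=U^{n}_{i}-\frac{\Delta t}{\Delta x}\left(F^{n+1/2}_{i+1/2}-F^{n+1/2}_{i-1/2}\right),\]
with an analogous sweep in y. The half-step values at i ± 1/2 are why each point needs information from its neighbors, which matters for the parallelization described next.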
​
A version of the program created in class may be found on Dr. J. Snively's course webpage; it is coded in MATLAB and does not include the mountain-induced gravity-wave generator. My parallel implementation is coded in Fortran.

Parallelization: Domain Division

Parallelization was achieved by dividing the physical domain into sections and then assigning each one to a processor, effectively creating smaller versions of the original problem across multiple processors.

As parts 1 and 2 of the figure below illustrate, if there are four processors, the domain is split into four pieces along the x direction. Each processor receives a piece and adds new boundary domains. The original problem had one processor integrating over a 14ΔY by 40ΔX interior domain; the parallelized problem has four processors, each integrating over a 14ΔY by 10ΔX interior domain. If a single integration pass over the original interior domain required 8 milliseconds, each processor's pass over its smaller domain should take something closer to 2 ms.
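A minimal Fortran sketch of the splitting step (illustrative names; it assumes the interior width divides evenly among the processors):

! Sketch: each MPI rank claims an equal slice of the interior domain in x.
! Names (nx_global, nx_local, ...) are illustrative, not from the original code.
program split_domain
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  integer :: nx_global, nx_local, i_start, i_end
  integer, parameter :: ny = 14              ! interior points in y, as in the figure

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  nx_global = 40                             ! interior points in x, as in the figure
  nx_local  = nx_global / nprocs             ! assumes an even split, e.g. 40/4 = 10

  ! Global x-indices owned by this rank.
  i_start = rank * nx_local + 1
  i_end   = i_start + nx_local - 1

  print *, 'rank', rank, 'owns columns', i_start, 'to', i_end, &
           '(', ny, 'dy by', nx_local, 'dx )'

  call MPI_Finalize(ierr)
end program split_domain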

Parallelization always comes with some computational overhead from communication between processors. The scheme applied in this project also carries an extra cost in the form of additional boundary domains. As noted above, the boundary domain ensures that the laws of physics are respected within the interior domain. In the parallel arrangement, the boundary domains take on a second role: the ghost cells along the vertical edges are used to communicate physical quantities between processors. This is summarized in the next section.
Figure: Example of domain parallelization. The interior domain is divided into equally-sized sections, and then each processor takes a section.

Parallelization: Boundary Communication

The LxW method updates the fluid-property values at each point in the interior domain using the values at neighboring points. For example, the point on the bottom-right edge of processor one's interior domain needs to "know" the values at the bottom-left edge of processor two's interior domain. However, processors do not share memory by default, so parallelizing the problem requires additional code that tells each processor to communicate some of its values to its neighbors. This communication is handled by the MPI routines MPI_SEND and MPI_RECV (provided here by the Open MPI library).

Boundary-information sharing requires that each processor both send and receive information. To avoid the stalling (deadlock) issues that can arise from blocking MPI send and receive commands, my parallel program includes a scheme that unambiguously defines which processors are sending and which are receiving at any moment, based on the parity of each processor's identification number. Pseudocode of the boundary-sharing commands is provided below. Note that "right edge info" refers to the density, momentum, and energy values within 2ΔX of the right edge.
temp_ID = my_processor_ID
do iteration = 1, 2
    if temp_ID is even
        send right edge info to (my_processor_ID + 1)
        send left edge info to (my_processor_ID - 1)
    else
        receive left edge info from (my_processor_ID - 1)
        receive right edge info from (my_processor_ID + 1)
    endif
    temp_ID = temp_ID + 1
enddo
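In Fortran, one parity round of this exchange looks roughly like the sketch below (a sketch only: the array layout, message tags, and two-column edge width are illustrative choices, not copied from my actual source):

! Sketch of the parity-based edge exchange: even ranks send, odd ranks receive,
! then the roles flip. q(1:4, i, j) holds density, x/y momentum, and energy;
! columns -1:0 and nx_local+1:nx_local+2 are the ghost columns.
subroutine exchange_edges(q, nx_local, ny, rank, nprocs)
  use mpi
  implicit none
  integer, intent(in)    :: nx_local, ny, rank, nprocs
  real(8), intent(inout) :: q(4, -1:nx_local+2, ny)
  integer :: ierr, stat(MPI_STATUS_SIZE), n, temp_id, iter

  n = 4 * 2 * ny                              ! values in a two-column edge strip
  temp_id = rank
  do iter = 1, 2
     if (mod(temp_id, 2) == 0) then
        if (rank + 1 <= nprocs - 1) &         ! send right edge to right neighbor
           call MPI_Send(q(:, nx_local-1:nx_local, :), n, MPI_DOUBLE_PRECISION, &
                         rank+1, 0, MPI_COMM_WORLD, ierr)
        if (rank - 1 >= 0) &                  ! send left edge to left neighbor
           call MPI_Send(q(:, 1:2, :), n, MPI_DOUBLE_PRECISION, &
                         rank-1, 1, MPI_COMM_WORLD, ierr)
     else
        if (rank - 1 >= 0) &                  ! fill left ghost columns from left neighbor
           call MPI_Recv(q(:, -1:0, :), n, MPI_DOUBLE_PRECISION, &
                         rank-1, 0, MPI_COMM_WORLD, stat, ierr)
        if (rank + 1 <= nprocs - 1) &         ! fill right ghost columns from right neighbor
           call MPI_Recv(q(:, nx_local+1:nx_local+2, :), n, MPI_DOUBLE_PRECISION, &
                         rank+1, 1, MPI_COMM_WORLD, stat, ierr)
     end if
     temp_id = temp_id + 1
  end do
end subroutine exchange_edges

The matching tags (0 for right-going data, 1 for left-going data) keep a processor from confusing the two strips it receives in the same round.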
The parity-based send-receive arrangement is flawed in that it restricts the program to an even number of processors. This limitation was particularly obnoxious when I was initially testing the program on a four-core cluster. Better methods of avoiding MPI stalls certainly exist; whenever I learn them, I'll use them.
Figure: Animated illustration of boundary-information sharing between processors. Frame (1) identifies the left and right edges of each processor's interior domain, colored green, yellow, blue, and purple, along with the starting boundary domains, colored gray. (2) shows the even processors sending their right-edge information to their right neighbors. (3) shows the even processors sending left-edge information to their left neighbors. (4) and (5) repeat (2) and (3), but with the odd processors sending the information. Notice that the initial boundary-domain information is completely replaced by the end of (5). This boundary communication is repeated after every integration iteration.

Performance of Parallel Program: Timings

My goal in parallelizing the atmospheric fluid dynamics simulator was to create a faster and more scalable program. These qualities can be quantified through program speedup and efficiency, both of which require program execution timings. To obtain timings, the program was compiled and executed on a 16-core Linux cluster hosted and maintained by the Embry-Riddle Physical Science department. The following processor-number, width, and height combinations were used:
  • Number of processors: 2, 4, 6, ... 16
  • Width (x-range): 120k, 140k, 160k, ... 260k meters
  • Height (y-range): 10k, 60k, 110k, 160k meters
Each combination was repeated twenty times for a total of 5120 timing measurements. The average run time for each combination is plotted in the following figure slide show.
Figure: Average Program Execution Times.  
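Timing a run is straightforward; the sketch below shows one way to do it with MPI_Wtime (illustrative only: the dummy loop stands in for the simulator's integration loop, and my actual instrumentation may differ).

! Sketch of a timing harness around a parallel run.
program time_run
  use mpi
  implicit none
  integer :: ierr, rank, k
  real(8) :: t_start, t_end, work

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  call MPI_Barrier(MPI_COMM_WORLD, ierr)     ! line all ranks up before timing
  t_start = MPI_Wtime()

  work = 0.0d0                               ! stand-in for the actual integration
  do k = 1, 10000000
     work = work + sin(real(k, 8))
  end do

  call MPI_Barrier(MPI_COMM_WORLD, ierr)     ! wait for the slowest rank
  t_end = MPI_Wtime()

  if (rank == 0) print *, 'elapsed seconds:', t_end - t_start, work
  call MPI_Finalize(ierr)
end program time_run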
The recorded times indicate that parallelization made the simulator faster. This indicates at least one level of success in meeting my main objective. 
​
There are certainly some irregularities. A quick scan through the figures shows that many of the timings do not decrease as the number of processors increases. In the 160,000 m x-domain simulation, for example, the 6-processor computation times for the top two curves are nearly the same, despite the fact that the top curve required a greater number of computations and communications.

The average run time for each combination may also be plotted in three dimensions, as the following figures show. I used MATLAB's curve fitting tool to create a surface interpolation of the timing data, which, I think, helps visualize the relationships among all of the variables.
​
Figure: Measured and Interpolated Timing Results. The first four graphs show the measured timing data (blue dots) and a surface generated from an interpolation of the data. The last two graphs show only the generated surfaces, overlaying all of the timing surfaces in a single plot.

Performance and Scalability: Analysis

Timings show that the program runs faster with more processors. To quantify how well it scales, I computed the program's speedup and efficiency, defined as
\[\text{speedup}=\frac{T_{serial}}{T_{parallel}},\qquad \text{efficiency}=\frac{\text{speedup}}{NP},\]
where T_serial is the run time of the serial program, T_parallel is the run time of the parallel program, and NP is the number of processors.
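As a purely illustrative example with made-up numbers (not my measured timings): if a serial run took 480 seconds and the same problem finished in 80 seconds on 8 processors, then
\[\text{speedup}=\frac{480}{80}=6,\qquad \text{efficiency}=\frac{6}{8}=0.75.\]
An efficiency below 1 reflects the communication and extra boundary-domain overhead discussed earlier.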
Figure: Parallel Program Speedup.
Figure: Parallel Program Efficiency.