Monday, May 2, 2011

CIS 565 - GPU Programming and Architecture Project

Optimization 2: Sparsity Optimization of the Conjugate Gradient Solver

It is interesting to note that the conjugate gradient solver needs to solve a system of equations of the form A p = d, where A is a very sparse matrix computed from the boundary conditions of the underlying grid, p is the unknown pressure vector needed to make the velocity field divergence-free, and d is the vector of current velocity divergences of the cells. The size of A grows quadratically with the number of grid cells, yet A has at most 7 non-zero entries in each row and each column.

Therefore, I decided to take advantage of the sparsity of A in the conjugate gradient solver. Instead of performing large matrix-vector multiplications over rows consisting predominantly of zeros, I made the GPU-based conjugate gradient solver compute the non-zero coefficients of A directly from the grid. Thus, instead of computing A on the CPU and passing it down to the GPU, I send the grid information (namely, whether each cell is fluid or solid) to the GPU. The grid information is a much smaller data set than A, which saves on transfer time. However, the more important benefit of this optimization is that each matrix-vector multiplication performs at most 7 multiplications per row of A.
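To make this concrete, below is a minimal sketch of what such a matrix-free multiplication y = A p might look like as a CUDA kernel, assuming the grid stores a simple fluid/solid flag per cell and the system is the standard 7-point pressure Laplacian (constant scale factors such as dt/(rho*dx^2) are omitted). The names here (applyA, cellType, FLUID) are illustrative placeholders, not the identifiers from my actual code.

// Matrix-free sketch of y = A * p for the 7-point pressure Laplacian.
// One thread per cell; cellType, FLUID and the parameter names are placeholders.
#define FLUID 1   // assumed encoding of a fluid cell; solid cells are anything else

__global__ void applyA(const int* cellType, const float* p, float* y,
                       int nx, int ny, int nz)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // flattened cell index
    if (c >= nx * ny * nz) return;

    // Recover (i, j, k) from the flattened index.
    int i = c % nx;
    int j = (c / nx) % ny;
    int k = c / (nx * ny);

    if (cellType[c] != FLUID) { y[c] = 0.0f; return; }

    float diag = 0.0f;   // counts fluid neighbors (Neumann condition at solid walls)
    float off  = 0.0f;   // accumulates -p[neighbor] over fluid neighbors

    const int di[6] = { 1, -1, 0, 0, 0, 0 };
    const int dj[6] = { 0, 0, 1, -1, 0, 0 };
    const int dk[6] = { 0, 0, 0, 0, 1, -1 };
    for (int n = 0; n < 6; ++n) {
        int ii = i + di[n], jj = j + dj[n], kk = k + dk[n];
        if (ii < 0 || ii >= nx || jj < 0 || jj >= ny || kk < 0 || kk >= nz)
            continue;                                // cells outside the grid act as solid
        int nb = ii + nx * (jj + ny * kk);
        if (cellType[nb] == FLUID) {
            diag += 1.0f;
            off  -= p[nb];
        }
    }
    y[c] = diag * p[c] + off;                        // at most 7 coefficients per row are touched
}

Each conjugate gradient iteration then calls a kernel of this kind once, instead of reading a stored 2048x2048 matrix, which is where most of the savings come from.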

The table below shows a performance comparison between simulations using the CPU implementation of the conjugate gradient solver, the naive GPU implementation, and the GPU implementation optimized for sparsity. The averages were computed over 100 frames of simulation. The grid contained 2048 cells, which means that the conjugate gradient solver in the projection step solved a linear system with a 2048x2048 sparse matrix. In all cases the conjugate gradient solver was allowed a maximum of 200 iterations and a target error tolerance of 0.01.

                              | CPU solver*          | GPU solver version 1** | GPU solver version 2**
Grid Size                     | 64x32x1 (2048 cells) | 64x32x1 (2048 cells)   | 64x32x1 (2048 cells)
Avg. time per frame           | 33774.14 msec        | 1852.69 msec           | 249.55 msec
Avg. time per projection step | 33754.32 msec        | 1832.78 msec***        | 230.95 msec***
Speedup (proj. step)          | x1.0 (base)          | x18.42                 | x146.15
Speedup (frame)               | x1.0 (base)          | x18.23                 | x135.34

* Intel Core i7 (2.67 GHz)
** nVidia GeForce GTX 295
*** The reported time per projection step includes the time it takes to transfer data from the CPU to the GPU.

Even with the sparsity optimization, the bulk of the time is still spent in the projection step (92.55% of the frame time). My future goals are to move the entire simulation to the GPU and to implement volumetric rendering for better visualization of the results. However, I don't expect any significant improvement in running time from moving the rest of the simulation to the GPU, since the CPU part of the implementation currently takes about 20 msec, which is insignificant compared to the roughly 230 msec required for the projection step.

Link to my final presentation poster: poster
The video below shows a simulation on a 32x32x5 grid (5120 cells), which ran at about 750 msec per frame. The smoke in the simulation was visualized using translucent cubes.

Thursday, April 7, 2011

CIS 565 - GPU Programming and Architecture Project

Optimization 1: Moving the conjugate-gradient solver to the GPU

I successfully moved the CPU conjugate gradient solver to the GPU. The process involved writing a number of small (and as yet unoptimized) linear algebra kernels, such as matrix-vector multiplication, vector inner product, vector L2-norm, etc. I ran unit tests on the individual kernels to make sure they were working properly, but I left benchmarking and optimization of the kernels for a later stage of the project.
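As an example of the kind of kernel involved, here is a rough sketch (with placeholder names, not my actual code) of a block-wise inner product using a shared-memory reduction. Each block writes one partial sum, which still needs a second reduction pass or a small host-side sum, and the L2-norm is then just the square root of a vector's inner product with itself.

// Partial dot product: each block reduces its chunk into one value in 'partial'.
// blockDim.x is assumed to be a power of two for the tree reduction below.
__global__ void dotPartial(const float* a, const float* b, float* partial, int n)
{
    extern __shared__ float cache[];                 // one float per thread in the block
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Grid-stride loop so the kernel handles vectors of any length.
    float sum = 0.0f;
    for (int i = gid; i < n; i += gridDim.x * blockDim.x)
        sum += a[i] * b[i];
    cache[tid] = sum;
    __syncthreads();

    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = cache[0];
}

// Example launch (the third argument is the dynamic shared memory size):
// dotPartial<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_a, d_b, d_partial, n);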

The table below shows a performance comparison between simulations using the CPU implementation of the conjugate gradient solver I had before and the naive (unoptimized) GPU implementation of the solver. The averages were computed over 100 frames of simulation. The grid contained 2048 cells, which means that the conjugate gradient solver in the projection step solved a linear system with a 2048x2048 sparse matrix. In both cases the conjugate gradient solver was allowed a maximum of 200 iterations and a target error tolerance of 0.01.

                              | CPU solver*          | GPU solver**
Grid Size                     | 64x32x1 (2048 cells) | 64x32x1 (2048 cells)
Avg. time per frame           | 33774.14 msec        | 1852.69 msec
Avg. time per projection step | 33754.32 msec        | 1832.78 msec***
Speedup (proj. step)          | x1.0 (base)          | x18.42
Speedup (frame)               | x1.0 (base)          | x18.23

* Intel Core i7 (2.67 GHz)
** nVidia GeForce GTX 295
*** The reported time per projection step includes the time it takes to transfer data from the CPU to the GPU.

As seen from the table above, the projection step takes more than 99% of the time per frame. Thus, achieving a speedup of x18.42 on the projection step computations results in roughly the same speedup of the overall simulation.

(to be continued ...)

Tuesday, March 22, 2011

CIS 565 - GPU Programming and Architecture Project

I am embarking on a new project for my GPU Programming and Architecture class. The project goal is to produce a GPU implementation of a physically-based fluid simulator, using a semi-Lagrangian advection scheme on a staggered grid. Technically, I will be modifying and extending the CPU implementation that I made for another class (Physically-Based Animation).

The first goal of the project is to convert the projection step (i.e., the pressure update), which is the bottleneck of the CPU implementation, to run on the GPU. The projection step computes pressure differences between neighboring cells and adjusts the fluid's velocity field so that it stays incompressible and divergence-free. The step involves solving a linear system A p = d, where A is an n x n sparse matrix (n being the number of cells in the simulation), d is a vector containing the velocity divergence of each of the n cells, and p is an unknown vector containing the pressure in each of the n cells. A and d are known, and the goal is to compute p (conceptually, p = A^-1 d, although the conjugate gradient solver I use never forms the inverse explicitly). Once the pressures for all cells are computed, the velocity field can be adjusted to become divergence-free.
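For reference, the overall shape of the (unpreconditioned) conjugate gradient iteration that solves A p = d looks roughly like the host-side sketch below. The helpers dot and axpy, and the applyA callback that computes q = A s, are stand-ins for whatever vector and matrix-vector routines the implementation provides; this is an outline under those assumptions, not the actual project code.

#include <cmath>
#include <functional>
#include <vector>

// Inner product of two vectors.
static float dot(const std::vector<float>& a, const std::vector<float>& b)
{
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// y += a * x
static void axpy(float a, const std::vector<float>& x, std::vector<float>& y)
{
    for (size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
}

// Solve A p = d with conjugate gradient; applyA(s, q) must compute q = A * s.
void conjugateGradient(
    const std::function<void(const std::vector<float>&, std::vector<float>&)>& applyA,
    const std::vector<float>& d, std::vector<float>& p,
    int maxIters, float tolerance)
{
    size_t n = d.size();
    p.assign(n, 0.0f);                  // initial guess p = 0
    std::vector<float> r = d;           // residual r = d - A*p = d
    std::vector<float> s = r;           // search direction
    std::vector<float> q(n);            // holds A * s

    float rr = dot(r, r);
    for (int it = 0; it < maxIters && std::sqrt(rr) > tolerance; ++it) {
        applyA(s, q);
        float alpha = rr / dot(s, q);   // step length along the search direction
        axpy( alpha, s, p);             // p += alpha * s
        axpy(-alpha, q, r);             // r -= alpha * q
        float rrNew = dot(r, r);
        float beta  = rrNew / rr;
        for (size_t i = 0; i < n; ++i)  // new search direction s = r + beta * s
            s[i] = r[i] + beta * s[i];
        rr = rrNew;
    }
}

Every quantity in the loop is either a vector update or an inner product, which is exactly why this step should map well onto a handful of small GPU kernels.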



These videos are from my CPU implementation, which ran far from real-time even on small grids.

(to be continued...)