It is interesting to note that the conjugate gradient solver needs to solve a system of equations of the form A p = d, where A is a very sparse matrix computed from the boundary conditions of the underlying grid, p is the unknown pressure vector needed to make the velocity field divergence-free, and d is the current velocity divergence of each cell. The size of A grows quadratically with the number of grid cells, yet each row (and each column) of A contains at most 7 non-zero entries: one for the cell itself and one for each of its six face neighbors.
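For reference, in the usual 7-point finite-difference discretization of the pressure Poisson equation (I am assuming here that the $\Delta t / (\rho\,\Delta x^2)$ scaling is folded into $d$; the exact scaling in the actual solver may differ), the row of A for a fluid cell $(i,j,k)$ couples it only to its non-solid face neighbors:

$$
n_{i,j,k}\,p_{i,j,k} \;-\; \sum_{(i',j',k')\,\in\,\mathcal{N}_{\mathrm{fluid}}(i,j,k)} p_{i',j',k'} \;=\; d_{i,j,k},
$$

where $n_{i,j,k}$ is the number of non-solid neighbors of the cell and $\mathcal{N}_{\mathrm{fluid}}(i,j,k)$ is the set of its fluid neighbors.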
Therefore, I decided to take advantage of the sparsity of A in the conjugate gradient solver. Instead of doing large matrix-vector multiplications over rows consisting predominantly of 0s, I made the GPU-based conjugate gradient solver compute the non-zero coefficients of A directly from the grid. Thus, instead of computing A on the CPU and passing it down to the GPU, I send only the grid information (namely whether a cell is fluid or solid) to the GPU. The grid information is a much smaller data set than A, which saves on transfer times. The more important benefit of this optimization, however, is that each matrix-vector multiplication needs only 7 multiplications per row of A.
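A minimal sketch of what such a matrix-free multiplication kernel could look like is shown below. The names (applyA, cellType, FLUID, SOLID), the flat indexing scheme, and the choice to treat out-of-bounds neighbors as solid are my own illustrative assumptions, not taken from the actual implementation; the point is that each thread evaluates one row of A p using nothing but the cell-type grid.

```cuda
// Illustrative sketch: matrix-free application of A to a vector x.
// Each thread handles one grid cell, i.e. one row of A.
#define SOLID 0
#define FLUID 1

__global__ void applyA(const int* cellType, const float* x, float* y,
                       int nx, int ny, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i >= nx || j >= ny || k >= nz) return;

    int idx = (k * ny + j) * nx + i;
    if (cellType[idx] != FLUID) { y[idx] = 0.0f; return; }

    float diag = 0.0f;   // number of non-solid neighbors (diagonal coefficient)
    float off  = 0.0f;   // sum of x over fluid neighbors (off-diagonal terms)

    // The six face neighbors plus the cell itself give the 7 non-zeros per row.
    const int di[6] = { -1, 1,  0, 0,  0, 0 };
    const int dj[6] = {  0, 0, -1, 1,  0, 0 };
    const int dk[6] = {  0, 0,  0, 0, -1, 1 };
    for (int n = 0; n < 6; ++n) {
        int ii = i + di[n], jj = j + dj[n], kk = k + dk[n];
        if (ii < 0 || ii >= nx || jj < 0 || jj >= ny || kk < 0 || kk >= nz)
            continue;   // assume the domain boundary acts as a solid wall
        int nidx = (kk * ny + jj) * nx + ii;
        if (cellType[nidx] == SOLID) continue;
        diag += 1.0f;
        off  += x[nidx];
    }
    y[idx] = diag * x[idx] - off;
}
```

With this approach the solver never materializes A at all: the grid of cell types effectively plays the role of the matrix, and each row touches at most 7 entries of x.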
The table below shows a performance comparison between simulations using the CPU implementation of the conjugate gradient solver, the naive GPU implementation, and the GPU implementation optimized for sparsity. The averages were computed over 100 frames of simulation. The grid contained 2048 cells, which means that the conjugate gradient solver in the projection step solved a linear system with a 2048x2048 sparse matrix. In all cases the conjugate gradient solver was allowed a maximum of 200 iterations and given a target error tolerance of 0.01.
| | CPU solver* | GPU solver version 1** | GPU solver version 2** |
| --- | --- | --- | --- |
| Grid size | 64x32x1 (2048 cells) | 64x32x1 (2048 cells) | 64x32x1 (2048 cells) |
| Avg. time per frame | 33774.14 msec | 1852.69 msec | 249.55 msec |
| Avg. time per projection step | 33754.32 msec | 1832.78 msec*** | 230.95 msec*** |
| Speedup (projection step) | x1.0 (base) | x18.42 | x146.15 |
| Speedup (frame) | x1.0 (base) | x18.23 | x135.34 |
* Intel Core i7 (2.67 GHz)
** nVidia GeForce GTX 295
*** The reported time per projection step includes the time it takes to transfer data from the CPU to the GPU.
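For completeness, below is a hypothetical host-side sketch of the conjugate gradient loop that would drive such a matrix-free kernel, using the 200-iteration cap and the 0.01 tolerance mentioned above as stopping criteria. The use of Thrust for the vector reductions, the function and variable names, and the residual-norm convergence test are all my own assumptions for illustration, not the original code.

```cuda
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/transform.h>
#include <cmath>

// Dot product of two device vectors.
static float dot(const thrust::device_vector<float>& a,
                 const thrust::device_vector<float>& b)
{
    return thrust::inner_product(a.begin(), a.end(), b.begin(), 0.0f);
}

// Binary functor computing a*x + y.
struct axpy {
    float a;
    __host__ __device__ float operator()(float x, float y) const { return a * x + y; }
};

// Solves A p = d with plain (unpreconditioned) conjugate gradient.
// p is assumed to be zero-initialized by the caller.
void solvePressure(const int* d_cellType, thrust::device_vector<float>& p,
                   const thrust::device_vector<float>& d, int nx, int ny, int nz)
{
    thrust::device_vector<float> r = d;            // residual (p starts at 0, so r = d)
    thrust::device_vector<float> q = d;            // search direction
    thrust::device_vector<float> Aq(nx * ny * nz); // holds A*q each iteration

    dim3 block(8, 8, 8);
    dim3 grid((nx + 7) / 8, (ny + 7) / 8, (nz + 7) / 8);

    float rr = dot(r, r);
    for (int it = 0; it < 200 && std::sqrt(rr) > 0.01f; ++it) {
        applyA<<<grid, block>>>(d_cellType,
                                thrust::raw_pointer_cast(q.data()),
                                thrust::raw_pointer_cast(Aq.data()), nx, ny, nz);
        float alpha = rr / dot(q, Aq);
        thrust::transform(q.begin(),  q.end(),  p.begin(), p.begin(), axpy{ alpha});  // p += alpha*q
        thrust::transform(Aq.begin(), Aq.end(), r.begin(), r.begin(), axpy{-alpha});  // r -= alpha*A*q
        float rrNew = dot(r, r);
        float beta = rrNew / rr;
        thrust::transform(q.begin(), q.end(), r.begin(), q.begin(), axpy{beta});      // q = r + beta*q
        rr = rrNew;
    }
}
```

Every vector operation in the loop stays on the GPU; the only per-iteration traffic back to the CPU is the handful of scalars needed for alpha, beta, and the convergence test.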
Even with the sparsity optimization, the bulk of the time is spent in the projection step (92.55% of the frame time). My future goals are to move the entire simulation to the GPU and to implement volumetric rendering for better visualization of the results. However, I don't expect any significant improvement in the running time after moving the rest of the simulation to the GPU, as the CPU part of the implementation currently takes about 20 msec, which is insignificant compared to the roughly 230 msec required for the projection step.
Link to my final presentation poster: poster
The video below shows a simulation on a 32x32x5 grid (5120 cells), running at about 750 msec per frame. The smoke in the simulation was visualized using translucent cubes.