**Optimization 1: Moving the conjugate-gradient solver to the GPU**

I moved the CPU conjugate gradient solver to the GPU. This involved writing a number of small (and as yet unoptimized) linear algebra kernels: matrix-vector multiplication, vector inner product, vector L2-norm, and so on. I ran unit tests on the individual kernels to make sure they work properly, but I left benchmarking and optimization of the kernels for a later stage of the project.
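To make the role of those small kernels concrete, here is a minimal CPU-side sketch in Python of how a conjugate gradient loop can be assembled from a matrix-vector product, an inner product, and an L2-norm. It is not the GPU code from this project: the function names, the row-of-dicts sparse format, and the `laplacian_1d` test matrix are all assumptions made for illustration. The stopping rule (L2-norm of the residual below the tolerance) is also an assumption about what "error tolerance" means. The defaults mirror the iteration cap and tolerance used in the benchmark below.

```python
import math


def matvec(A, x):
    """Sparse matrix-vector product. A is a list of rows, each a dict {col: value}."""
    return [sum(v * x[j] for j, v in row.items()) for row in A]


def dot(a, b):
    """Vector inner product."""
    return sum(ai * bi for ai, bi in zip(a, b))


def l2_norm(a):
    """Vector L2-norm."""
    return math.sqrt(dot(a, a))


def conjugate_gradient(A, b, max_iters=200, tol=0.01):
    """Solve A x = b for a symmetric positive definite A.

    Returns (x, iterations used, final residual norm).
    Assumption: convergence is tested on the L2-norm of the residual.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with the initial guess x = 0
    p = list(r)  # first search direction
    rr = dot(r, r)
    iters = 0
    while iters < max_iters and l2_norm(r) > tol:
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)  # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr  # makes the next direction A-conjugate to p
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        iters += 1
    return x, iters, l2_norm(r)


def laplacian_1d(n):
    """Hypothetical test matrix: tridiagonal (-1, 2, -1), symmetric positive definite."""
    A = []
    for i in range(n):
        row = {i: 2.0}
        if i > 0:
            row[i - 1] = -1.0
        if i < n - 1:
            row[i + 1] = -1.0
        A.append(row)
    return A
```

Calling `conjugate_gradient(laplacian_1d(16), [1.0] * 16)` returns the approximate solution together with the iteration count and final residual norm. On the GPU, each of `matvec`, `dot`, and `l2_norm` would correspond to one kernel, which is why they can be unit tested independently of the solver loop.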

The table below compares simulation performance using my previous CPU implementation of the conjugate gradient solver against the naive (unoptimized) GPU implementation. The averages were computed over 100 frames of simulation. The grid contained 2048 cells, which means that the conjugate gradient solver in the projection step solved a linear system with a 2048x2048 sparse matrix (the inverse is never computed explicitly). In both cases the solver was allowed a maximum of 200 iterations, with a target error tolerance of 0.01.

| | CPU solver* | GPU solver** |
|---|---|---|
| Grid size | 64x32x1 (2048 cells) | 64x32x1 (2048 cells) |
| Avg. time per frame | 33774.14 msec | 1852.69 msec |
| Avg. time per projection step | 33754.32 msec | 1832.78 msec*** |
| Speedup (proj. step) | x1.0 (base) | x18.42 |
| Speedup (frame) | x1.0 (base) | x18.23 |

* Intel Core i7 (2.67 GHz)

** nVidia GeForce GTX 295

*** The reported time per projection step includes the time it takes to transfer data from the CPU to the GPU.

As the table shows, the projection step accounts for roughly 99% of the time per frame (99.9% with the CPU solver, 98.9% with the GPU solver). Thus, a speedup of x18.42 on the projection step yields nearly the same speedup for the overall simulation (x18.23).
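The relationship between the two speedups can be checked with a few lines of arithmetic. The sketch below uses only the timings from the table. The variable names are mine, and the Amdahl's-law estimate assumes the non-projection work takes the same time in both runs, which is only approximately true of the measurements.

```python
# Timings from the table, in msec.
cpu_frame, cpu_proj = 33774.14, 33754.32
gpu_frame, gpu_proj = 1852.69, 1832.78

# Measured speedups.
proj_speedup = cpu_proj / gpu_proj
frame_speedup = cpu_frame / gpu_frame

# Fraction of each frame spent in the projection step.
cpu_fraction = cpu_proj / cpu_frame
gpu_fraction = gpu_proj / gpu_frame

# Amdahl's law: if a fraction f of the work is sped up by a factor s,
# the overall speedup is 1 / ((1 - f) + f / s).
amdahl_estimate = 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / proj_speedup)

print(f"projection speedup: x{proj_speedup:.2f}")
print(f"frame speedup:      x{frame_speedup:.2f}")
print(f"Amdahl estimate:    x{amdahl_estimate:.2f}")
print(f"projection share:   CPU {cpu_fraction:.1%}, GPU {gpu_fraction:.1%}")
```

Because the projection step dominates the frame so heavily, the Amdahl estimate and the measured frame speedup land very close to the projection speedup itself. The same formula also shows that the remaining non-projection work becomes more significant as the solver gets faster.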

(to be continued ...)