On Falling Behind a Submission Deadline and Being Forced into GPU Heterogeneous Computing (1)

Summary

When I decided to submit a paper to a conference, there was only one week left before the submission deadline. The data I had gathered earlier could not be used directly, so I had to design a new data-collection program from scratch; fortunately, the core computational module could be reused. On the first day I was quite optimistic: two days to gather the experimental data, three days to write the paper, and the rest for revisions, which would let me submit before the deadline.

The reality was less than ideal. Going by past experience, I underestimated the runtime of the experiments. My original experiment computed four large complex matrices per round, but this time, to obtain more detailed results, the number of matrices per round jumped to 42. Worse, I planned to run 28 rounds, with every matrix doubling in size from one round to the next. By the 28th round each matrix would be about 1 GiB, and processing it requires three additional matrices of the same size (for clarity, call the former the core matrix and the latter three the auxiliary matrices). That means processing a single core matrix needs 4 GiB of space, not counting the intermediate data generated along the way, and I had to do this 42 times just for the 28th round alone...
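A quick back-of-the-envelope calculation makes the scale concrete. The numbers below are assumptions consistent with the description above (sizes doubling each round, round 28 landing at roughly 1 GiB per core matrix); the exact dtype and shapes are left out.

```python
# Rough size estimate: matrix size doubles every round, and round 28 is taken
# to be ~1 GiB per core matrix, as described in the text.
GIB = 1024 ** 3

def core_matrix_bytes(round_no: int) -> float:
    return GIB / 2 ** (28 - round_no)

for r in (20, 24, 28):
    working_set = 4 * core_matrix_bytes(r)  # core matrix + 3 auxiliary matrices
    print(f"round {r:2d}: {core_matrix_bytes(r)/GIB:7.3f} GiB per matrix, "
          f"{working_set/GIB:7.3f} GiB working set, x42 such matrices per round")
```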

There was no way to squeeze such a computational load onto a single CPU core unless I wanted to compute forever... My initial plan was multi-core parallelism: since the calculations for the individual core matrices are independent of one another, the problem parallelizes naturally. This had worked in my earlier experiments, which involved only four core matrices per round; even with the intermediate data, the footprint never exceeded my 64 GB of memory, so I never needed to consider anything else. Running four CPU cores in parallel got me results in about two hours.
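For illustration, here is a minimal sketch of that CPU-only plan, with `process_core_matrix` standing in for the real (reused) computation module:

```python
# Minimal sketch of the original CPU-only plan: each core matrix is independent,
# so a process pool can simply work through them.
import numpy as np
from multiprocessing import Pool

def process_core_matrix(seed: int) -> float:
    # Placeholder for the real per-matrix computation.
    rng = np.random.default_rng(seed)
    core = rng.standard_normal((2048, 2048)) + 1j * rng.standard_normal((2048, 2048))
    return float(np.abs(core).sum())

if __name__ == "__main__":
    with Pool(processes=4) as pool:      # 4 workers, as in the earlier experiments
        results = pool.map(process_core_matrix, range(4))
    print(results)
```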

But once the number of core matrices per round reached 42, this approach became infeasible. First, to shorten the computation time I needed more parallelism, and at a parallelism of 16 the total memory usage in the 28th round exceeded 100 GB. I had to add more memory, but in truth, even with enough memory, this computational load and volume of data exchange were too much for an ordinary desktop CPU: from the 20th round onward each round took hours, and the time kept growing exponentially... Not to mention that any unexpected crash or unsatisfactory output would force me to revisit the program design, quite possibly rerunning several rounds from scratch...

I really wish I could turn back time a few months! Then I could have designed the experimental plan properly, or at least had enough time to wait out the whole computation. But as things stood, even if I finished writing the paper in two days, I had only four or five days left. I could not afford a complex optimization design, because there was no time to debug the program over and over. The only option was a simple, brute-force solution: reuse most of the program without large changes to the overall framework, while somehow cutting down the exponentially growing computation time.

Let's look at the core of the problem: the size of each core matrix doubles every round. If the total computation time is to grow only linearly with the number of rounds, the time per round has to stay roughly constant, and the simplest way to achieve that is to double the amount of compute every time the matrix size doubles.
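A toy illustration of the argument, assuming the per-element cost stays constant: with a fixed worker pool the per-round time doubles along with the data, while a pool that doubles alongside the data keeps the per-round time flat, so the total grows linearly with the number of rounds.

```python
# Toy illustration (assumes per-element cost is constant): relative per-round time
# with a fixed pool of workers vs. a pool that doubles whenever the data doubles.
for r in range(1, 6):
    data = 2 ** r                      # relative data volume in round r
    fixed_pool = 4
    scaled_pool = 4 * 2 ** (r - 1)
    print(f"round {r}: fixed pool {data / fixed_pool:5.2f}, "
          f"scaled pool {data / scaled_pool:5.2f}  (relative round time)")
```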

Unfortunately, my CPU cannot do that; it only has 8 cores. But there are other means. Think of the famous "MythBusters" segment where a paintball gun array vividly illustrates the main difference between a CPU and a GPU. If my CPU cores can't keep up with exponentially growing demand, then offloading the matrix computations to a GPU, with its far stronger parallel throughput, is the sensible choice; GPUs are practically built for this!

Heterogeneous Computing Solutions

Combining different kinds of computing units in one computation is called heterogeneous computing, and the most typical example is the CPU + GPU pairing. The term sounds fancy, but it is used in many fields, gaming being perhaps the most familiar. Because the CPU + GPU combination is so widespread, there are many simple, practical tools for turning a CPU-only program into one that also uses the GPU. The most representative are NVIDIA's CUDA, which targets its own GPUs, and OpenCL, which supports GPUs from multiple vendors. In recent years, with the growing focus on GPU computing, AMD has also launched its ROCm stack, though its community and ecosystem still have room to mature. Much as I would love to try new technology, I had an experiment to finish first!

Both my home computer and the lab machine use NVIDIA GPUs, so my first thought was CUDA. CUDA is a very mature tool with many useful libraries built on top of it; in recent years, driven by AI workloads, plenty of user-friendly wrappers have appeared that can push computation onto the GPU with little or no change to the original code. Unfortunately, my program was not a good fit for those. I ended up using a Python library called CuPy, because it closely mirrors NumPy; you can think of it as a GPU implementation of a large part of the NumPy interface. That mattered to me: since the work is matrix computation, my program leans heavily on NumPy's routines and data types for speed, and I did not want to reshape the data extensively when offloading the computation module to the GPU, which would have meant both major code changes and extra computational overhead.
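For a feel of how close the two interfaces are, here is an illustrative sketch (not my actual computation module; `work_on` is a placeholder) where the same function body runs on either NumPy or CuPy arrays:

```python
# Illustrative only: the NumPy -> CuPy switch is mostly a matter of which module
# owns the arrays. `work_on` stands in for the reusable computation module.
import numpy as np
import cupy as cp

def work_on(xp, matrix):
    # xp is either numpy or cupy; the body is identical for both
    return xp.abs(xp.fft.fft2(matrix)) * 2.0

rng = np.random.default_rng(0)
host_matrix = (rng.standard_normal((2048, 2048))
               + 1j * rng.standard_normal((2048, 2048)))   # complex128, ~64 MiB

cpu_result = work_on(np, host_matrix)

device_matrix = cp.asarray(host_matrix)                     # host -> device copy
gpu_result = cp.asnumpy(work_on(cp, device_matrix))         # compute on GPU, copy back

print(np.allclose(cpu_result, gpu_result, atol=1e-6))
```

The only GPU-specific lines are the `cp.asarray` upload and the `cp.asnumpy` download, which is exactly why the port needed so little reshaping.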

CuPy was indeed easy to use; I had the program modified and running in about an hour. I won't go into the code details here; if I find the time I may write a separate article about them (preferably after I understand CUDA more deeply). At this point, though, I had not yet gotten the results I wanted; reality is always messier than one imagines.

The first issue is still data transfer. Modern GPUs have fast memory, but if data keeps bouncing between the CPU and the GPU, a large share of the time goes into those copies, and the compute side sits waiting for data to arrive; once a chunk is in place it is processed almost instantly, and then the result has to be shipped straight back out. In my case this is not something a pipelined structure or chunked transfers can fix. In the initial design the GPU only performed a few trivial operations on each matrix element, which you can think of as multiplying every element by 2, so a pipeline-style compute structure buys little: the computation never needed to wait for the whole matrix to arrive, and the compute itself added no noticeable delay. That is exactly the root of the problem: the compute side is fast enough, so the bottleneck lands squarely on data transfer.
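A rough way to see this on any CuPy-capable machine (the timings are machine-dependent and the kernel is deliberately trivial, so this is an illustration rather than my actual benchmark):

```python
# Rough breakdown of where the time goes for an element-wise operation:
# the host<->device copies dominate, the kernel itself is almost free.
import numpy as np
import cupy as cp

host = np.random.default_rng(0).standard_normal((8192, 8192)).astype(np.float32)

def timed(fn):
    start, end = cp.cuda.Event(), cp.cuda.Event()
    start.record()
    out = fn()
    end.record()
    end.synchronize()
    return out, cp.cuda.get_elapsed_time(start, end)   # milliseconds

device, t_up = timed(lambda: cp.asarray(host))          # host -> device
_, t_compute = timed(lambda: device * 2.0)              # trivial element-wise kernel
_, t_down = timed(lambda: cp.asnumpy(device * 2.0))     # result back to host

print(f"upload {t_up:.1f} ms, compute {t_compute:.1f} ms, download {t_down:.1f} ms")
```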

It is hard to change that without touching the original design. I had to adjust the computational structure, offloading much of the downstream data processing that used to run on the CPU onto the GPU as well. This change was very effective: the GPU now does more work per upload, and the back-and-forth traffic between CPU and GPU is greatly reduced, which pushed my GPU utilization up to around 80% at peak.
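The shape of that restructuring, sketched with placeholder steps: keep the intermediate array resident on the device and only move a small summary back across the bus.

```python
# Sketch of the restructuring: instead of copying the element-wise result back and
# finishing on the CPU, keep the intermediate array on the device and only ship a
# much smaller final quantity across the bus. The individual steps are placeholders.
import cupy as cp

def process_on_gpu(host_matrix):
    m = cp.asarray(host_matrix)          # one upload
    m = m * 2.0                          # original element-wise step
    m = cp.fft.fft2(m)                   # downstream processing, now also on the GPU
    summary = cp.abs(m).mean(axis=0)     # reduce on the device
    return cp.asnumpy(summary)           # one small download instead of a full matrix
```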

But this raised another problem. The GPU was no longer just touching each matrix element; it now had to hold the processed matrix and run further whole-matrix processing on it. As mentioned above, the matrices get very large, and at this point GPU memory started to run out: my card has only 4 GB. Some system memory can be borrowed (the shared "swap-like" memory the driver exposes), but that path runs straight into the same frequent-transfer problem... The workaround I came up with was to split each matrix, hand the GPU one piece at a time to get partial results, and merge them back on the CPU. That was an improvement, but beyond the 24th round I still hit transfer-bound bottlenecks, and the hardware limits were harsher than expected.
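A sketch of the splitting workaround, under the assumption that processing one row block does not depend on the others (true for the element-wise stage shown here):

```python
# Process the core matrix in row blocks that fit into a 4 GB card, then stitch the
# partial results back together on the host. The per-block work is a placeholder.
import numpy as np
import cupy as cp

def process_in_blocks(host_matrix, n_blocks=8):
    parts = []
    for block in np.array_split(host_matrix, n_blocks, axis=0):
        d = cp.asarray(block)            # upload one block
        d = cp.abs(d * 2.0)              # placeholder for the real per-block processing
        parts.append(cp.asnumpy(d))      # partial result back to the host
        del d
        cp.get_default_memory_pool().free_all_blocks()  # release block memory
    return np.concatenate(parts, axis=0) # merge on the CPU
```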

With time running out, I abandoned the original 28-round plan and, while preserving the verifiability and completeness of the experimental results as much as possible, cut the amount of computation down to a 25-round plan. With that, I got all the results I needed within 8 hours and finished my paper before the deadline (thanks to my advisor!).

I am still optimizing the program, in two directions. One is to spread the computational load across multiple machines with something like MPI, though I expect data-transfer bottlenecks will still need solving. The other is to do the computation on an FPGA and pass the results to the host; with sufficient resources that approach has advantages over a GPU in data transfer and latency, but the development cost is much higher. I remain hesitant about it, unless one day I find I truly need custom hardware for problems that conventional solutions cannot handle.
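For the MPI direction, the rough shape I have in mind looks something like the following mpi4py sketch (not an implementation, just the work-splitting idea; `process_core_matrix` is again a placeholder):

```python
# Rough sketch of distributing independent core matrices across MPI ranks.
# Run with e.g.: mpiexec -n 4 python this_script.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

all_jobs = list(range(42))                       # 42 core matrices in a round
my_jobs = all_jobs[rank::size]                   # round-robin split across ranks

def process_core_matrix(job_id):
    rng = np.random.default_rng(job_id)
    core = rng.standard_normal((1024, 1024))
    return float(np.abs(core * 2.0).sum())       # placeholder computation

my_results = [process_core_matrix(j) for j in my_jobs]
gathered = comm.gather(my_results, root=0)       # only small results cross the network

if rank == 0:
    print(sum(len(r) for r in gathered), "results collected")
```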
