Efficient allocation of image recognition and LLM tasks on multi-GPU system
Journal: arXiv
Published Date: Mar 19, 2025
Abstract
This work evaluates the performance of parallelized training and tuning for
image classification models and large language models. For image recognition
models, several parallelization methods are developed for different hardware
and software scenarios: simple data parallelism, distributed data parallelism,
and distributed processing. Each strategy is described in detail, highlighting
the challenges and benefits of applying it.
In addition, the impact of different dataset types on the tuning of large
language models is investigated. Experiments show to what extent the task type
affects iteration time in a multi-GPU environment, offering insights into which
data utilization strategies improve model performance. The study uses the
built-in parallelization mechanisms of PyTorch to facilitate these tasks, and it
incorporates performance profiling to evaluate the impact of memory and
communication operations during training and tuning. Test scenarios are
evaluated with numerous benchmarks on the NVIDIA H100 architecture, and their
efficiency is assessed through selected metrics.
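The data-parallel strategies named above share one core step: each device computes gradients on its own slice of the batch, and the gradients are averaged before every replica applies the same update. The following is a minimal, framework-free sketch of that step under stated assumptions: devices are simulated as list shards in plain Python, the model is a toy linear regression, and every name (`local_gradients`, `shard_batch`, `all_reduce_mean`, `data_parallel_step`) is hypothetical, not the paper's code or a PyTorch API. In PyTorch, `torch.nn.DataParallel` and `torch.nn.parallel.DistributedDataParallel` handle this gradient synchronization on real GPUs.

```python
# Hypothetical sketch: simulated synchronous data parallelism in plain Python.
# "Devices" are list shards; no GPU, PyTorch, or real communication is used.
from typing import List, Tuple

Sample = Tuple[float, float]  # one (x, y) training pair
Grad = Tuple[float, float]    # (dL/dw, dL/db)


def local_gradients(w: float, b: float, shard: List[Sample]) -> Grad:
    """MSE gradients of y_hat = w * x + b, averaged over one device's shard."""
    n = len(shard)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in shard) / n
    return gw, gb


def shard_batch(batch: List[Sample], num_devices: int) -> List[List[Sample]]:
    """Split a batch into equal contiguous shards, one per simulated device."""
    if num_devices <= 0 or len(batch) % num_devices != 0:
        raise ValueError("batch size must be a positive multiple of num_devices")
    size = len(batch) // num_devices
    return [batch[i * size:(i + 1) * size] for i in range(num_devices)]


def all_reduce_mean(grads: List[Grad]) -> Grad:
    """Average per-device gradients (the role of an all-reduce divided by world size)."""
    k = len(grads)
    return sum(g[0] for g in grads) / k, sum(g[1] for g in grads) / k


def data_parallel_step(w: float, b: float, batch: List[Sample],
                       num_devices: int, lr: float) -> Tuple[float, float]:
    """One synchronous SGD step: shard, compute local grads, average, update."""
    shards = shard_batch(batch, num_devices)
    # On real hardware these gradient computations would run concurrently.
    grads = [local_gradients(w, b, shard) for shard in shards]
    gw, gb = all_reduce_mean(grads)
    # Every replica applies the same averaged update, so replicas stay identical.
    return w - lr * gw, b - lr * gb


if __name__ == "__main__":
    data = [(float(x), 3.0 * x + 1.0) for x in range(8)]  # toy targets y = 3x + 1
    w, b = 0.0, 0.0
    for _ in range(2000):
        w, b = data_parallel_step(w, b, data, num_devices=4, lr=0.02)
    print(f"w={w:.4f}, b={b:.4f}")
```

Because the shards are equal in size, the mean of the per-shard mean gradients equals the full-batch mean gradient, so a four-device step matches a single-device step up to floating-point rounding. With unequal shards, a weighted average would be needed instead.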