Getting the Most Out of Your GPU for Machine Learning Applications

Introduction:

In the spirit of continually contributing to the open-source community, Data Machines Corp. (DMC) would like to share some of the ways you can diagnose and fix NVIDIA graphics processing unit (GPU) issues on a machine learning compute server.

DMC provides and assembles computing hardware for numerous organizations and academic institutions. One of our roles involves maintaining hardware for cloud computing environments related to Machine Learning (ML) model training utilizing NVIDIA GPUs.

A GPU is an important component for ML. Through this blog post, we would like to share some of our expertise on improving GPU performance. 

If your interest includes using NVIDIA GPUs for machine learning, please refer to our blog post to learn about a containerized toolkit for ML and Computer Vision (CV) libraries.

Graphics Card Performance Issues:

Some potential performance issues with GPUs can include the following: 

1. Limited memory bandwidth: A GPU's memory bandwidth is finite and can become a bottleneck when training large neural networks.

2. Limited computational resources: GPUs tend to have limited computational resources, which can impact performance when training large neural networks or running complex algorithms. 

3. Power consumption: GPUs can consume a lot of power, which ultimately impacts performance and increases operating costs. An inadequate or failing power supply unit (PSU) can damage a GPU and related components. In addition, voltage fluctuations are detrimental to GPUs because they can cause the GPU to overheat and potentially damage the card.

4. Overheating components: GPUs can generate a lot of heat and, therefore, are subject to overheating. Overheating causes GPU performance to degrade because the heat makes the GPU throttle itself in order to prevent damage. Throttling can cause the GPU to slow down or even shut off completely. 

To get the maximum performance out of your GPU, monitor power consumption and ensure that the GPU does not overheat. Modern ML servers have multiple GPUs to allow for more flexibility; this, however, adds more complexity to both power and heat management. 

How to Achieve Optimal GPU Performance:

To take full advantage of specialized ML hardware, it’s imperative to ensure that your graphics card is running at the optimal speed. To check the performance state of your graphics card, we suggest using various command-line interface (CLI) tools.

The NVIDIA System Management Interface (nvidia-smi) is a command-line utility that provides management and monitoring capabilities for NVIDIA GPUs and their drivers. This interface can monitor and manage NVIDIA Tesla, GRID, and GeForce devices from a local or remote host. It can also be used to query and configure the devices and to monitor their overall health and performance. To learn more about the nvidia-smi command, visit https://helpmanual.io/help/nvidia-smi/.

The nvidia-smi command allows you to access the following information about your NVIDIA GPU:

  • Status - the current status

  • Clocks - the current clock speeds

  • Memory - the current memory usage 

  • Temperature - the current temperature

  • Processes - a list of processes running on a GPU
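
As a quick illustration (a minimal sketch; the exact set of query fields can vary slightly between driver versions), running nvidia-smi by itself prints the full summary table, and the --query-gpu option pulls the same information in a script-friendly CSV format:

    # Print the full summary table (status, clocks, memory, temperature, processes)
    nvidia-smi

    # Pull the same fields in CSV form for scripting
    nvidia-smi --query-gpu=index,name,pstate,clocks.sm,memory.used,memory.total,temperature.gpu --format=csv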

The -q option allows you to query various GPU state information. The -d option, followed by CLOCK, POWER, PERFORMANCE, or MEMORY, filters the query to that category of information about your GPUs. Here are the available options:

  • nvidia-smi -q -d PERFORMANCE

  • nvidia-smi -q -d CLOCK

  • nvidia-smi -q -d POWER

  • nvidia-smi -q -d MEMORY

While your GPU is under an ML training workload, you can run these commands to check the current load of a specific GPU. Both nvidia-smi and nvidia-smi -q -d PERFORMANCE display the “P” state of a GPU; the P state indicates the GPU's current performance level. In addition, you can use DCGM Prometheus exporters to monitor GPU statistics in real time.
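
For continuous monitoring during a training run, here is a minimal sketch: nvidia-smi can repeat a query on an interval, and NVIDIA's DCGM Prometheus exporter can publish the same statistics as metrics. The exporter image name is the one NVIDIA publishes on NGC; confirm a current tag before using it:

    # Poll P state, temperature, power draw, and utilization every 5 seconds
    nvidia-smi --query-gpu=index,pstate,temperature.gpu,power.draw,utilization.gpu --format=csv -l 5

    # Run the DCGM Prometheus exporter and scrape its metrics endpoint
    # (substitute a current tag from NVIDIA's NGC catalog)
    docker run -d --rm --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>
    curl localhost:9400/metrics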

A GPU’s Performance (P) States

  • P0 and P1 are the power states when the GPU is operating at its highest performance level. 

  • P2 and P3 are the power states when the GPU is operating at a lower performance level.

  • P6 through P12 are the power states when the GPU is operating at a low power level down to an idle state. The higher the P number, the lower the performance.

Possible reasons for a GPU running at a lower P state may include: 

  • The GPU does not have enough power available to draw from the power supply.

  • The ML workload isn’t demanding enough to push the GPU to its maximum P state.

  • The GPU is heating up too quickly. Thermal throttling ensues because the fans and the ambient temperature are not cooling the GPU fast enough.

  • Often, it is also the case that the ML workload did not specify the GPU as its compute resource; this is not automatic. The workload's model and data must be explicitly placed on the GPU before it can be used (see the quick check after this list).
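
As a quick sanity check for that last point, assuming a PyTorch-based workload (other frameworks expose equivalent calls), you can confirm from the command line that the framework actually detects the GPU:

    # Should print True and the number of GPUs visible to PyTorch
    python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

    # CUDA_VISIBLE_DEVICES controls which GPUs a process may use;
    # an empty or incorrect value here keeps the workload off the GPU
    echo "$CUDA_VISIBLE_DEVICES"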

Possible Solutions for a Problematic GPU:

First Step

If a GPU is not working properly, you can try to fix the issue by following these steps: 

  • First, check to see if the card is properly seated in its PCIe slot. If it is not, try reseating it and make sure that it is firmly inserted.

  • Next, check the power supply. Remember, it needs to provide adequate power to the card. If the power supply is not working properly, you may need to replace it with a higher-wattage unit. Many new GPUs require specialized, additional power connectors that plug directly into the GPU itself.

  • Finally, check the drivers for the card. If the drivers are outdated, you can download and install the latest drivers from the manufacturer's website.
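
Before downloading a newer driver, it helps to confirm what is currently installed. A minimal sketch follows; the /proc path applies to Linux systems with the NVIDIA kernel module loaded:

    # Driver version as reported by nvidia-smi
    nvidia-smi --query-gpu=driver_version --format=csv,noheader

    # Version string of the loaded NVIDIA kernel module (Linux)
    cat /proc/driver/nvidia/version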

Once you have completed these basic steps, you can move on to more intermediate steps.

Intermediate Steps

To check the current state of your graphics card, run the nvidia-smi command detailed above. Here is a sample output from the nvidia-smi command:

As you can see, this output reveals helpful information, including power wattage, memory usage, temperature, fan speed, and GPU utilization.

Under the “Perf” column, notice that the P state displays P2 in the above output. 

Here you can see that this server has 8x RTX A6000 GPUs (numbered 0 to 7). For each GPU, you can view its fan utilization, temperature, P state, current power draw/max power, used memory/total memory, and GPU utilization. In the lower table, you can view the processes running on each GPU ID.

In the top bar, the driver version and the maximum CUDA version the driver supports are also visible. This is particularly important when using containerized applications: a container can only use CUDA features up to the driver's supported CUDA version. For example, you cannot use CUDA 11.6-specific containers when the driver supports only up to 11.4.
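
One way to sanity-check this pairing is to run nvidia-smi from inside a CUDA base container; the image tag below is only an example, so pick one that matches the CUDA version you intend to use:

    # If the container's CUDA version exceeds what the driver supports, the
    # NVIDIA container runtime reports an unsatisfied CUDA requirement and fails
    docker run --rm --gpus all nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi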

Recommended Solutions to Achieve a Higher GPU Speed:

First, check the thermal state of your GPU. It will automatically throttle if its temperature rises above a certain threshold. Improving the airflow inside your server will go a long way toward enabling its full performance if power consumption is not the issue.
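
To confirm whether thermal throttling is actually occurring, nvidia-smi can report the card's temperature thresholds and the currently active throttle reasons:

    # Current temperature plus the slowdown and shutdown thresholds
    nvidia-smi -q -d TEMPERATURE

    # Active clock throttle reasons, e.g. SW Thermal Slowdown or SW Power Cap
    nvidia-smi -q -d PERFORMANCE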

If your GPU card is running in the P2 state instead of P0, check the power supply to confirm that the power source is providing enough electricity to the power supply unit. If the power supply is insufficient, the card will not be able to run at full speed. Some servers have multiple power supplies for redundancy.

Next, check the BIOS settings to see whether the card is set to run in the P2 state instead of P0. If it is, change the setting to P0 and save the changes. Finally, reboot the machine. The card should now be running in the P0 state.

After the reboot, if the graphics card's performance is still pinned to P2, you can move on to updating the BIOS firmware. This can be accomplished by downloading the latest BIOS release from the manufacturer's website and then updating the BIOS on your machine. If this does not fix the issue, you may need to contact the manufacturer for further assistance.

Final Steps:

Contact the Manufacturer

If you continue to experience issues with your graphics card, the last course of action should be to contact the manufacturer. The manufacturer will be able to help you troubleshoot the issue and determine if there is a problem with the card.

When we contacted the vendor, they collected various system logs from us and then recommended that we adjust our fan profile.

Collect Logs

Many servers have a web interface to control system settings over the management network. This capability is provided by the Intelligent Platform Management Interface (IPMI) standard and the baseboard management controller (BMC), the dedicated controller that implements it.

IPMI and the BMC give systems administrators a set of hardware interfaces used to remotely manage servers and other systems.

To collect system logs from BMC, you will need to use the BMC tools available from the web interface of your server. Your server’s included documentation should note the setup process for both IPMI and BMC. 

Once you obtain the IPMI information, you need to connect to your server using the IPMI protocol, or access the web interface. To do this, you will need to know your server's IPMI or BMC IP address, username, and password. Once you are connected, you will be able to view and collect the necessary system logs from the IPMI interface in addition to controlling fan profiles and power controls such as starting, stopping, and rebooting servers.
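
If you prefer the command line over the web interface, the ipmitool utility can collect the same logs and sensor readings remotely. This sketch assumes ipmitool is installed and IPMI-over-LAN is enabled on the BMC; substitute your BMC's address and credentials:

    # Dump the system event log (SEL) kept by the BMC
    ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel list

    # Read all sensors, including fan speeds and temperatures
    ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sensor

    # Query the chassis power state (power on/off/cycle commands are also available)
    ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> chassis power status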

Adjust the Fan Profile

Our systems provide us with a user interface to update fan profiles, which can be adjusted from a BIOS or IPMI screen. After updating our fan profile, the system displayed a significant drop in temperature and indicated some general performance improvements. One system, however, still didn’t show signs of any improvements and, as a result, we had to resume our conversation with our vendors and manufacturers.
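
After changing a fan profile, one way to confirm the change took effect, again assuming ipmitool and a reachable BMC, is to compare fan speeds and GPU temperatures before and after:

    # Fan sensor readings as reported by the BMC
    ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sdr type Fan

    # GPU temperatures as reported by the driver
    nvidia-smi --query-gpu=index,temperature.gpu --format=csv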

What to Expect When Contacting Your Vendor: 

Many vendors provide their contact information on their product website. A majority have an online contact form, which allows you to submit a support ticket. Plan to provide service tags, order numbers, and serial numbers as part of your form submission. 

When contacting a vendor, you should expect to receive a response within a few business days. The vendor may ask for additional information, such as your contact information, description of your problem, logs, and system information. 

Once the support ticket has been submitted and the conversation initiated, expect a back-and-forth conversation through a support interface or email. Throughout this process, the vendors will collect logs and system information. Upon reviewing those materials, they can then suggest various operations in order to improve system performance.

In some cases, the vendors or manufacturers will provide executable files that you can run in order to collect additional data about your system. Be aware that these executables may be system- or platform-specific, which can complicate running them.

If at the end of this troubleshooting process your system still has issues, we recommend discussing with your manufacturer or vendor the steps you will need to take to replace your system. 

We hope this blog was useful to you and wish you luck as you begin the process of solving your GPU-related issues.
