By Vidit Mathur

Legacy applications live up to the memes and jokes made about them. Working on such applications poses a variety of challenges: updating the codebase to add new features, resolving existing issues in a service that hasn’t been actively maintained, working with scanty documentation (or none at all 😭), and onboarding new folks to take up the task.

[Meme image. Source: Gojek Tech Instagram]

That being said, these applications are a part of the ecosystem and aren’t completely disposable.

One such service we use at Gojek is called vesemir, which deploys applications to virtual machines. It runs on Python 2 and integrates with Ansible 2.1 to execute the deployments. vesemir uses gunicorn as its HTTP server.
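In rough shape, such a service is a small WSGI app served by gunicorn that hands deployment requests off to Ansible. Here is a minimal sketch (illustrative only; the module name, endpoint, and run_playbook helper are hypothetical, not vesemir's actual code):

```
# app.py -- minimal sketch of a gunicorn-served deployment API
# (hypothetical; vesemir's real code and routes differ)
import json

def run_playbook(payload):
    # stand-in for the Ansible integration, which forks worker
    # processes to execute the plays
    return "ok"

def app(environ, start_response):
    if environ["REQUEST_METHOD"] == "POST" and environ["PATH_INFO"] == "/deploy":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        payload = environ["wsgi.input"].read(size)
        body = json.dumps({"status": run_playbook(payload)})
    else:
        body = json.dumps({"error": "not found"})
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body.encode("utf-8")]
```

Served with, for example, `gunicorn app:app --workers 4`.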

But… there was a problem

The vesemir service is hosted on virtual machines, and we observed that quite often, after a service restart, the CPU usage of our VMs would spike to almost 100% without any corresponding increase in application traffic.

What caused this high CPU utilisation?

Our analysis of the processes running on the VM showed that a number of processes were spawned for each deployment request received by the vesemir service. These processes didn’t get terminated even after execution finished; their count kept growing with the number of requests served, and they continued to consume CPU resources.
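One way to confirm this kind of growth is to walk the process tree after a few requests. A sketch using psutil (psutil and the "gunicorn" name match are assumptions here; any process-tree tool such as ps shows the same thing):

```
# children.py -- sketch: count lingering children under gunicorn processes
import psutil  # third-party; `pip install psutil`

for proc in psutil.process_iter(["pid", "name"]):
    if "gunicorn" in (proc.info["name"] or ""):
        kids = proc.children(recursive=True)
        # on the affected VMs, this count kept climbing with each request
        print("pid=%d children=%d" % (proc.info["pid"], len(kids)))
```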

What caused the increase in the number of processes? 🤔

Was it a memory leak?

We ruled out memory leaks as we did not observe any spike in memory usage for any of the instances.

Was it gunicorn?

We observed that the extra processes being spawned, other than the gunicorn workers themselves, were actually child processes of these workers.

To verify whether gunicorn was in fact spawning more workers than required, we checked our application logs and observed no log lines from process IDs other than the expected workers. This helped us rule out gunicorn as the cause.
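This check is straightforward when every log line carries the emitting PID; Python's logging module supports this out of the box via the %(process)d attribute (a minimal sketch, not our production logging config):

```
import logging

# %(process)d stamps each line with the PID of the process that wrote it,
# so any PID visible in `ps` but absent from the logs is not a gunicorn worker.
logging.basicConfig(
    format="%(asctime)s pid=%(process)d %(levelname)s %(message)s",
    level=logging.INFO,
)
logging.info("handling deployment request")
```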

Was it Ansible?

A deep dive into the Ansible code revealed that it spawns a number of worker processes to execute the plays.

However, Ansible wasn’t really the culprit, since it provides the capability to clean up the spawned processes after they finish execution.
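Reduced to its essentials, the pattern looks like this: fork a worker per task, then terminate and reap anything still alive (a sketch of the general fork-and-clean-up pattern, not Ansible's actual internals):

```
import multiprocessing

def run_play(host):
    print("running play against", host)  # stand-in for real play execution

if __name__ == "__main__":
    workers = [multiprocessing.Process(target=run_play, args=(h,))
               for h in ("host-a", "host-b")]
    for w in workers:
        w.start()
    for w in workers:
        w.join(timeout=30)   # wait for normal completion
        if w.is_alive():
            w.terminate()    # cleanup path: sends SIGTERM to the child
            w.join()         # reap it so it doesn't linger
```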

Then who’s the culprit here?

A further deep dive into the Ansible library revealed that, although it attempts this cleanup and issues a SIGTERM signal to terminate the processes, the processes still don’t end up terminating.
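This failure mode is reproducible in isolation: SIGTERM is only a request, and a child that has (or has inherited) a disposition that ignores it will simply keep running. A self-contained illustration, not a reconstruction of Ansible's exact code path:

```
import os
import signal
import time

# The parent ignores SIGTERM; a fork()ed child inherits that disposition.
signal.signal(signal.SIGTERM, signal.SIG_IGN)

pid = os.fork()
if pid == 0:
    while True:          # child: survives the SIGTERM below, keeps running
        time.sleep(1)
else:
    os.kill(pid, signal.SIGTERM)        # the "cleanup" signal is dropped
    time.sleep(1)
    print(os.waitpid(pid, os.WNOHANG))  # (0, 0): the child is still alive
    os.kill(pid, signal.SIGKILL)        # only SIGKILL actually ends it
    os.waitpid(pid, 0)                  # reap the child
```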

Why did the processes not terminate?

Extensive reading about Python internals revealed that there is an inherent issue with the way subprocesses are forked in Python 2. More on this topic is discussed in this article.

The solution

To resolve the issue, we decided to upgrade to Python 3, as it fixes the inherent issue with the spawning of subprocesses in Python 2. This also required us to upgrade to Ansible 2.5, the first Ansible version that supports Python 3 for production applications.

⚠️ Ansible versions 2.2 to 2.4 support Python 3 only as a tech preview.
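As one concrete example of what the newer stack makes possible: since Python 3.4, multiprocessing supports the "spawn" start method, which starts each child as a fresh interpreter instead of a forked copy of the parent, so children don't inherit the parent's Python-level signal handlers. A sketch of the mechanism (not of what Ansible itself does):

```
import multiprocessing
import signal
import time

def work():
    while True:
        time.sleep(1)

def swallow_term(signum, frame):
    pass  # a handler like this, inherited across a plain fork, eats SIGTERM

if __name__ == "__main__":
    signal.signal(signal.SIGTERM, swallow_term)
    # Under the default "fork" method the child would inherit swallow_term
    # and survive terminate(); "spawn" (Python 3.4+) gives it a clean slate.
    multiprocessing.set_start_method("spawn")
    p = multiprocessing.Process(target=work)
    p.start()
    time.sleep(1)
    p.terminate()        # sends SIGTERM
    p.join(timeout=5)
    print("child alive after terminate():", p.is_alive())  # False
```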

Did the migration resolve the issue?

Fortunately for us, it did. We did not see any gradual increase in CPU usage after the migration, and even at peak hours the CPU usage did not shoot over 20%. This also allowed us to downsize our VMs and reduce the VM count, which resulted in decent cost savings.

The tale of how we migrated to Python 3 is a whole other story — A story which will soon be up! Stay tuned. 🖖

Click here for more stories about how we build our Gojek #SuperApp. Click here if you’d like to build it with us. 💚