Recently we’ve started transitioning from using Celery to using RQ as our task running engine. For phase one, we only migrated the jobs that aren’t directly running queries. These jobs include things like sending emails, figuring out which queries need to be refreshed, recording user events and other maintenance jobs.

After deploying this we noticed our RQ workers required significantly more CPU to perform the same amount of tasks as our Celery workers did. I thought I’d share how I profiled this and remedied the problem.

A word about the differences between Celery and RQ

Both Celery and RQ have the concept of a Worker process and both use forking to allow parallelized execution of jobs. When you launch a Celery worker it forks to several different processes, each one autonomously handles tasks. With RQ, a worker will only instantiate a single sub-process (known as a “Work Horse”) which will perform a single job and then die. When the worker fetches another job from the queue, it will fork a new Work Horse.
In RQ you can achieve the same parallelism as Celery simply by running more worker processes. However there is a subtle difference between Celery and RQ: Celery workers instantiate multiple subprocesses at launch time and reuse them for multiple tasks. With RQ, you have to fork on every job. There are pros and cons to both approaches, but these are out of scope for this post.

Benchmarking

Before I profiled anything, I wanted a benchmark of how long it takes for a worker container to process 1000 jobs. I decided to focus on the record_event job since it is a frequent, lightweight operation. I used the time command to measure performance, which required a couple changes to the source code:

  1. To measure these 1000 jobs I preferred RQ’s burst mode, which exits the process after handling the jobs.
  2. I wanted to avoid measuring other jobs that might be scheduled at the time of benchmarking. So I moved record_event to a dedicated queue called benchmark by replacing @job(‘default’) with @job(‘benchmark’) right above record_event’s declaration in tasks/general.py.

Now we can start timing things. First of all, I wanted to see how long it takes for a worker to start and stop (without any jobs) so I can subtract that time from any result later.

$ docker-compose exec worker bash -c "time ./manage.py rq workers 4 benchmark"

real	0m14.728s
user	0m6.810s
sys	0m2.750s

Worker initialization takes 14.7 seconds on my machine. I’ll remember that.

Then, I shoved 1000 dummy record_event jobs onto the benchmark queue:

$ docker-compose run --rm server manage shell <<< "from redash.tasks.general import record_event; [record_event.delay({ 'action': 'create', 'timestamp': 0, 'org_id': 1, 'user_id': 1, 'object_id': 0, 'object_type': 'dummy' }) for i in range(1000)]"

Now let’s run the same worker command as before and see how long it takes to process 1,000 jobs:

$ docker-compose exec worker bash -c "time ./manage.py rq workers 4 benchmark"

real	1m57.332s
user	1m11.320s
sys	0m27.540s

Deducting our 14.7 boot time, we see that it took 102 seconds for 4 workers to handle 1,000 jobs. Now let’s try to figure out why! For that, we’ll use py-spy while our workers are working hard.

Profiling

Let’s add another 1,000 jobs (because our last measurement consumed all of them), run the workers and simultaneously spy on them:

$ docker-compose run --rm server manage shell <<< "from redash.tasks.general import record_event; [record_event.delay({ 'action': 'create', 'timestamp': 0, 'org_id': 1, 'user_id': 1, 'object_id': 0, 'object_type': 'dummy' }) for i in range(1000)]"
$ docker-compose exec worker bash -c 'nohup ./manage.py rq workers 4 benchmark & sleep 15 && pip install py-spy && rq info -u "redis://redis:6379/0" | grep busy | awk "{print $3}" | grep -o -P "\s\d+" | head -n 1 | xargs py-spy record -d 10 --subprocesses -o profile.svg -p'
$ open -a "Google Chrome" profile.svg

I know, that last command is quite a handful. Ideally I would break that command on every ‘&&’ for readability, but the commands should run sequentially inside the same docker-compose exec worker bash session, so here’s a quick breakdown of what it does:

  1. Start 4 burst workers in the background
  2. Wait 15 seconds (roughly to get them to finish booting)
  3. Install py-spy
  4. Run rq-info and slice up the pid for one of the workers
  5. Record 10 seconds of activity in that pid and save it to profile.svg

The result was this flame graph:

Digging inside the flame graph I noticed that record_event spends a big portion of its execution time in sqlalchemy.orm.configure_mappers and it happens on every job execution. From their docs I saw this:

Initialize the inter-mapper relationships of all mappers that have been constructed thus far.

This sort of thing really doesn’t need to happen on every fork. We could initialize these relationships in the parent worker once and avoid the repeated effort in the work horses.

So I’ve added a call to sqlalchemy.org.configure_mappers() before starting the work horse and measured again:

$ docker-compose run --rm server manage shell <<< "from redash.tasks.general import record_event; [record_event.delay({ 'action': 'create', 'timestamp': 0, 'org_id': 1, 'user_id': 1, 'object_id': 0, 'object_type': 'dummy' }) for i in range(1000)]
$ docker-compose exec worker bash -c "time ./manage.py rq workers 4 benchmark"

real	0m39.348s
user	0m15.190s
sys	0m10.330s

If we deduct our 14.7 second boot time, we are down from 102 seconds to 24.6 seconds for 4 workers to handle 1000 jobs. That’s a 4x improvement! With this fix we managed to slice our RQ production resources by 4 and keep the same throughput.

My key takeaway here is that you should keep in mind that your app behaves differently when it’s a single process and when it forks. If there’s some heavy lifting to perform on every job, it’s usually a good idea to do it once before the fork. These kinds of things don’t show up during testing and development, so make sure you measure well and dig into any performance issues that come up.