Making NumPy go faster with torch.compile
Since October 17th, PyTorch 2.1 has supported compiling NumPy functions into C++ or CUDA code (via triton), even adding parallelization with OpenMP when possible.
In the PyTorch blog post, they demonstrated execution times of up to 200x faster for a single step of a simple k-means algorithm.
Having seen these results, and inspired by that post, I had to try it out and check whether I could obtain speedups of similar magnitude on my own code.
Trying it out in genetic algorithms
I’ve recently developed some interest in genetic algorithms as a form of heuristic search. They’re fun to implement and the results they yield are impressive. However, I’ve noticed that, if implemented incorrectly, execution times can skyrocket, especially with complex fitness functions, large population sizes and sophisticated chromosome structures. Naturally, this led me to wonder how much speedup could be obtained by compiling NumPy code with this new method.
Example #1
To start with something simple, I wanted to examine the speedups on my machine using a fitness function that is trivial to write yet, at this population size, computationally demanding: the sum of an array.
Assuming we have a population of 900,000 chromosomes, each consisting of 1,000 genes, and the fitness function for each chromosome is the sum of all its genes, let’s evaluate the execution times using different implementations of this fitness function.
First, here’s the fitness function in pure Python:
import random

N_POPULATION = 900_000  # 900,000 chromosomes of 1,000 genes each

population = [random.choices(range(1, 25), k=1000) for _ in range(N_POPULATION)]

def fitness_fn_python(individual):
    return sum(individual)

fitnesses = [fitness_fn_python(individual) for individual in population]
Executing this with a population of 900,000 chromosomes takes 6.91s, not including the time to generate the population. Now, let’s see how much we can accelerate this with numpy and torch.
This fitness function in NumPy is as follows:
import numpy as np

population = np.random.randint(1, 25, size=(N_POPULATION, 1000), dtype=np.int8)

def fitness_fn_numpy(individual):
    return np.sum(individual)

fitnesses = np.apply_along_axis(fitness_fn_numpy, 1, population)
It executes in 5.19s, which is about 33% faster than pure Python. This is still slow because np.apply_along_axis loops over the rows in Python, calling the fitness function once per chromosome instead of vectorizing the whole computation.
A more efficient implementation using NumPy would be:
def fitness_fn_numpy_better(population):
    return np.sum(population, axis=1)

fitnesses = fitness_fn_numpy_better(population)
This version executes in 0.51s, which is 13.5 times faster than the pure Python implementation.
Now let’s compile this function with torch.compile to see how much of a speedup we can get.
import torch

fitness_fn_compiled = torch.compile(fitness_fn_numpy_better)

fitnesses = fitness_fn_compiled(population)
The initial execution takes 0.59s due to the compilation overhead. However, on subsequent runs, it executes in 0.41s, making it 17 times faster than the pure Python version.
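As a note on methodology, these timings distinguish the first call (which pays the compilation cost) from warmed-up calls. A minimal sketch of how such a measurement could look, reusing the names from above:

import time

# First call: triggers TorchDynamo/TorchInductor compilation.
start = time.perf_counter()
fitness_fn_compiled(population)
print(f"first call (incl. compile): {time.perf_counter() - start:.2f}s")

# Subsequent calls reuse the compiled kernel.
start = time.perf_counter()
fitnesses = fitness_fn_compiled(population)
print(f"warm call: {time.perf_counter() - start:.2f}s")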
For even faster execution, we can write this function in PyTorch, compile it, and run it on cuda.
First, let’s rewrite the fitness function in PyTorch:
population = torch.randint(1, 25, size=(N_POPULATION, 1000), dtype=torch.int8)

def fitness_fn_torch(population):
    return torch.sum(population, dim=1)

fitnesses = fitness_fn_torch(population)
Uncompiled, this fitness function runs in 2.52s.
When compiled:
fitness_fn_compiled = torch.compile(fitness_fn_torch)
fitnesses = fitness_fn_compiled(population)
The first execution, including compilation time, takes 4.10s. But on subsequent executions, the time drops to 0.067s, which is 103 times faster than the pure Python implementation and about 7.6 times faster than the efficient uncompiled NumPy version.
The final performance boost comes from executing this PyTorch-compiled function on cuda:
population = population.to('cuda')

with torch.device('cuda'):
    fitnesses = fitness_fn_compiled(population)
This achieves an execution time of 0.055s, which is 125 times faster than pure Python and over 9 times faster than the efficient NumPy version, even on a modest laptop GPU.
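One caveat when timing on the GPU: CUDA kernels launch asynchronously, so naive wall-clock timings can under-report the actual execution time. A hedged sketch of a safer measurement, assuming the compiled function and tensor from above:

import time
import torch

torch.cuda.synchronize()  # ensure previous GPU work has finished
start = time.perf_counter()
with torch.device('cuda'):
    fitnesses = fitness_fn_compiled(population)
torch.cuda.synchronize()  # wait for the kernel to actually complete
print(f"cuda time: {time.perf_counter() - start:.3f}s")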
In this example, maximum performance was achieved by writing and compiling the fitness function in PyTorch, yielding faster execution times than with PyTorch-compiled NumPy code.
Example #2
This is a more ‘real-world’-ish example: a numerical optimization problem with constraints. The goal in these problems is to find the values that minimize (or maximize) a given mathematical function subject to certain constraints.
Consider the following function:
\[f(x, y, z) = x^2 + 2y^2 + 3z^2 + 4xy - 5yz\]

with the constraints:
\[\begin{align*} & x \geq 1 \\ & 3 < y \leq 7 \\ & z < 0 \\ & x + y + z \geq 3 \\ & x^2 + z^2 \leq 10 \\ \end{align*}\]

In pure Python, the fitness function is the following:
import random
import math

population = [[random.uniform(-25, 25) for _ in range(3)] for _ in range(N_POPULATION)]

def function_to_optimize(individual):
    x, y, z = individual
    return x**2 + 2*y**2 + 3*z**2 + 4*x*y - 5*y*z

def check_constraints(individual):
    x, y, z = individual
    if x < 1:
        return False
    if not (3 < y <= 7):
        return False
    if z >= 0:
        return False
    if (x + y + z) < 3:
        return False
    if (x**2 + z**2) > 10:
        return False
    return True

def fitness_fn_python(individual):
    if not check_constraints(individual):
        return math.inf
    return -function_to_optimize(individual)

fitnesses = [fitness_fn_python(individual) for individual in population]
Executing this for a population of 900,000 chromosomes takes 270ms.
Translating this function to vectorized NumPy:
import numpy as np

population = np.random.uniform(-25, 25, size=(N_POPULATION, 3))

def function_to_optimize(solutions):
    x = solutions[:, 0]
    y = solutions[:, 1]
    z = solutions[:, 2]
    return x**2 + 2*y**2 + 3*z**2 + 4*x*y - 5*y*z

# Vectorized function to check constraints
def check_constraints(solutions):
    x = solutions[:, 0]
    y = solutions[:, 1]
    z = solutions[:, 2]
    constraints = (x >= 1) & (y > 3) & (y <= 7) & (z < 0) & ((x + y + z) >= 3) & ((x**2 + z**2) <= 10)
    return constraints

# Vectorized fitness function
def fitness_function_numpy(solutions):
    # Check constraints for all solutions
    valid_constraints = check_constraints(solutions)
    # Apply constraints: invalid solutions get +inf
    fitness = np.where(valid_constraints, -function_to_optimize(solutions), np.inf)
    return fitness

fitnesses = fitness_function_numpy(population)
The vectorized NumPy version takes 48ms, which is 5.6 times faster than the pure Python version.
If we compile this NumPy function with torch.compile:
import torch

population = torch.rand(N_POPULATION, 3) * 50 - 25

check_constraints_compiled = torch.compile(check_constraints)
function_to_optimize_compiled = torch.compile(function_to_optimize)

def fitness_fn_pytorch(population):
    # Check constraints for all solutions
    valid_constraints = check_constraints_compiled(population)
    # Apply constraints: invalid solutions get +inf
    fitness = np.where(valid_constraints, -function_to_optimize_compiled(population), np.inf)
    return fitness

fitnesses = fitness_fn_pytorch(population)
This compiled function takes 6.2ms to run, excluding compile overhead. This is 44 times faster than pure Python and 7 times faster than standard NumPy.
When executing under CUDA, the function takes 3.7ms to run, which is 73 times faster than pure Python and 13 times faster than NumPy.
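The CUDA invocation isn’t shown above; one way it might look, mirroring Example #1, is to compile the whole fitness function (so the np.where call is traced to the GPU as well, rather than executed eagerly) and move the population over. fitness_fn_cuda is a hypothetical name:

fitness_fn_cuda = torch.compile(fitness_fn_pytorch)

population = population.to('cuda')
with torch.device('cuda'):
    fitnesses = fitness_fn_cuda(population)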
Example #3
Similar to the second example, we’ll try to optimize this function:
\[f(x, y, z) = \sin(x) + x^2 - 2\cos(y) + y^2 + z^2\]

subject to these constraints:
\[\begin{align*} & x \geq 0 \\ & 0 \leq y \leq \pi \\ & z \leq 5 \\ & x + z \geq 1 \\ & \cos(y) + z \geq 0.5 \\ \end{align*}\]

I won’t walk through the code for this experiment since it’s very similar to the second example (a minimal sketch follows the timings below), but the execution times are the following:
- 240ms in pure Python.
- 82ms in NumPy, 3x faster than pure Python.
- 36.8ms with torch-compiled NumPy, 6.5x faster than pure Python and over 2x faster than NumPy.
- 22.6ms with torch-compiled NumPy under CUDA, over 10x faster than pure Python and 3.6x faster than NumPy.
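For reference, a minimal sketch (not the exact code used for the timings above) of what the vectorized NumPy fitness function could look like, following the same pattern as Example #2:

def function_to_optimize(solutions):
    x, y, z = solutions[:, 0], solutions[:, 1], solutions[:, 2]
    return np.sin(x) + x**2 - 2 * np.cos(y) + y**2 + z**2

def check_constraints(solutions):
    x, y, z = solutions[:, 0], solutions[:, 1], solutions[:, 2]
    return ((x >= 0) & (y >= 0) & (y <= np.pi) & (z <= 5)
            & ((x + z) >= 1) & ((np.cos(y) + z) >= 0.5))

def fitness_function_numpy(solutions):
    # Invalid solutions get +inf, as in Example #2
    return np.where(check_constraints(solutions), -function_to_optimize(solutions), np.inf)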
Conclusion
The recent update to PyTorch is a game-changer for Python programmers who need speed. By tweaking just a few lines of code, I was able to speed up computations roughly tenfold compared to NumPy, which is already known for its efficiency. This is a big deal: complex operations that used to take a long time can now be done much more quickly.
For example, in the field of machine learning and data analysis, time is often of the essence. Being able to run algorithms faster helps researchers and developers test more ideas and refine their models without long waits.
In essence, this new PyTorch feature means you can write code that’s easy to read and maintain, like Python, yet get the kind of speed you’d expect from more complex, lower-level languages. There’s no need to interface with other programming languages or pre-compile C/C++/Rust code, which streamlines the workflow significantly.
Notes
Limitations in parallelization.
While this new compile function succeeds in detecting parallelizable pieces of code, it has been available for less than a month, and it sometimes fails to detect code segments that could be distributed across different cores, especially if they contain conditional logic not expressed in NumPy terms (note how the constraint checks above were written with np.where instead of Python conditionals).
It’s possible that this will improve in the future, but it’s not a bad idea to make the code easier for TorchInductor to analyze in order to get the best speedups possible.
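For example (a hypothetical sketch, not code from the experiments above), data-dependent Python control flow is opaque to the compiler, while the same logic written with np.where is a single traceable operation:

# Data-dependent Python branching: opaque to the compiler.
def penalize_invalid_branchy(fitness, valid):
    out = fitness.copy()
    for i in range(len(out)):
        if not valid[i]:
            out[i] = np.inf
    return out

# The same logic in NumPy terms: one vectorized, traceable op.
def penalize_invalid_vectorized(fitness, valid):
    return np.where(valid, fitness, np.inf)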
Also, it’s a good idea to inspect the generated C++/triton code by setting the TORCH_LOGS=output_code environment variable.
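For instance, assuming the benchmark lives in a script called fitness.py (a hypothetical name), the generated kernels can be dumped to the console with:

TORCH_LOGS=output_code python fitness.py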
Compile overhead matters.
When using this method, it’s worth considering whether the compile overhead pays off in your particular use case. The example functions here have short compile times; with more complex tasks, however, the time it takes to compile could overshadow the time saved during execution.
In the PyTorch docs, we have this example (init_model, generate_data and timed are helper functions defined in the torch.compile tutorial):
model = init_model()
opt = torch.optim.Adam(model.parameters())

def train(mod, data):
    opt.zero_grad(True)
    pred = mod(data[0])
    loss = torch.nn.CrossEntropyLoss()(pred, data[1])
    loss.backward()
    opt.step()

eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, eager_time = timed(lambda: train(model, inp))
    eager_times.append(eager_time)
    print(f"eager train time {i}: {eager_time}")
print("~" * 10)

model = init_model()
opt = torch.optim.Adam(model.parameters())
train_opt = torch.compile(train, mode="reduce-overhead")

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, compile_time = timed(lambda: train_opt(model, inp))
    compile_times.append(compile_time)
    print(f"compile train time {i}: {compile_time}")
print("~" * 10)

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert(speedup > 1)
print(f"(train) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x")
print("~" * 10)
The output is the following:
eager train time 0: 0.3666875
eager train time 1: 0.06788114929199218
eager train time 2: 0.06595875549316406
eager train time 3: 0.06623709106445312
eager train time 4: 0.06622108459472656
eager train time 5: 0.06640013122558594
eager train time 6: 0.06633692932128907
eager train time 7: 0.06608159637451172
eager train time 8: 0.06633171081542968
eager train time 9: 0.06623030090332031
~~~~~~~~~~
skipping cudagraphs due to input mutation
compile train time 0: 285.17234375
compile train time 1: 3.08613623046875
compile train time 2: 0.04830195236206054
compile train time 3: 0.03755414581298828
compile train time 4: 0.037427520751953124
compile train time 5: 0.03738982391357422
compile train time 6: 0.03739823913574219
compile train time 7: 0.03733385467529297
compile train time 8: 0.03738035202026367
compile train time 9: 0.03721017456054688
~~~~~~~~~~
Training steps are 1.77 times faster once compiled, but the first iteration takes almost 780 times as long as the first eager one. So compiling is only worthwhile if the number of training steps is high enough to amortize that overhead.
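A rough back-of-the-envelope estimate of the break-even point, using the medians and the first-iteration cost from the output above:

# Approximate per-step times from the run above (seconds)
eager_step = 0.0662
compiled_step = 0.0374
compile_overhead = 285.17 - eager_step  # extra cost paid on the first compiled step

# Steps needed for the per-step savings to pay back the compile overhead
break_even = compile_overhead / (eager_step - compiled_step)
print(round(break_even))  # roughly 9,900 training steps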
References
- Making Python 100x faster with less than 100 lines of Rust - Ohad Ravid
- torch.compile Tutorial - William Wen
- Compiling NumPy code into C++ or CUDA via torch.compile - PyTorch blog