The Sheer Joy of Accelerating Your Existing Python Code with Numba! - Part I

  • One of the biggest complaints people have against Python is that it’s slow. Some people almost refuse to even try Python because it's slower than X.

The Million Dollar Question:

Why is Python so popular despite being so slow?

For a long time, my unexpressed (and unjustified) answer to this question was:

Yes, Python is Slow, and I Don’t Care :)!

As I gained more experience with Python over time, I learned a great deal and was able to formulate a more coherent answer (mostly for myself):

  • Speed No Longer Matters

    • Optimize your most expensive resource:

      • Historically, the most expensive resource was CPU time.

      • Today, it is more important to get stuff done than to make it fast.

  • But what if speed really does matter?

    • Always find the bottlenecks in your code via profiling first.

"Any improvements made anywhere besides the bottleneck are an illusion." ― Gene Kim, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." ― Donald Knuth
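Profiling needn't be elaborate. A minimal sketch using the standard-library cProfile (the function slow_sum here is a made-up example) is already enough to point at the hot spot:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately wasteful: builds a fresh 100-element list on every iteration.
    total = 0
    for i in range(n):
        total += sum([i] * 100)
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(10_000)
profiler.disable()

# Print the five most expensive calls, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The report shows where the time actually goes; only then does it make sense to optimize (or compile) that specific spot.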

  • The CPU time is rarely the limiting factor

    • Even if Python is slow, it doesn’t matter. The speed of the language (or CPU time) is almost never the issue.

Google actually studied this very question and wrote a paper about it. The paper describes the design of a high-throughput system, and in the conclusion the authors write:

It may seem paradoxical to use an interpreted language in a high-throughput environment, but we have found that the CPU time is rarely the limiting factor; the expressibility of the language means that most programs are small and spend most of their time in I/O and native run-time code. Moreover, the flexibility of an interpreted implementation has been helpful, both in ease of experimentation at the linguistic level and in allowing us to explore ways to distribute the calculation across many machines.


What is Numba?

Numba is a JIT (Just-in-Time) compiler for Python that:

  • generates optimized machine code using LLVM (Low Level Virtual Machine) compiler infrastructure
  • provides a toolbox for different targets and execution models:
    • single-threaded CPU, multi-threaded CPU, GPU
    • regular functions, "universal functions" (ufuncs, i.e. array functions), etc.
  • integrates well with the Scientific Python stack
  • with a few annotations, lets array-oriented and math-heavy Python code achieve:
    • speedups from 2x (compared to basic NumPy code) to 200x (compared to pure Python)
    • performance similar to C, C++, and Fortran, without having to switch languages or Python interpreters
  • is totally awesome!

Basic Example

Lazy Compilation

  • Use @jit decorator
  • Let Numba decide when and how to optimize
In [1]:
import numpy as np
import numba
from numba import jit
In [2]:
@jit
def do_math(x, y):
    return x + y

In this mode:

  • The compilation will be deferred until the first execution
  • Numba will:
    • infer the argument types at call time
    • generate optimized code based on this information
  • Numba will also be able to compile separate specializations depending on the input types. For instance, calling do_math() with integer or complex numbers will generate different code paths:
In [3]:
do_math.inspect_types()
In [4]:
%time do_math(1, 2)
CPU times: user 128 ms, sys: 25.4 ms, total: 153 ms
Wall time: 180 ms
Out[4]:
3
In [5]:
%time do_math(1, 2)
CPU times: user 6 µs, sys: 0 ns, total: 6 µs
Wall time: 10 µs
Out[5]:
3

What is Numba doing to make code run quickly?

Numba examines the Python bytecode and translates it into an 'intermediate representation' (IR). To view this IR after running (compiling) do_math(), call the inspect_types() method.

In [6]:
do_math.inspect_types()
do_math (int64, int64)
--------------------------------------------------------------------------------
# File: <ipython-input-2-10b52aae7618>
# --- LINE 1 --- 
# label 0

@jit

# --- LINE 2 --- 

def do_math(x, y):

    # --- LINE 3 --- 
    #   x = arg(0, name=x)  :: int64
    #   y = arg(1, name=y)  :: int64
    #   $0.3 = x + y  :: int64
    #   del y
    #   del x
    #   $0.4 = cast(value=$0.3)  :: int64
    #   del $0.3
    #   return $0.4

    return x + y


================================================================================
In [7]:
%time do_math(1j, 2)
CPU times: user 32.2 ms, sys: 2.61 ms, total: 34.8 ms
Wall time: 32.9 ms
Out[7]:
(2+1j)
In [8]:
%time do_math(1j, 2)
CPU times: user 6 µs, sys: 1 µs, total: 7 µs
Wall time: 10.7 µs
Out[8]:
(2+1j)
In [9]:
do_math.inspect_types()
do_math (int64, int64)
--------------------------------------------------------------------------------
# File: <ipython-input-2-10b52aae7618>
# --- LINE 1 --- 
# label 0

@jit

# --- LINE 2 --- 

def do_math(x, y):

    # --- LINE 3 --- 
    #   x = arg(0, name=x)  :: int64
    #   y = arg(1, name=y)  :: int64
    #   $0.3 = x + y  :: int64
    #   del y
    #   del x
    #   $0.4 = cast(value=$0.3)  :: int64
    #   del $0.3
    #   return $0.4

    return x + y


================================================================================
do_math (complex128, int64)
--------------------------------------------------------------------------------
# File: <ipython-input-2-10b52aae7618>
# --- LINE 1 --- 
# label 0

@jit

# --- LINE 2 --- 

def do_math(x, y):

    # --- LINE 3 --- 
    #   x = arg(0, name=x)  :: complex128
    #   y = arg(1, name=y)  :: int64
    #   $0.3 = x + y  :: complex128
    #   del y
    #   del x
    #   $0.4 = cast(value=$0.3)  :: complex128
    #   del $0.3
    #   return $0.4

    return x + y


================================================================================

Eager compilation

  • Tell Numba the function signature you are expecting. Note that arguments are cast to the declared types: with an int32(int32, int32) signature, the floats in eager_do_math(1.0, 2.0) are truncated to integers, which is why that call below returns 3 rather than 3.0.
In [10]:
from numba import int32
In [11]:
@jit(int32(int32, int32))
def eager_do_math(x, y):
    return x + y
In [12]:
%time eager_do_math(1, 2)
CPU times: user 6 µs, sys: 0 ns, total: 6 µs
Wall time: 11 µs
Out[12]:
3
In [13]:
%time eager_do_math(1.0, 2.0)
CPU times: user 8 µs, sys: 1 µs, total: 9 µs
Wall time: 11.9 µs
Out[13]:
3
In [14]:
%time eager_do_math(1j, 2)
-------------------------------------------------------------------------
TypeError                               Traceback (most recent call last)
<timed eval> in <module>

~/opt/miniconda3/envs/py-dev/lib/python3.6/site-packages/numba/dispatcher.py in _explain_matching_error(self, *args, **kws)
    461         msg = ("No matching definition for argument type(s) %s"
    462                % ', '.join(map(str, args)))
--> 463         raise TypeError(msg)
    464 
    465     def _search_new_conversions(self, *args, **kws):

TypeError: No matching definition for argument type(s) complex128, int64

How does Numba work?

What about the actual LLVM code?

You can see the actual LLVM code generated by Numba using the inspect_llvm() method.

In [15]:
for key, value in do_math.inspect_llvm().items():
    print(key, value)
(int64, int64) ; ModuleID = 'do_math'
source_filename = "<string>"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-darwin17.7.0"

@"_ZN08NumbaEnv8__main__11do_math$241Exx" = common local_unnamed_addr global i8* null
@.const.do_math = internal constant [8 x i8] c"do_math\00"
@PyExc_RuntimeError = external global i8
@".const.missing Environment" = internal constant [20 x i8] c"missing Environment\00"

; Function Attrs: norecurse nounwind
define i32 @"_ZN8__main__11do_math$241Exx"(i64* noalias nocapture %retptr, { i8*, i32 }** noalias nocapture readnone %excinfo, i64 %arg.x, i64 %arg.y) local_unnamed_addr #0 {
entry:
  %.14 = add nsw i64 %arg.y, %arg.x
  store i64 %.14, i64* %retptr, align 8
  ret i32 0
}

define i8* @"_ZN7cpython8__main__11do_math$241Exx"(i8* nocapture readnone %py_closure, i8* %py_args, i8* nocapture readnone %py_kws) local_unnamed_addr {
entry:
  %.5 = alloca i8*, align 8
  %.6 = alloca i8*, align 8
  %.7 = call i32 (i8*, i8*, i64, i64, ...) @PyArg_UnpackTuple(i8* %py_args, i8* getelementptr inbounds ([8 x i8], [8 x i8]* @.const.do_math, i64 0, i64 0), i64 2, i64 2, i8** nonnull %.5, i8** nonnull %.6)
  %.8 = icmp eq i32 %.7, 0
  br i1 %.8, label %entry.if, label %entry.endif, !prof !0

entry.if:                                         ; preds = %entry.endif.endif.endif.endif.endif, %entry.endif.endif.endif, %entry
  ret i8* null

entry.endif:                                      ; preds = %entry
  %.12 = load i8*, i8** @"_ZN08NumbaEnv8__main__11do_math$241Exx", align 8
  %.17 = icmp eq i8* %.12, null
  br i1 %.17, label %entry.endif.if, label %entry.endif.endif, !prof !0

entry.endif.if:                                   ; preds = %entry.endif
  call void @PyErr_SetString(i8* nonnull @PyExc_RuntimeError, i8* getelementptr inbounds ([20 x i8], [20 x i8]* @".const.missing Environment", i64 0, i64 0))
  ret i8* null

entry.endif.endif:                                ; preds = %entry.endif
  %.21 = load i8*, i8** %.5, align 8
  %.23 = call i8* @PyNumber_Long(i8* %.21)
  %.24 = icmp eq i8* %.23, null
  br i1 %.24, label %entry.endif.endif.endif, label %entry.endif.endif.if, !prof !0

entry.endif.endif.if:                             ; preds = %entry.endif.endif
  %.26 = call i64 @PyLong_AsLongLong(i8* nonnull %.23)
  call void @Py_DecRef(i8* nonnull %.23)
  br label %entry.endif.endif.endif

entry.endif.endif.endif:                          ; preds = %entry.endif.endif, %entry.endif.endif.if
  %.22.0 = phi i64 [ %.26, %entry.endif.endif.if ], [ undef, %entry.endif.endif ]
  %.31 = call i8* @PyErr_Occurred()
  %.32 = icmp eq i8* %.31, null
  br i1 %.32, label %entry.endif.endif.endif.endif, label %entry.if, !prof !1

entry.endif.endif.endif.endif:                    ; preds = %entry.endif.endif.endif
  %.36 = load i8*, i8** %.6, align 8
  %.38 = call i8* @PyNumber_Long(i8* %.36)
  %.39 = icmp eq i8* %.38, null
  br i1 %.39, label %entry.endif.endif.endif.endif.endif, label %entry.endif.endif.endif.endif.if, !prof !0

entry.endif.endif.endif.endif.if:                 ; preds = %entry.endif.endif.endif.endif
  %.41 = call i64 @PyLong_AsLongLong(i8* nonnull %.38)
  call void @Py_DecRef(i8* nonnull %.38)
  br label %entry.endif.endif.endif.endif.endif

entry.endif.endif.endif.endif.endif:              ; preds = %entry.endif.endif.endif.endif, %entry.endif.endif.endif.endif.if
  %.37.0 = phi i64 [ %.41, %entry.endif.endif.endif.endif.if ], [ undef, %entry.endif.endif.endif.endif ]
  %.46 = call i8* @PyErr_Occurred()
  %.47 = icmp eq i8* %.46, null
  br i1 %.47, label %entry.endif.endif.endif.endif.endif.endif, label %entry.if, !prof !1

entry.endif.endif.endif.endif.endif.endif:        ; preds = %entry.endif.endif.endif.endif.endif
  %.14.i = add nsw i64 %.37.0, %.22.0
  %.69 = call i8* @PyLong_FromLongLong(i64 %.14.i)
  ret i8* %.69
}

declare i32 @PyArg_UnpackTuple(i8*, i8*, i64, i64, ...) local_unnamed_addr

declare void @PyErr_SetString(i8*, i8*) local_unnamed_addr

declare i8* @PyNumber_Long(i8*) local_unnamed_addr

declare i64 @PyLong_AsLongLong(i8*) local_unnamed_addr

declare void @Py_DecRef(i8*) local_unnamed_addr

declare i8* @PyErr_Occurred() local_unnamed_addr

declare i8* @PyLong_FromLongLong(i64) local_unnamed_addr

; Function Attrs: nounwind
declare void @llvm.stackprotector(i8*, i8**) #1

attributes #0 = { norecurse nounwind }
attributes #1 = { nounwind }

!0 = !{!"branch_weights", i32 1, i32 99}
!1 = !{!"branch_weights", i32 99, i32 1}

(complex128, int64) ; ModuleID = 'do_math'
source_filename = "<string>"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-darwin17.7.0"

@"_ZN08NumbaEnv8__main__11do_math$242E10complex128x" = common local_unnamed_addr global i8* null
@.const.do_math = internal constant [8 x i8] c"do_math\00"
@PyExc_RuntimeError = external global i8
@".const.missing Environment" = internal constant [20 x i8] c"missing Environment\00"
@PyExc_TypeError = external global i8
@".const.conversion to complex128 failed" = internal constant [32 x i8] c"conversion to complex128 failed\00"

; Function Attrs: norecurse nounwind
define i32 @"_ZN8__main__11do_math$242E10complex128x"({ double, double }* noalias nocapture %retptr, { i8*, i32 }** noalias nocapture readnone %excinfo, double %arg.x.0, double %arg.x.1, i64 %arg.y) local_unnamed_addr #0 {
entry:
  %.15 = sitofp i64 %arg.y to double
  %.39 = fadd double %.15, %arg.x.0
  %.42 = fadd double %arg.x.1, 0.000000e+00
  %retptr.repack3 = bitcast { double, double }* %retptr to double*
  store double %.39, double* %retptr.repack3, align 8
  %retptr.repack1 = getelementptr inbounds { double, double }, { double, double }* %retptr, i64 0, i32 1
  store double %.42, double* %retptr.repack1, align 8
  ret i32 0
}

define i8* @"_ZN7cpython8__main__11do_math$242E10complex128x"(i8* nocapture readnone %py_closure, i8* %py_args, i8* nocapture readnone %py_kws) local_unnamed_addr {
entry:
  %.5 = alloca i8*, align 8
  %.6 = alloca i8*, align 8
  %.7 = call i32 (i8*, i8*, i64, i64, ...) @PyArg_UnpackTuple(i8* %py_args, i8* getelementptr inbounds ([8 x i8], [8 x i8]* @.const.do_math, i64 0, i64 0), i64 2, i64 2, i8** nonnull %.5, i8** nonnull %.6)
  %.8 = icmp eq i32 %.7, 0
  %.22 = alloca { double, double }, align 8
  %0 = bitcast { double, double }* %.22 to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull %0, i8 0, i64 16, i32 8, i1 false)
  br i1 %.8, label %entry.if, label %entry.endif, !prof !0

entry.if:                                         ; preds = %entry.endif.endif.endif.thread, %entry.endif.endif.endif.endif.endif, %entry
  ret i8* null

entry.endif:                                      ; preds = %entry
  %.12 = load i8*, i8** @"_ZN08NumbaEnv8__main__11do_math$242E10complex128x", align 8
  %.17 = icmp eq i8* %.12, null
  br i1 %.17, label %entry.endif.if, label %entry.endif.endif, !prof !0

entry.endif.if:                                   ; preds = %entry.endif
  call void @PyErr_SetString(i8* nonnull @PyExc_RuntimeError, i8* getelementptr inbounds ([20 x i8], [20 x i8]* @".const.missing Environment", i64 0, i64 0))
  ret i8* null

entry.endif.endif:                                ; preds = %entry.endif
  %.21 = load i8*, i8** %.5, align 8
  %.24 = call i32 @numba_complex_adaptor(i8* %.21, { double, double }* nonnull %.22)
  %.25 = icmp eq i32 %.24, 0
  br i1 %.25, label %entry.endif.endif.endif.thread, label %entry.endif.endif.endif.endif, !prof !0

entry.endif.endif.endif.thread:                   ; preds = %entry.endif.endif
  call void @PyErr_SetString(i8* nonnull @PyExc_TypeError, i8* getelementptr inbounds ([32 x i8], [32 x i8]* @".const.conversion to complex128 failed", i64 0, i64 0))
  br label %entry.if

entry.endif.endif.endif.endif:                    ; preds = %entry.endif.endif
  %1 = bitcast { double, double }* %.22 to double*
  %.29.fca.0.load = load double, double* %1, align 8
  %2 = bitcast { double, double }* %.22 to i8*
  %sunkaddr = getelementptr i8, i8* %2, i64 8
  %3 = bitcast i8* %sunkaddr to double*
  %.29.fca.1.load = load double, double* %3, align 8
  %.33 = load i8*, i8** %.6, align 8
  %.35 = call i8* @PyNumber_Long(i8* %.33)
  %.36 = icmp eq i8* %.35, null
  br i1 %.36, label %entry.endif.endif.endif.endif.endif, label %entry.endif.endif.endif.endif.if, !prof !0

entry.endif.endif.endif.endif.if:                 ; preds = %entry.endif.endif.endif.endif
  %.38 = call i64 @PyLong_AsLongLong(i8* nonnull %.35)
  call void @Py_DecRef(i8* nonnull %.35)
  %phitmp = sitofp i64 %.38 to double
  br label %entry.endif.endif.endif.endif.endif

entry.endif.endif.endif.endif.endif:              ; preds = %entry.endif.endif.endif.endif, %entry.endif.endif.endif.endif.if
  %.34.0 = phi double [ %phitmp, %entry.endif.endif.endif.endif.if ], [ 0.000000e+00, %entry.endif.endif.endif.endif ]
  %.43 = call i8* @PyErr_Occurred()
  %.44 = icmp eq i8* %.43, null
  br i1 %.44, label %entry.endif.endif.endif.endif.endif.endif, label %entry.if, !prof !1

entry.endif.endif.endif.endif.endif.endif:        ; preds = %entry.endif.endif.endif.endif.endif
  %.39.i = fadd double %.29.fca.0.load, %.34.0
  %.42.i = fadd double %.29.fca.1.load, 0.000000e+00
  %.74 = call i8* @PyComplex_FromDoubles(double %.39.i, double %.42.i)
  ret i8* %.74
}

declare i32 @PyArg_UnpackTuple(i8*, i8*, i64, i64, ...) local_unnamed_addr

declare void @PyErr_SetString(i8*, i8*) local_unnamed_addr

declare i32 @numba_complex_adaptor(i8*, { double, double }*) local_unnamed_addr

declare i8* @PyNumber_Long(i8*) local_unnamed_addr

declare i64 @PyLong_AsLongLong(i8*) local_unnamed_addr

declare void @Py_DecRef(i8*) local_unnamed_addr

declare i8* @PyErr_Occurred() local_unnamed_addr

declare i8* @PyComplex_FromDoubles(double, double) local_unnamed_addr

; Function Attrs: argmemonly nounwind
declare void @llvm.memset.p0i8.i64(i8* nocapture writeonly, i8, i64, i32, i1) #1

; Function Attrs: nounwind
declare void @llvm.stackprotector(i8*, i8**) #2

attributes #0 = { norecurse nounwind }
attributes #1 = { argmemonly nounwind }
attributes #2 = { nounwind }

!0 = !{!"branch_weights", i32 1, i32 99}
!1 = !{!"branch_weights", i32 99, i32 1}

But there's a caveat....

Compilation Options

Numba has two compilation modes:

  • nopython mode (recommended and best practice): produces much faster code by compiling the function so that it runs entirely without involving the Python interpreter.

  • object mode (should be avoided): Numba falls back to this mode when nopython-mode compilation fails.

To illustrate the above, let's watch what happens when we try to do something that is natural in Python (concatenating strings), but not particularly mathematically sound:

In [16]:
%time do_math('Hello', 'World')
CPU times: user 394 ms, sys: 7.42 ms, total: 401 ms
Wall time: 406 ms
Out[16]:
'HelloWorld'
In [17]:
do_math.inspect_types()
do_math (int64, int64)
--------------------------------------------------------------------------------
# File: <ipython-input-2-10b52aae7618>
# --- LINE 1 --- 
# label 0

@jit

# --- LINE 2 --- 

def do_math(x, y):

    # --- LINE 3 --- 
    #   x = arg(0, name=x)  :: int64
    #   y = arg(1, name=y)  :: int64
    #   $0.3 = x + y  :: int64
    #   del y
    #   del x
    #   $0.4 = cast(value=$0.3)  :: int64
    #   del $0.3
    #   return $0.4

    return x + y


================================================================================
do_math (complex128, int64)
--------------------------------------------------------------------------------
# File: <ipython-input-2-10b52aae7618>
# --- LINE 1 --- 
# label 0

@jit

# --- LINE 2 --- 

def do_math(x, y):

    # --- LINE 3 --- 
    #   x = arg(0, name=x)  :: complex128
    #   y = arg(1, name=y)  :: int64
    #   $0.3 = x + y  :: complex128
    #   del y
    #   del x
    #   $0.4 = cast(value=$0.3)  :: complex128
    #   del $0.3
    #   return $0.4

    return x + y


================================================================================
do_math (unicode_type, unicode_type)
--------------------------------------------------------------------------------
# File: <ipython-input-2-10b52aae7618>
# --- LINE 1 --- 
# label 0

@jit

# --- LINE 2 --- 

def do_math(x, y):

    # --- LINE 3 --- 
    #   x = arg(0, name=x)  :: unicode_type
    #   y = arg(1, name=y)  :: unicode_type
    #   $0.3 = x + y  :: unicode_type
    #   del y
    #   del x
    #   $0.4 = cast(value=$0.3)  :: unicode_type
    #   del $0.3
    #   return $0.4

    return x + y


================================================================================

do_math (unicode_type, unicode_type) means that it has been compiled in object mode.

To prevent Numba from falling back, and instead raise an error, we need to pass nopython=True to the @jit decorator:

In [18]:
@jit
def f(x, y): # Function will not benefit from Numba jit
    a = str(x) * 10 # Numba doesn't know about str
    b = str(y)
    return a + b 
In [19]:
%timeit f(1, 2)
1.54 µs ± 17.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [20]:
@jit(nopython=True) # Force nopython mode
def f(x, y): # Function will not benefit from Numba jit
    a = str(x) * 10 # Numba doesn't know about str
    b = str(y)
    return a + b 
In [21]:
%timeit f(1, 2)
-------------------------------------------------------------------------
TypingError                             Traceback (most recent call last)
<ipython-input-21-afb15eb559ed> in <module>
----> 1 get_ipython().run_line_magic('timeit', 'f(1, 2)')

~/opt/miniconda3/envs/py-dev/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_line_magic(self, magic_name, line, _stack_depth)
   2285                 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
   2286             with self.builtin_trap:
-> 2287                 result = fn(*args,**kwargs)
   2288             return result
   2289 

<decorator-gen-61> in timeit(self, line, cell, local_ns)

~/opt/miniconda3/envs/py-dev/lib/python3.6/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188 
    189         if callable(arg):

~/opt/miniconda3/envs/py-dev/lib/python3.6/site-packages/IPython/core/magics/execution.py in timeit(self, line, cell, local_ns)
   1129             for index in range(0, 10):
   1130                 number = 10 ** index
-> 1131                 time_number = timer.timeit(number)
   1132                 if time_number >= 0.2:
   1133                     break

~/opt/miniconda3/envs/py-dev/lib/python3.6/site-packages/IPython/core/magics/execution.py in timeit(self, number)
    158         gc.disable()
    159         try:
--> 160             timing = self.inner(it, self.timer)
    161         finally:
    162             if gcold:

<magic-timeit> in inner(_it, _timer)

~/opt/miniconda3/envs/py-dev/lib/python3.6/site-packages/numba/dispatcher.py in _compile_for_args(self, *args, **kws)
    346                 e.patch_message(msg)
    347 
--> 348             error_rewrite(e, 'typing')
    349         except errors.UnsupportedError as e:
    350             # Something unsupported is present in the user code, add help info

~/opt/miniconda3/envs/py-dev/lib/python3.6/site-packages/numba/dispatcher.py in error_rewrite(e, issue_type)
    313                 raise e
    314             else:
--> 315                 reraise(type(e), e, None)
    316 
    317         argtypes = []

~/opt/miniconda3/envs/py-dev/lib/python3.6/site-packages/numba/six.py in reraise(tp, value, tb)
    656             value = tp()
    657         if value.__traceback__ is not tb:
--> 658             raise value.with_traceback(tb)
    659         raise value
    660 

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'str': cannot determine Numba type of <class 'type'>

File "<ipython-input-20-d592c0643be0>", line 3:
def f(x, y): # Function will not benefit from Numba jit
    a = str(x) * 10 # Numba doesn't know about str
    ^

This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.

To see Python/NumPy features supported by the latest release of Numba visit:
http://numba.pydata.org/numba-doc/dev/reference/pysupported.html
and
http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html

For more information about typing errors and how to debug them visit:
http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile

If you think your code should work with Numba, please report the error message
and traceback, along with a minimal reproducer at:
https://github.com/numba/numba/issues/new

Benchmarks using an all-pairs distance function

Pure Python Version

In [22]:
def allpairs_distances_python(X,Y):
    result = np.zeros( (X.shape[0], Y.shape[0]), X.dtype)
    for i in range(X.shape[0]):
        for j in range(Y.shape[0]):
            result[i,j] = np.sum( (X[i,:] - Y[j,:]) ** 2)
    return result 
In [23]:
N = 1000 
X, Y = np.random.randn(200, N), np.random.randn(400, N)
X.shape, Y.shape 
Out[23]:
((200, 1000), (400, 1000))
In [24]:
pure_python = %timeit -o allpairs_distances_python(X, Y)
600 ms ± 23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Where is the bottleneck?

In [25]:
%load_ext line_profiler
In [26]:
%lprun -f allpairs_distances_python allpairs_distances_python(X,Y)
Timer unit: 1e-06 s

Total time: 0.944429 s
File: <ipython-input-22-9f51c1c25602>
Function: allpairs_distances_python at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def allpairs_distances_python(X,Y):
     2         1         38.0     38.0      0.0      result = np.zeros( (X.shape[0], Y.shape[0]), X.dtype)
     3       201         74.0      0.4      0.0      for i in range(X.shape[0]):
     4     80200      37347.0      0.5      4.0          for j in range(Y.shape[0]):
     5     80000     906970.0     11.3     96.0              result[i,j] = np.sum( (X[i,:] - Y[j,:]) ** 2)
     6         1          0.0      0.0      0.0      return result

Numba Version

In [27]:
from numba import jit

@jit(nopython=True)
def allpairs_distances_numba(X,Y):
    result = np.zeros((X.shape[0], Y.shape[0]), X.dtype)
    for i in range(X.shape[0]):
        for j in range(Y.shape[0]):
            result[i,j] = np.sum( (X[i,:] - Y[j,:]) ** 2)
    return result 

I should emphasize that this is the exact same code, except for Numba's jit decorator. The results are pretty astonishing:

In [28]:
numba_version = %timeit -o allpairs_distances_numba(X,Y)
107 ms ± 5.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [29]:
speedup = pure_python.best / numba_version.best 
In [30]:
print(f"This is a {round(speedup, 0)}x speedup, simply by adding a Numba decorator!")
This is a 6.0x speedup, simply by adding a Numba decorator!

Loops

While NumPy has developed a strong idiom around the use of vectorized operations, Numba is perfectly happy with loops too.

For C and Fortran users, writing Python in this style will work fine with Numba (thanks to LLVM!)

In [31]:
from numba import njit # njit is an alias for @jit(nopython=True)
In [32]:
# Pure NumPy
def ident_numpy(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

# Jitted NumPy 
@njit
def ident_numpy_jit(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

# NumPy with loops
def ident_numpy_loops(x):
    r = np.empty_like(x)
    n = len(x)
    for i in range(n):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
        
    return r 

# Jitted NumPy with loops 
@njit
def ident_numpy_loops_jit(x):
    r = np.empty_like(x)
    n = len(x)
    for i in range(n):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
        
    return r 
In [33]:
x = np.arange(1.e6)
In [34]:
%timeit ident_numpy(x)
34.6 ms ± 952 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [35]:
%timeit ident_numpy_jit(x)
15.9 ms ± 588 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [36]:
%timeit ident_numpy_loops(x)
3.36 s ± 196 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [37]:
%timeit ident_numpy_loops_jit(x)
23.5 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Creating NumPy Universal Functions (Ufuncs)

  • Ufuncs are a core concept in NumPy for array-oriented computing.

    • A function with scalar inputs is broadcast across the elements of the input arrays:
In [38]:
np.add([1, 2, 3], 3)
Out[38]:
array([4, 5, 6])
In [39]:
np.add([1, 2, 3], [10, 20, 30])
Out[39]:
array([11, 22, 33])
  • Before Numba, creating fast ufuncs required writing C. This is no longer the case!

The NumPy documentation includes a tutorial on writing ufuncs in C; the example given there is a ufunc that computes

$$f(a) = \log \left(\frac{a}{1-a}\right)$$

It looks like this:

static void double_logit(char **args, npy_intp *dimensions,
                            npy_intp* steps, void* data)
{
    npy_intp i;
    npy_intp n = dimensions[0];
    char *in = args[0], *out = args[1];
    npy_intp in_step = steps[0], out_step = steps[1];

    double tmp;

    for (i = 0; i < n; i++) {
        /*BEGIN main ufunc computation*/
        tmp = *(double *)in;
        tmp /= 1-tmp;
        *((double *)out) = log(tmp);
        /*END main ufunc computation*/

        in += in_step;
        out += out_step;
    }
}

NOTE: That's just for doubles. If you want floats, long doubles, etc., you have to write all of those too, and then create a setup.py file to build and install the extension.

The @vectorize decorator

In [40]:
import math
In [41]:
def trig_func(x, y):
    return ((math.sin(x) ** 2) + (math.cos(y) ** 2))
In [42]:
trig_func(1, 1.5)
Out[42]:
0.7130771699733485

Seems reasonable. However, the math library only works on scalars. If we try to pass in arrays, we'll get an error.

In [43]:
trig_func([1, 2], [1, 2])
-------------------------------------------------------------------------
TypeError                               Traceback (most recent call last)
<ipython-input-43-d4bc937229c8> in <module>
----> 1 trig_func([1, 2], [1, 2])

<ipython-input-41-ea0655f3cd15> in trig_func(x, y)
      1 def trig_func(x, y):
----> 2     return ((math.sin(x) ** 2) + (math.cos(y) ** 2))

TypeError: must be real number, not list

Using the @vectorize decorator, we can write our function as operating over input scalars rather than arrays. Numba will generate the surrounding loop (or kernel), allowing efficient iteration over the actual inputs.

In [44]:
from numba import vectorize, float64, float32, int32, int64
In [45]:
# Define ufunc with multiple signatures
@vectorize(['int32(int32, int32)',
            'int64(int64, int64)',
            'float32(float32, float32)',
            'float64(float64, float64)'])
def trig_func(x, y):
    return ((math.sin(x) ** 2) + (math.cos(y) ** 2))

And just like that, the scalar function trig_func is now a NumPy ufunc called trig_func.

In [46]:
a = np.random.random((1000, 1000))
b = np.random.random((1000, 1000))
In [47]:
%timeit trig_func(a, b)
15.4 ms ± 414 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [48]:
def trig_func_numpy(x, y):
    return ((np.sin(x) ** 2) + (np.cos(y) ** 2))

How does it compare to just using NumPy? Let's check:

In [49]:
%timeit trig_func_numpy(a, b)
20.5 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

NOTE: NumPy ufuncs automatically get other features such as:

  • reduction
  • accumulation
  • broadcasting

By defining our ufunc using Numba, we get these additional features for free.

In [50]:
a = np.arange(12).reshape(3, 4)
a
Out[50]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [51]:
trig_func.reduce(a, axis=0)
Out[51]:
array([0, 0, 1, 0])
In [52]:
trig_func.accumulate(a)
Out[52]:
array([[0, 1, 2, 3],
       [0, 0, 1, 0],
       [0, 0, 1, 0]])
In [53]:
%load_ext watermark
In [54]:
%watermark -u -n -t -iv -g -m
numba     0.41.0
numpy     1.15.4
last updated: Sat Jan 12 2019 02:20:01 

compiler   : GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.37)
system     : Darwin
release    : 17.7.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit
Git hash   : a433b6b13f7337165b615279fa4e146f8752ca3e

To Be Continued...

The Sheer Joy of Accelerating Your Existing Python Code with Numba!

Part II : Numba for Cuda GPUs
