import numpy as np
Relevant Python modules: Numpy
Motivations
Python's built-in types do not cover the data structures normally used in scientific and technical work.
Numpy comes in to support data manipulation of n-dimensional arrays.
Extensive library of functions to reshape data.
Comprehensive collection of mathematical operations.
pip install numpy
Included by default with Anaconda
Arrays
The computer counterpart of vectors and matrices: a sequence of values of uniform type, indexed by integers.
Numpy arrays have methods, applied element-wise, and functions that take into account the position of each element in the array.
# numbers from 2 to 20 (excl.) with step 2
b = np.arange(2, 20, 2)
b
array([ 2, 4, 6, 8, 10, 12, 14, 16, 18])
# element-wise operations
2*b
array([ 4, 8, 12, 16, 20, 24, 28, 32, 36])
# cumulative step-by-step sum
b.cumsum()
array([ 2, 6, 12, 20, 30, 42, 56, 72, 90])
Lists vs. Arrays
Same indexing notation:
mylist[0]
mylistoflists[0][1]
A list is a generic sequence of heterogeneous objects.
So strings, numbers, characters, file names and URLs can all be mixed together!
An array is a sequence of strictly homogeneous objects, normally int or float.
myarray[1]
mymatrix[1][3]
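The difference between the two containers can be sketched as follows. The values stored in mylist, myarray and mymatrix are illustrative assumptions, since the slides only show how they are indexed.

```python
import numpy as np

# A list may mix types freely (illustrative values)
mylist = [1, "two", 3.0]

# An array stores a single type: mixing ints and floats upcasts to float
myarray = np.array([1, 2, 3.5])
print(myarray.dtype)            # float64

# The * operator repeats a list but multiplies an array element-wise
print(2 * [1, 2, 3])            # [1, 2, 3, 1, 2, 3]
print(2 * np.array([1, 2, 3]))  # [2 4 6]

# 2-d arrays also accept a single comma-separated index
mymatrix = np.arange(12).reshape(3, 4)
print(mymatrix[1][3], mymatrix[1, 3])  # 7 7
```

The comma form mymatrix[1, 3] is specific to arrays; a list of lists only supports the chained form.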
Notation
1-dimension: an array (a line of numbers): [1, 23, …]
2-dimensions: a matrix (a table of numbers) [ [1, 23, …], [14, 96, …], ...]
3-dimensions: a tensor (a box/cube/cuboid) of numbers: [ [ [1, 23, …], [14, 96, …], …], ...]
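As a small sketch of this notation, the attributes ndim and shape report the number of dimensions and the size along each one. The names vector, matrix and tensor and their contents are assumptions made for illustration.

```python
import numpy as np

# One array per level of nesting (illustrative values)
vector = np.array([1, 23, 5])
matrix = np.array([[1, 23, 5], [14, 96, 7]])
tensor = np.array([[[1, 23], [14, 96]], [[0, 0], [1, 1]]])

for name, arr in [("vector", vector), ("matrix", matrix), ("tensor", tensor)]:
    # ndim counts the levels of nesting, shape gives the length of each level
    print(name, arr.ndim, arr.shape)
# vector 1 (3,)
# matrix 2 (2, 3)
# tensor 3 (2, 2, 2)
```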
2-D Numpy Arrays
c = np.arange(8)
c
array([0, 1, 2, 3, 4, 5, 6, 7])
# build a 2-dimensional array from a 1-d one
d = np.array([c, c*2])
d
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 0, 2, 4, 6, 8, 10, 12, 14]])
# count elements
d.size
16
# size along each dimension
d.shape
(2, 8)
Axes
Numpy arrays can have multiple dimensions.
Unlike Pandas, not specifying the axis will apply a function to the entire array.
# the 2-d array built earlier
d
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 0, 2, 4, 6, 8, 10, 12, 14]])
# operations along columns
d.sum(axis=0)
array([ 0, 3, 6, 9, 12, 15, 18, 21])
# summing by row
d.sum(axis=1)
array([28, 56])
# sum the whole content
d.sum()
84
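The same axis convention applies to other reductions. The sketch below rebuilds d so that it runs on its own; the choice of mean() and max() and the keepdims option are additions for illustration, not part of the slides.

```python
import numpy as np

d = np.array([np.arange(8), np.arange(8) * 2])

# axis=0 collapses the rows: one result per column
print(d.mean(axis=0))   # [ 0.   1.5  3.   4.5  6.   7.5  9.  10.5]

# axis=1 collapses the columns: one result per row
print(d.max(axis=1))    # [ 7 14]

# no axis: the whole array is reduced to a single value
print(d.max())          # 14

# keepdims=True keeps the reduced axis with length 1
print(d.sum(axis=1, keepdims=True).shape)   # (2, 1)
```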
Shapes
Using information about the shape we can create/manipulate (or reshape, or transpose) Numpy variables.
# Create 2x3 Numpy array and initialise it to 0s
e = np.zeros((2, 3), dtype = 'i')
e
array([[0, 0, 0],
[0, 0, 0]], dtype=int32)
# Change the shape
e.reshape(3, 2)
array([[0, 0],
[0, 0],
[0, 0]], dtype=int32)
# Take another array to infer shape
f = np.ones_like(e, dtype = 'i')
f
array([[1, 1, 1],
[1, 1, 1]], dtype=int32)
# Transposition
f.T
array([[1, 1],
[1, 1],
[1, 1]], dtype=int32)
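A reshape must preserve the total number of elements. The sketch below, which rebuilds e so that it runs on its own, also shows the -1 placeholder that asks Numpy to infer one dimension; the names flat and pairs are assumptions for illustration.

```python
import numpy as np

e = np.zeros((2, 3), dtype='i')

# -1 lets Numpy infer the missing dimension from the total size (6)
flat = e.reshape(-1)       # shape (6,)
pairs = e.reshape(-1, 2)   # shape (3, 2)
print(flat.shape, pairs.shape)

# reshape() returns a new array object; e keeps its own shape
print(e.shape)             # (2, 3)

# 6 elements cannot fill a 4x2 array, so Numpy raises an error
try:
    e.reshape(4, 2)
except ValueError as err:
    print("cannot reshape:", err)
```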
Stacking
2-D arrays with the same dimensions can be merged
# Create an identity matrix of order 5
i = np.eye(5)
i
array([[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1.]])
# stacking combines two 2-d arrays: vertically
np.vstack((i, i))
array([[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1.],
[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1.]])
# stacking combines two 2-d arrays: horizontally
np.hstack((i, i))
array([[1., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]])
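Strictly speaking, only the dimension that is not being extended has to match. The following sketch illustrates this with an extra row and an extra column; the names row, col, tall and wide are assumptions for illustration.

```python
import numpy as np

i = np.eye(5)
row = np.ones((1, 5))   # one extra row, 5 columns
col = np.ones((5, 1))   # one extra column, 5 rows

# vstack needs the same number of columns
tall = np.vstack((i, row))
print(tall.shape)       # (6, 5)

# hstack needs the same number of rows
wide = np.hstack((i, col))
print(wide.shape)       # (5, 6)

# mismatched shapes are rejected
try:
    np.hstack((i, row))
except ValueError as err:
    print("cannot stack:", err)
```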
Detour: N-dimensional arrays
Numpy can handle multiple dimensions.
This is useful when dealing with multivariate data, from time series to documents.
# N-dimensional array
g = np.zeros((2, 3, 4))
g
array([[[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]],
[[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]]])
Two samples, each with three rows and four columns.
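Indexing and reductions extend naturally to the third dimension. The sketch below fills the array with distinct values (an assumption, so that each element can be told apart) instead of zeros.

```python
import numpy as np

# Same shape as above, but filled with 0..23 so elements are distinguishable
g = np.arange(24).reshape(2, 3, 4)

# The first index selects a sample: a 3x4 matrix
print(g[1].shape)        # (3, 4)

# Three indices select a single element: sample 1, row 2, column 3
print(g[1, 2, 3])        # 23

# Summing along axis 0 adds the two samples element by element
print(g.sum(axis=0).shape)   # (3, 4)
```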
Slicing by Boolean filters
Data can be selected according to specific conditions.
The Boolean filter itself can be represented by a Numpy array
l = np.array([np.arange(9)])
l
array([[0, 1, 2, 3, 4, 5, 6, 7, 8]])
# reshape() returns a new array and leaves l itself unchanged
l.reshape((3, 3))
l
array([[0, 1, 2, 3, 4, 5, 6, 7, 8]])
# Let's apply a high-pass filter
l[l>4]
array([5, 6, 7, 8])
# Generate a Boolean array (False=0, True=1)
(l>4).astype(int)
array([[0, 0, 0, 0, 0, 1, 1, 1, 1]])
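Several conditions can be combined into one filter. The sketch below uses a 1-d array and the name mask, both assumptions for illustration; note that the element-wise operators &, | and ~ are needed here, and each condition must be wrapped in parentheses.

```python
import numpy as np

l = np.arange(9)

# band-pass filter: both conditions must hold
mask = (l > 2) & (l < 7)
print(l[mask])                 # [3 4 5 6]

# either condition may hold
print(l[(l < 2) | (l > 6)])    # [0 1 7 8]

# ~ negates a Boolean array
print(l[~mask])                # [0 1 2 7 8]
```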
From Numpy to Pandas: where()
Even though Pandas is built on Numpy, where() has different semantics in the two libraries.
Numpy lets you specify the value to use when the condition is True and the value to use when it is False.
l = np.array([np.arange(9)])
l
array([[0, 1, 2, 3, 4, 5, 6, 7, 8]])
# reshape the array into a square matrix
l = l.reshape((3, 3))
l
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
# If True then make it double, else halve it
np.where(l<5, l*2, l/2)
array([[0. , 2. , 4. ],
[6. , 8. , 2.5],
[3. , 3.5, 4. ]])
In Pandas, where() keeps the original values where the condition is True and, by default, replaces them with NaN where it is False.
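The two behaviours can be compared side by side. In this sketch the names df, masked, filled and np_result, and the replacement value -1, are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

l = np.arange(9).reshape((3, 3))
df = pd.DataFrame(l, columns=['A', 'B', 'C'])

# Pandas: keep values where the condition holds, NaN elsewhere
masked = df.where(df < 5)
print(masked)

# Pandas: a second argument replaces NaN as the fallback value
filled = df.where(df < 5, -1)

# Numpy: both the True and the False branch are given explicitly
np_result = np.where(l < 5, l, -1)
print((filled.to_numpy() == np_result).all())   # True
```

Because NaN is a floating-point value, masked holds floats even though df held integers.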
Numpy functions on Pandas objects
import pandas as pd
# l is a Numpy matrix which readily interoperates with Pandas
my_df = pd.DataFrame(l, columns=['A', 'B', 'C'])
my_df
 | A | B | C |
---|---|---|---|
0 | 0 | 1 | 2 |
1 | 3 | 4 | 5 |
2 | 6 | 7 | 8 |
# Extract the square root of each element of column B (NB: my_df remains unchanged)
np.sqrt(my_df.B)
0 1.000000
1 2.000000
2 2.645751
Name: B, dtype: float64
Back and Forth between Pandas and Numpy
# Extract the values back into a Numpy object
m = my_df.values
m
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
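A full round trip might look like the sketch below. It uses to_numpy(), which the Pandas documentation recommends over .values; the names back and mixed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=['A', 'B', 'C'])

# Pandas -> Numpy: index and column labels are dropped
m = my_df.to_numpy()
print(m.shape)               # (3, 3)

# Numpy -> Pandas: the labels have to be supplied again
back = pd.DataFrame(m, columns=my_df.columns)
print(back.equals(my_df))    # True

# Columns of different types are upcast to one common type
mixed = pd.DataFrame({'A': [1, 2], 'B': [0.5, 1.5]})
print(mixed.to_numpy().dtype)   # float64
```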