import numpy as np
np.set_printoptions(precision=2)

Aggregation functions and statistics

An aggregation function maps a collection of values to a single value.

Examples:

Mean
Max and min
Count (distinct)
...

v = np.array([1, 2, 1, 3, 2, 4, 6, 5, 6.])
v

array([1., 2., 1., 3., 2., 4., 6., 5., 6.])

v.mean()

3.3333333333333335

np.mean(v)

3.3333333333333335

np.max(v)

6.0

np.min(v)

1.0

len(np.unique(v))

6

Aggregating higher-dimensional ndarrays

M = np.arange(15).reshape(3,5)
M

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

The matrix $M$ has two axes, so there are three ways to aggregate it:

Treat $M$ as a single collection of values.
Aggregate each row to a single value. This means we aggregate along the column axis.
Aggregate each column to a single value. This means we aggregate along the row axis.

# Sum up the whole matrix
np.sum(M)

105

# Sum up along the row axis
np.sum(M, 0)

array([15, 18, 21, 24, 27])

# Sum up along the column axis
np.sum(M, 1)

array([10, 35, 60])

Making sense of the aggregation along an axis

Shape of M: (3, 5)
Y = np.sum(M, 0): collapses the row axis (axis 0), so the resulting shape is (5,).
Y = np.sum(M, 1): collapses the column axis (axis 1), so the resulting shape is (3,).
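
One way to see which axis a reduction collapses is to pass keepdims=True, which keeps the collapsed axis in the output with length 1. The sketch below rebuilds the same M and compares the shapes; the variable names are illustrative only.

```python
import numpy as np

M = np.arange(15).reshape(3, 5)

# Without keepdims, the collapsed axis disappears from the shape.
col_sums = np.sum(M, axis=0)   # one value per column
row_sums = np.sum(M, axis=1)   # one value per row
print(col_sums.shape, row_sums.shape)         # (5,) (3,)

# With keepdims=True, the collapsed axis is kept with length 1.
col_sums_2d = np.sum(M, axis=0, keepdims=True)
row_sums_2d = np.sum(M, axis=1, keepdims=True)
print(col_sums_2d.shape, row_sums_2d.shape)   # (1, 5) (3, 1)
```

Keeping the collapsed axis is mostly useful when the result has to be combined with the original array again, for example M - row_sums_2d.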

Challenge

NumPy allows us to aggregate along multiple axes at once. Can you make sense of the following?

np.sum(M, ())

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

np.sum(M, (0, 1))

105

Let's make sure we understand the transformation:

The shape of M is (3, 5).
np.sum(M, ()) does not collapse any of the axes, so the output shape is unchanged: (3, 5).
np.sum(M, (0, 1)) collapses both axes, so the output shape is (), which means the result is a single value.
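
The same rule applies to arrays with more than two axes. As a small sketch (the 3-dimensional array T is an assumption introduced only for illustration), collapsing a subset of the axes leaves just the remaining ones in the output shape:

```python
import numpy as np

T = np.arange(24).reshape(2, 3, 4)

# Collapse the first and last axes; only the middle axis (length 3) survives.
S = np.sum(T, axis=(0, 2))
print(S.shape)   # (3,)
print(S)         # [ 60  92 124]

# Collapsing every axis gives the same result as summing the whole array.
total = np.sum(T, axis=(0, 1, 2))
print(total == np.sum(T))   # True
```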

Statistics

A statistic of a collection of samples from a population is an aggregated value that reflects the underlying nature of the whole population.

Minimum and maximum
Percentile
Median
Mean
Standard deviation and variance

Minimum and maximum

Find the extremum value in a collection.
Find the index of the extremum.

nums = np.random.randn(10)
nums

array([ 0.99, -0.46, -0.01, -0.4 , -1.49,  1.04,  1.42, -1.46,  1.53,
       -0.58])

# What is the minimum and maximum?
print("minimum =", nums.min())
print("maximum =", nums.max())

minimum = -1.4937588770497854
maximum = 1.5324831799800842

# But where are they?
i_min = nums.argmin()
i_max = nums.argmax()
print("nums[{}] = {:.2f}".format(i_min, nums[i_min]))
print("nums[{}] = {:.2f}".format(i_max, nums[i_max]))

nums[4] = -1.49
nums[8] = 1.53

Argmin and argmax for multiple axes

X = np.random.rand(3 * 5).reshape(3,5) * 100
X

array([[43.59, 18.49, 12.23, 12.86, 21.38],
       [ 2.8 , 50.55, 99.16, 50.14, 86.03],
       [13.2 , 96.27, 86.9 , 74.85, 14.93]])

# By default, argmin flattens the ndarray
np.argmin(X)

5

# We can also compute min-locations if we
# aggregate along an axis.

np.argmin(X, 0)

array([1, 0, 0, 0, 2])
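
The flat index returned by argmin can be converted back into a (row, column) position with np.unravel_index. The sketch below assumes the values of X shown above, copied by hand and rounded to two decimals, so it runs without the random generator.

```python
import numpy as np

# Values copied from the X displayed above (rounded to two decimals).
X = np.array([[43.59, 18.49, 12.23, 12.86, 21.38],
              [ 2.80, 50.55, 99.16, 50.14, 86.03],
              [13.20, 96.27, 86.90, 74.85, 14.93]])

flat_index = np.argmin(X)                         # position in the flattened array
row, col = np.unravel_index(flat_index, X.shape)  # convert to 2-D coordinates
print(flat_index, (int(row), int(col)))           # 5 (1, 0)
print(X[row, col] == X.min())                     # True

# Along axis 1 we get the column index of the minimum of each row.
row_argmins = np.argmin(X, axis=1)
print(row_argmins)                                # [2 0 0]
```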

Percentile

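Let $X$ be a collection of $N$ numbers $\{x_1, x_2, x_3, \dots, x_N\}$. Let $r$ be a percentage: $0 \leq r \leq 100$. An $r$-percentile is a value $c$ such that a fraction $\frac{r}{100}$ of the values in $X$ are less than or equal to $c$:

$$ \frac{|\{x\in X: x\leq c\}|}{N} = \frac{r}{100} $$

For a finite sample the equality may hold only approximately. By default, np.percentile interpolates linearly between neighbouring sorted values.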

X = np.random.rand(100)*100
X

array([33.85,  5.49, 58.08, 50.72, 98.05,  6.3 , 23.28, 13.23, 87.54,
       58.78, 37.76,  5.01, 25.33, 22.58, 50.86, 92.44, 13.21,  5.67,
       56.53, 65.  ,  7.59, 16.23, 48.39,  0.13, 55.64, 55.71, 53.65,
       66.37, 16.45,  5.15, 11.9 ,  2.58,  3.51, 20.06, 61.64, 78.24,
       72.67, 40.07, 40.07, 98.72, 62.44, 14.73, 61.71, 47.65, 91.91,
       54.34, 58.46, 54.7 , 93.28, 90.3 , 79.86, 40.12, 63.1 , 68.83,
        2.65, 65.2 , 89.41, 29.56, 77.39, 36.49, 65.61, 42.64, 56.73,
       44.79, 79.68, 30.84, 31.94, 54.05, 76.08, 87.85, 70.07, 96.92,
       79.32, 22.21, 93.38, 58.72, 57.23, 51.86, 74.28, 62.02, 17.2 ,
       26.72, 13.74, 69.19, 57.  , 42.37, 57.83, 24.98, 40.36, 87.15,
        3.73,  1.68, 86.67, 74.74, 13.81, 33.4 , 38.89, 52.21, 67.  ,
       14.09])

np.percentile(X, 20)

6.240106091615412

np.sum(X <= np.percentile(X, 20))

20
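
The definition can also be checked on a deterministic sample. This sketch assumes the integers 1 to 100 as data (chosen only for illustration) and measures, for a few values of r, the fraction of values that are less than or equal to the r-percentile.

```python
import numpy as np

# A deterministic sample: the integers 1, 2, ..., 100.
X = np.arange(1, 101)
N = len(X)

fractions = {}
for r in (10, 25, 50, 90):
    c = np.percentile(X, r)               # the r-percentile of X
    fractions[r] = np.sum(X <= c) / N     # fraction of values <= c
    print(r, c, fractions[r])
```

For this sample each measured fraction equals r / 100. With other data, ties or a different interpolation method may make the match only approximate.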

Did you know?

The 0-percentile is the minimum.
The 100-percentile is the maximum.

print("0-percentile = {:.2f}".format(np.percentile(X, 0)))
print("100-percentile = {:.2f}".format(np.percentile(X, 100)))

0-percentile = 0.13
100-percentile = 98.72

print("min = {:.2f}".format(X.min()))
print("max = {:.2f}".format(X.max()))

min = 0.13
max = 98.72

Median vs Mean

The median is another name for the 50-percentile.
The mean is the arithmetic average.

np.median(X)

52.93112626487185

np.percentile(X, 50)

52.93112626487185

np.mean(X)

48.05641980027419
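
The two values can differ noticeably when the data contain an extreme value. A small illustration with made-up numbers (the array data is an assumption, not part of the sample above):

```python
import numpy as np

# Four small values and one extreme value.
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median_value = np.median(data)   # the middle value of the sorted data
mean_value = np.mean(data)       # the sum divided by the count

print(median_value)   # 3.0
print(mean_value)     # 22.0
```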

Variations in Data

There is a family of statistics that measure how much the data fluctuate around the mean.

Variance: $$\mathrm{var}(X) = \frac{1}{n}\sum_i (x_i - \mathrm{mean}(X))^2$$
Standard Deviation: $$\mathrm{std}(X) = \sqrt{\mathrm{var}(X)}$$

X

array([33.85,  5.49, 58.08, 50.72, 98.05,  6.3 , 23.28, 13.23, 87.54,
       58.78, 37.76,  5.01, 25.33, 22.58, 50.86, 92.44, 13.21,  5.67,
       56.53, 65.  ,  7.59, 16.23, 48.39,  0.13, 55.64, 55.71, 53.65,
       66.37, 16.45,  5.15, 11.9 ,  2.58,  3.51, 20.06, 61.64, 78.24,
       72.67, 40.07, 40.07, 98.72, 62.44, 14.73, 61.71, 47.65, 91.91,
       54.34, 58.46, 54.7 , 93.28, 90.3 , 79.86, 40.12, 63.1 , 68.83,
        2.65, 65.2 , 89.41, 29.56, 77.39, 36.49, 65.61, 42.64, 56.73,
       44.79, 79.68, 30.84, 31.94, 54.05, 76.08, 87.85, 70.07, 96.92,
       79.32, 22.21, 93.38, 58.72, 57.23, 51.86, 74.28, 62.02, 17.2 ,
       26.72, 13.74, 69.19, 57.  , 42.37, 57.83, 24.98, 40.36, 87.15,
        3.73,  1.68, 86.67, 74.74, 13.81, 33.4 , 38.89, 52.21, 67.  ,
       14.09])

np.var(X)

784.6036669040403

np.std(X)

28.010777691882108

Note: std(X) has the same unit as X, and is thus easier to reason about.
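
As a quick check of the two formulas, the sketch below computes the variance and standard deviation by hand on a small made-up array (x is an assumption chosen so the results are round numbers) and compares them with np.var and np.std.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance: mean of the squared deviations from the mean (divides by n).
var_manual = np.sum((x - np.mean(x)) ** 2) / len(x)
# Standard deviation: square root of the variance.
std_manual = np.sqrt(var_manual)

print(var_manual, np.var(x))   # 4.0 4.0
print(std_manual, np.std(x))   # 2.0 2.0
```

Both functions divide by n by default, which matches the formula above. Passing ddof=1 would divide by n - 1 instead.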

Cumulative Sum and Product

Given a series of values $x_0, x_1, \dots, x_n$, the cumulative sum is another series defined as:

$y_0 = x_0$
$y_1 = x_0 + x_1$
$\vdots$
$y_i = \sum_{j=0}^{i} x_j$
$\vdots$

$[y_0, y_1, \dots, y_n]$ is called the cumulative sum of $[x_0, x_1, \dots, x_n]$.

The cumulative product is defined similarly.

xs = np.arange(10) + 1
xs

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

np.cumsum(xs)

array([ 1,  3,  6, 10, 15, 21, 28, 36, 45, 55])

np.cumprod(xs)

array([      1,       2,       6,      24,     120,     720,    5040,
         40320,  362880, 3628800])
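
The definitions can be written out as plain loops that keep a running total. This is a minimal sketch for comparison with np.cumsum and np.cumprod; the helper names cumulative_sum and cumulative_product are hypothetical and not part of NumPy.

```python
import numpy as np

def cumulative_sum(xs):
    """Return [y_0, y_1, ...] where y_i = x_0 + ... + x_i (illustrative helper)."""
    ys = []
    running = 0
    for x in xs:
        running += x          # add the next value to the running total
        ys.append(running)
    return np.array(ys)

def cumulative_product(xs):
    """Return [p_0, p_1, ...] where p_i = x_0 * ... * x_i (illustrative helper)."""
    ps = []
    running = 1
    for x in xs:
        running *= x          # multiply the running product by the next value
        ps.append(running)
    return np.array(ps)

xs = np.arange(10) + 1
print(np.array_equal(cumulative_sum(xs), np.cumsum(xs)))        # True
print(np.array_equal(cumulative_product(xs), np.cumprod(xs)))   # True
```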
