Aggregation

In [7]:
import numpy as np
np.set_printoptions(precision=2)

Aggregation functions and statistics

An aggregation function maps a collection of values to a single value.

Examples:

  • Mean
  • Max and min
  • Count (distinct)
  • ...
In [8]:
v = np.array([1, 2, 1, 3, 2, 4, 6, 5, 6.])
v
Out[8]:
array([1., 2., 1., 3., 2., 4., 6., 5., 6.])
In [9]:
v.mean()
Out[9]:
3.3333333333333335
In [10]:
np.mean(v)
Out[10]:
3.3333333333333335
In [11]:
np.max(v)
Out[11]:
6.0
In [12]:
np.min(v)
Out[12]:
1.0
In [13]:
len(np.unique(v))
Out[13]:
6
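
The distinct count above keeps only the number of unique values. As a small sketch using the same vector `v`, `np.unique` can also report how often each distinct value occurs; the variable names `values`, `counts` and `n_distinct` are chosen here for illustration only.

```python
import numpy as np

v = np.array([1, 2, 1, 3, 2, 4, 6, 5, 6.])

# np.unique returns the sorted distinct values; with return_counts=True
# it also returns how many times each of them occurs in v.
values, counts = np.unique(v, return_counts=True)

# The distinct count is the number of unique values,
# the same number as len(np.unique(v)).
n_distinct = values.size

print(values)
print(counts)
print(n_distinct)
```

The counts always add up to the length of the original collection, which is a quick sanity check.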

Aggregating higher dimensional ndarrays

In [14]:
M = np.arange(15).reshape(3,5)
M
Out[14]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

The matrix $M$ has two axes, so it can be aggregated in several ways:

  1. Treat $M$ as a single collection of values.
  2. Aggregate each row to a single value. This means we aggregate along the column axis (axis 1).
  3. Aggregate each column to a single value. This means we aggregate along the row axis (axis 0).
In [15]:
# Sum up the whole matrix
np.sum(M)
Out[15]:
105
In [43]:
# Sum up along the row axis
np.sum(M, 0)
Out[43]:
array([15, 18, 21, 24, 27])
In [52]:
# Sum up along the column axis
np.sum(M, 1)
Out[52]:
array([10, 35, 60])
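
To see what the axis argument does, it can help to write the same aggregations as explicit Python loops. The following is only an illustrative sketch (the names `row_sums` and `col_sums` are not part of the original notebook); in practice the vectorized `np.sum` calls are preferable.

```python
import numpy as np

M = np.arange(15).reshape(3, 5)

# Aggregating each row: one value per row.
# Iterating over M yields its rows one at a time.
row_sums = np.array([row.sum() for row in M])

# Aggregating each column: one value per column.
# Iterating over the transpose M.T yields the columns of M.
col_sums = np.array([col.sum() for col in M.T])

print(row_sums)  # matches np.sum(M, 1)
print(col_sums)  # matches np.sum(M, 0)
```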

Making sense of the aggregation along an axis

  • Shape of M: (3, 5)
  • Y = np.sum(M, 0): collapses the row axis (axis 0, length 3), so the resulting shape is (5,).
  • Y = np.sum(M, 1): collapses the column axis (axis 1, length 5), so the resulting shape is (3,).
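
One way to watch an axis collapse is the `keepdims=True` argument of `np.sum`, which keeps the aggregated axis in the result with length 1 instead of dropping it. A minimal sketch (the names `kept0` and `kept1` are illustrative):

```python
import numpy as np

M = np.arange(15).reshape(3, 5)

# With keepdims=True the collapsed axis stays in the result with length 1.
kept0 = np.sum(M, axis=0, keepdims=True)
kept1 = np.sum(M, axis=1, keepdims=True)

print(M.shape)                  # (3, 5)
print(kept0.shape)              # (1, 5)
print(kept1.shape)              # (3, 1)

# Without keepdims the collapsed axis is removed entirely.
print(np.sum(M, axis=0).shape)  # (5,)
print(np.sum(M, axis=1).shape)  # (3,)
```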

Challenge

NumPy allows us to aggregate along multiple axes. Can you make sense of the following?

In [18]:
np.sum(M, ())
Out[18]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
In [19]:
np.sum(M, (0, 1))
Out[19]:
105

Let's make sure we understand the transformation:

  • shape of M is (3,5).
  • np.sum(M, ()) does not collapse any of the axes, so the output shape is unchanged: (3,5).
  • np.sum(M, (0, 1)) collapses both axes, so the output shape is (), which is just a single value.
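
The same rule carries over to arrays with more than two axes: every axis named in the tuple disappears from the output shape. The sketch below uses a hypothetical 3-dimensional array `T`, which is not part of the original notebook.

```python
import numpy as np

# A hypothetical array with three axes of lengths 2, 3 and 4.
T = np.arange(24).reshape(2, 3, 4)

# Collapsing axes 0 and 2 leaves only axis 1 (length 3).
s02 = np.sum(T, axis=(0, 2))

# Collapsing axes 1 and 2 leaves only axis 0 (length 2).
s12 = np.sum(T, axis=(1, 2))

print(s02.shape)  # (3,)
print(s12.shape)  # (2,)
```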

Statistics

A statistic is an aggregated value, computed from a collection of samples drawn from a population, that reflects some property of the whole population.

  • Minimum and maximum
  • Percentile
  • Median
  • Mean
  • Standard deviation and variance

Minimum and maximum

  • Find the extremum value in a collection.
  • Find the index of the extremum.
In [20]:
nums = np.random.randn(10)
nums
Out[20]:
array([ 0.99, -0.46, -0.01, -0.4 , -1.49,  1.04,  1.42, -1.46,  1.53,
       -0.58])
In [27]:
# What is the minimum and maximum?
print("minimum =", nums.min())
print("maximum =", nums.max())
minimum = -1.4937588770497854
maximum = 1.5324831799800842
In [26]:
# But where are they?
i_min = nums.argmin()
i_max = nums.argmax()
print("nums[{}] = {:.2f}".format(i_min, nums[i_min]))
print("nums[{}] = {:.2f}".format(i_max, nums[i_max]))
nums[4] = -1.49
nums[8] = 1.53

Argmin and argmax for multiple axes

In [32]:
X = np.random.rand(3 * 5).reshape(3,5) * 100
X
Out[32]:
array([[43.59, 18.49, 12.23, 12.86, 21.38],
       [ 2.8 , 50.55, 99.16, 50.14, 86.03],
       [13.2 , 96.27, 86.9 , 74.85, 14.93]])
In [40]:
# By default, argmin flattens the ndarray
np.argmin(X)
Out[40]:
5
In [54]:
# We can also locate the minimum along an axis.
# With axis 0 we get, for each column, the row index of its minimum.

np.argmin(X, 0)
Out[54]:
array([1, 0, 0, 0, 2])
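
When argmin is applied to the flattened array, the result is a single flat index. `np.unravel_index` converts such a flat index back into a (row, column) pair. The sketch below uses its own random data with an arbitrary seed, so the values differ from the matrix shown above.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
X = rng.random((3, 5)) * 100

# Flat index of the minimum, as if X were one long vector.
flat = np.argmin(X)

# Convert the flat index into a (row, column) position in X.
row, col = np.unravel_index(flat, X.shape)
print(flat, (row, col), X[row, col])

# The flat index 5 in an array of shape (3, 5) is row 1, column 0.
print(np.unravel_index(5, (3, 5)))
```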

Percentile

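Let $X$ be a collection of $N$ numbers $\{x_1, x_2, x_3, \dots, x_N\}$. Let $r$ be a percentage: $0 \leq r \leq 100$. An $r$-percentile is a value $c$ such that roughly a fraction $\frac{r}{100}$ of the values in $X$ are less than or equal to $c$:

$$ \frac{|\{x\in X: x\leq c\}|}{N} \approx \frac{r}{100} $$

The match is only approximate in general, because $\frac{r}{100}\cdot N$ need not be an integer. By default, np.percentile interpolates linearly between the two nearest sorted values.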
In [68]:
X = np.random.rand(100)*100
X
Out[68]:
array([33.85,  5.49, 58.08, 50.72, 98.05,  6.3 , 23.28, 13.23, 87.54,
       58.78, 37.76,  5.01, 25.33, 22.58, 50.86, 92.44, 13.21,  5.67,
       56.53, 65.  ,  7.59, 16.23, 48.39,  0.13, 55.64, 55.71, 53.65,
       66.37, 16.45,  5.15, 11.9 ,  2.58,  3.51, 20.06, 61.64, 78.24,
       72.67, 40.07, 40.07, 98.72, 62.44, 14.73, 61.71, 47.65, 91.91,
       54.34, 58.46, 54.7 , 93.28, 90.3 , 79.86, 40.12, 63.1 , 68.83,
        2.65, 65.2 , 89.41, 29.56, 77.39, 36.49, 65.61, 42.64, 56.73,
       44.79, 79.68, 30.84, 31.94, 54.05, 76.08, 87.85, 70.07, 96.92,
       79.32, 22.21, 93.38, 58.72, 57.23, 51.86, 74.28, 62.02, 17.2 ,
       26.72, 13.74, 69.19, 57.  , 42.37, 57.83, 24.98, 40.36, 87.15,
        3.73,  1.68, 86.67, 74.74, 13.81, 33.4 , 38.89, 52.21, 67.  ,
       14.09])
In [69]:
np.percentile(X, 20)
Out[69]:
6.240106091615412
In [72]:
np.sum(X <= np.percentile(X, 20))
Out[72]:
20
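
To make the computation concrete, here is a rough sketch of a percentile based on linear interpolation between sorted values. The helper `percentile_linear` is hypothetical and written only for illustration; it is meant to agree with the default behaviour of `np.percentile` up to floating-point rounding, not to reproduce NumPy's implementation.

```python
import numpy as np

def percentile_linear(x, r):
    """Hypothetical helper: r-th percentile by linear interpolation."""
    s = np.sort(np.asarray(x, dtype=float))
    # Fractional position of the percentile within the sorted values.
    pos = r / 100 * (s.size - 1)
    lo = int(np.floor(pos))
    hi = min(lo + 1, s.size - 1)
    frac = pos - lo
    # Interpolate between the two neighbouring sorted values.
    return s[lo] + frac * (s[hi] - s[lo])

rng = np.random.default_rng(0)   # arbitrary seed
X = rng.random(100) * 100

for r in (0, 20, 50, 100):
    print(r, percentile_linear(X, r), np.percentile(X, r))
```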

Did you know?

  1. The 0-percentile is the minimum.
  2. The 100-percentile is the maximum.
In [75]:
print("0-percentile = {:.2f}".format(np.percentile(X, 0)))
print("100-percentile = {:.2f}".format(np.percentile(X, 100)))
0-percentile = 0.13
100-percentile = 98.72
In [80]:
print("min = {:.2f}".format(X.min()))
print("max = {:.2f}".format(X.max()))
min = 0.13
max = 98.72
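
`np.percentile` also accepts a sequence of percentages and returns one value per entry, which is convenient for computing quartiles in a single call. A small sketch with its own random data (arbitrary seed; the names `q1`, `q2`, `q3` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
X = rng.random(100) * 100

# Passing a list of percentages returns an array with one percentile each.
q1, q2, q3 = np.percentile(X, [25, 50, 75])

print("25-percentile = {:.2f}".format(q1))
print("50-percentile = {:.2f}".format(q2))
print("75-percentile = {:.2f}".format(q3))
```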

Median vs Mean

  • The median is another name for the 50-percentile.
  • The mean is the arithmetic average: the sum of the values divided by their count.
In [83]:
np.median(X)
Out[83]:
52.93112626487185
In [86]:
np.percentile(X, 50)
Out[86]:
52.93112626487185
In [84]:
np.mean(X)
Out[84]:
48.05641980027419
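
The median and the mean can differ noticeably, as they do above. One reason is that the mean is sensitive to extreme values while the median is not. The toy data below is made up purely to illustrate this.

```python
import numpy as np

# Made-up data: identical except for the last value.
data = np.array([1., 2., 3., 4., 5.])
with_outlier = np.array([1., 2., 3., 4., 500.])

print(np.median(data), np.mean(data))                  # 3.0 3.0
print(np.median(with_outlier), np.mean(with_outlier))  # 3.0 102.0
```

A single extreme value moves the mean from 3 to 102, while the median stays at 3.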

Variations in Data

Several statistics measure how much the data fluctuate around the mean. Here $n$ is the number of values in $X$.

  • Variance: $$\mathrm{var}(X) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mathrm{mean}(X))^2$$
  • Standard Deviation: $$\mathrm{std}(X) = \sqrt{\mathrm{var}(X)}$$
In [87]:
X
Out[87]:
array([33.85,  5.49, 58.08, 50.72, 98.05,  6.3 , 23.28, 13.23, 87.54,
       58.78, 37.76,  5.01, 25.33, 22.58, 50.86, 92.44, 13.21,  5.67,
       56.53, 65.  ,  7.59, 16.23, 48.39,  0.13, 55.64, 55.71, 53.65,
       66.37, 16.45,  5.15, 11.9 ,  2.58,  3.51, 20.06, 61.64, 78.24,
       72.67, 40.07, 40.07, 98.72, 62.44, 14.73, 61.71, 47.65, 91.91,
       54.34, 58.46, 54.7 , 93.28, 90.3 , 79.86, 40.12, 63.1 , 68.83,
        2.65, 65.2 , 89.41, 29.56, 77.39, 36.49, 65.61, 42.64, 56.73,
       44.79, 79.68, 30.84, 31.94, 54.05, 76.08, 87.85, 70.07, 96.92,
       79.32, 22.21, 93.38, 58.72, 57.23, 51.86, 74.28, 62.02, 17.2 ,
       26.72, 13.74, 69.19, 57.  , 42.37, 57.83, 24.98, 40.36, 87.15,
        3.73,  1.68, 86.67, 74.74, 13.81, 33.4 , 38.89, 52.21, 67.  ,
       14.09])
In [88]:
np.var(X)
Out[88]:
784.6036669040403
In [89]:
np.std(X)
Out[89]:
28.010777691882108

Note: std(X) has the same unit as X, so it is easier to interpret than var(X), whose unit is the square of the unit of X.
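
The formulas can be checked directly against `np.var` and `np.std`. The sketch below uses a small made-up array `x` so the arithmetic is easy to follow. By default both functions divide by $n$; passing `ddof=1` divides by $n - 1$ instead, which gives the sample variance.

```python
import numpy as np

# Made-up data with mean 5.
x = np.array([2., 4., 4., 4., 5., 5., 7., 9.])

# Variance: the mean of the squared deviations from the mean.
var_manual = np.mean((x - x.mean()) ** 2)
# Standard deviation: the square root of the variance.
std_manual = np.sqrt(var_manual)

print(var_manual, np.var(x))   # 4.0 4.0
print(std_manual, np.std(x))   # 2.0 2.0

# ddof=1 divides by n - 1 instead of n.
print(np.var(x, ddof=1))
```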

Cumulative Sum and Product

Given a series of values $x_0, x_1, \dots, x_n$, the cumulative sum is another series defined as:

  • $y_0 = x_0$
  • $y_1 = x_0 + x_1$
  • $\vdots$
  • $y_i = \sum_{k=0}^{i} x_k$
  • $\vdots$

$[y_0, y_1, \dots, y_n]$ is called the cumulative sum of $[x_0, x_1, \dots, x_n]$.

The cumulative product is defined similarly, with products in place of sums: $y_i = \prod_{k=0}^{i} x_k$.

In [93]:
xs = np.arange(10) + 1
xs
Out[93]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
In [94]:
np.cumsum(xs)
Out[94]:
array([ 1,  3,  6, 10, 15, 21, 28, 36, 45, 55])
In [95]:
np.cumprod(xs)
Out[95]:
array([      1,       2,       6,      24,     120,     720,    5040,
         40320,  362880, 3628800])
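
The definition of the cumulative sum can be spelled out as a plain loop that keeps a running total. This is only an illustrative sketch (the names `running` and `ys` are not from the original notebook); `np.cumsum` gives the same values without the loop.

```python
import numpy as np

xs = np.arange(10) + 1

# Keep a running total and record it after each element.
running = 0
ys = []
for x in xs:
    running += x
    ys.append(int(running))

print(ys)
print(np.cumsum(xs).tolist())

# The last cumulative sum is the sum of all values,
# and the last cumulative product is the product of all values.
print(np.cumsum(xs)[-1], xs.sum())
print(np.cumprod(xs)[-1], xs.prod())
```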