import numpy as np
np.set_printoptions(precision=2)
An aggregation function maps a collection of values to a single value.
Examples:
v = np.array([1, 2, 1, 3, 2, 4, 6, 5, 6.])
v
v.mean()
np.mean(v)
np.max(v)
np.min(v)
len(np.unique(v))
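Any function that reduces an array to one number is an aggregation in this sense, including ones we write ourselves. As a small sketch, the helper value_range below is a hypothetical name (not a NumPy function) that combines two built-in aggregations:

```python
import numpy as np

def value_range(values):
    """Hypothetical helper: spread between the largest and smallest value."""
    return np.max(values) - np.min(values)

v = np.array([1, 2, 1, 3, 2, 4, 6, 5, 6.])
r = value_range(v)   # a single number, just like mean, max, or min
print(r)
```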
M = np.arange(15).reshape(3,5)
M
The matrix $M$ has two axes. This creates multiple possible aggregations.
# Sum up the whole matrix
np.sum(M)
# Sum up along the row axis
np.sum(M, 0)
# Sum up along the column axis
np.sum(M, 1)
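Summing along axis 0 adds the rows to each other element by element, and summing along axis 1 does the same with the columns. The sketch below spells this out with explicit loops over the same M, purely for illustration; np.sum remains the idiomatic way to do it. The names row_total and col_total are introduced here.

```python
import numpy as np

M = np.arange(15).reshape(3, 5)

# Axis 0: accumulate the three rows into one row of length 5.
row_total = np.zeros(5, dtype=M.dtype)
for row in M:
    row_total += row

# Axis 1: accumulate the five columns into one column of length 3.
col_total = np.zeros(3, dtype=M.dtype)
for col in M.T:
    col_total += col

print(np.array_equal(row_total, np.sum(M, 0)))
print(np.array_equal(col_total, np.sum(M, 1)))
```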
The shape of M is (3, 5).
Y = np.sum(M, 0): collapses the row dimension, so the resulting shape is (5,).
Y = np.sum(M, 1): collapses the column dimension, so the output shape is (3,).
NumPy allows us to aggregate along multiple axes. Can you make sense of the following?
np.sum(M, ())
np.sum(M, (0, 1))
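One way to check how each axis argument changes the result is to print the shape of every sum. A minimal sketch, rebuilding the same M; the keepdims line at the end is an extra illustration of an option that np.sum provides:

```python
import numpy as np

M = np.arange(15).reshape(3, 5)

for axis in [0, 1, (), (0, 1)]:
    result = np.sum(M, axis)
    print("axis =", axis, "-> shape", np.shape(result))

# keepdims=True keeps each collapsed axis with length 1 instead of dropping it.
kept = np.sum(M, 0, keepdims=True)
print(kept.shape)
```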
Let's make sure we understand the transformation. The shape of M is (3,5).
np.sum(M, ()) does not collapse any of the axes, so the output shape is unchanged: (3,5).
np.sum(M, (0, 1)) collapses both axes, so the output shape has zero axes (), which is just a single value.
A statistic of a collection of samples from a population is an aggregated value that reflects the underlying nature of the whole population.
nums = np.random.randn(10)
nums
# What is the minimum and maximum?
print("minimum =", nums.min())
print("maximum =", nums.max())
# But where are they?
i_min = nums.argmin()
i_max = nums.argmax()
print("nums[{}] = {:.2f}".format(i_min, nums[i_min]))
print("nums[{}] = {:.2f}".format(i_max, nums[i_max]))
X = np.random.rand(3 * 5).reshape(3,5) * 100
X
# By default, argmin flattens the ndarray
np.argmin(X)
# We can also compute min-locations if we
# aggregate along an axis.
np.argmin(X, 0)
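Because argmin without an axis returns a position in the flattened array, it can help to convert that position back into a row and a column. A small sketch using np.unravel_index; the seed and the name X_demo are arbitrary choices made for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed for reproducibility
X_demo = rng.random((3, 5)) * 100

flat_index = np.argmin(X_demo)
row, col = np.unravel_index(flat_index, X_demo.shape)
print(flat_index, (row, col))

# The 2-D position points at the same element as the global minimum.
print(X_demo[row, col] == X_demo.min())
```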
Let $X$ be a collection of $N$ numbers $\{x_1, x_2, x_3, \dots, x_N\}$. Let $r$ be a percentage: $0 \leq r \leq 100$. An $r$-percentile is a value $c$ such that a fraction $\frac{r}{100}$ of the values in $X$ are less than or equal to $c$:
$$ \frac{|\{x\in X: x\leq c\}|}{N} = \frac{r}{100} $$
For finite $N$ this equality can only hold approximately, because the left-hand side is always a multiple of $\frac{1}{N}$.
X = np.random.rand(100)*100
X
np.percentile(X, 20)
np.sum(X <= np.percentile(X, 20))
print("0-percentile = {:.2f}".format(np.percentile(X, 0)))
print("100-percentile = {:.2f}".format(np.percentile(X, 100)))
print("min = {:.2f}".format(X.min()))
print("max = {:.2f}".format(X.max()))
np.median(X)
np.percentile(X, 50)
np.mean(X)
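To see where a percentile value comes from, here is a sketch that computes one by hand: sort the values and interpolate linearly between the two nearest ranks. This follows the default linear interpolation of np.percentile; percentile_linear is a hypothetical helper name, and other interpolation conventions exist.

```python
import numpy as np

def percentile_linear(values, r):
    """Hypothetical helper: r-th percentile via linear interpolation between ranks."""
    s = np.sort(np.asarray(values, dtype=float))
    pos = (len(s) - 1) * r / 100.0      # fractional rank in the sorted array
    lo = int(np.floor(pos))
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

rng = np.random.default_rng(0)          # arbitrary seed for reproducibility
data = rng.random(100) * 100

for r in (0, 20, 50, 100):
    print(r, percentile_linear(data, r), np.percentile(data, r))
```

With r = 50 this gives the same value as np.median on the same data.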
Another family of sample statistics measures how much the data fluctuate around the mean.
X
np.var(X)
np.std(X)
Note: std(X) has the same unit as X, and thus is easier to reason about.
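Both quantities can be written directly from their definitions: the variance is the mean squared deviation from the mean, and the standard deviation is its square root. A minimal sketch on a small made-up array (the values are arbitrary); note that np.var and np.std divide by N by default (ddof=0):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = data.mean()
variance = np.mean((data - mean) ** 2)   # divide by N, matching ddof=0
std = np.sqrt(variance)                  # back in the same unit as the data

print(variance, np.var(data))
print(std, np.std(data))
```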
Given a series of values $x_0, x_1, \dots, x_n$, the cumulative sum is another series defined as:
$$ y_i = \sum_{k=0}^{i} x_k, \quad i = 0, 1, \dots, n $$
$[y_0, y_1, \dots, y_n]$ is called the cumulative sums of $[x_0, x_1, \dots, x_n]$.
The cumulative product is defined similarly.
xs = np.arange(10) + 1
xs
np.cumsum(xs)
np.cumprod(xs)
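The definition of the cumulative sum translates directly into a running total. A sketch with an explicit loop, for illustration only (np.cumsum is the idiomatic choice); running_sums is a name introduced here:

```python
import numpy as np

xs = np.arange(10) + 1

running_sums = []
total = 0
for x in xs:
    total += int(x)              # y_i = y_(i-1) + x_i
    running_sums.append(total)

print(running_sums)
print(np.array_equal(running_sums, np.cumsum(xs)))

# Differences of consecutive cumulative sums recover the original values.
print(np.diff(np.cumsum(xs), prepend=0))
```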