Data Analytics Features of NumPy

import numpy as np

1 Aggregation

An aggregation reduces an array, or each row or column of it, to a summary value such as a sum or a maximum. This section introduces the concept and covers the vectorized aggregation functions that ship with NumPy.

1.1 Aggregation functions

x = np.arange(12).reshape(3, 4)
x

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

np.sum(x)

66

np.sum(x, axis=0)

array([12, 15, 18, 21])

np.sum(x, axis=1)

array([ 6, 22, 38])
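
The axis argument names the dimension that gets collapsed. Here is a minimal sketch of that idea, reusing the same x as above. The keepdims parameter is a standard np.sum option that the original examples do not use; it is included only as an illustration.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# axis=0 collapses the rows, leaving one sum per column
col_sums = np.sum(x, axis=0)

# axis=1 collapses the columns, leaving one sum per row
row_sums = np.sum(x, axis=1)

# keepdims=True keeps the collapsed axis with length 1,
# so the result can still be broadcast against x
row_sums_2d = np.sum(x, axis=1, keepdims=True)

print(col_sums.shape)     # (4,)
print(row_sums.shape)     # (3,)
print(row_sums_2d.shape)  # (3, 1)
```

Without an axis argument, every dimension is collapsed and the result is a single scalar.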

1.2 Other aggregations

x.sum(axis=1)

array([ 6, 22, 38])

x.max()

11

x.max(axis=1)

array([ 3,  7, 11])

x.min(axis=1)

array([0, 4, 8])

x.prod(axis=1)

array([   0,  840, 7920])

x.any(axis=1)

array([ True,  True,  True])

x.all(axis=1)

array([False,  True,  True])

1.3 Statistical metrics as aggregations

np.mean(x)

5.5

np.median(x)

5.5

np.percentile(x, 75)

8.25

np.std(x)

3.452052529534663

np.var(x)

11.916666666666666
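
The statistical functions accept the same axis argument as np.sum. Here is a short sketch under that assumption, using the 3x4 array from section 1.1 (the variable names are illustrative). It also shows that np.std and np.var compute the population versions by default (ddof=0), and that passing ddof=1 gives the sample estimate.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

col_means = np.mean(x, axis=0)      # one mean per column
row_medians = np.median(x, axis=1)  # one median per row
row_std = np.std(x, axis=1)         # population standard deviation per row

# ddof=1 divides by n - 1 instead of n (sample variance)
sample_var = np.var(x, ddof=1)

print(col_means)    # [4. 5. 6. 7.]
print(row_medians)  # [1.5 5.5 9.5]
```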

1.4 Argmin and Argmax

x = np.random.uniform(0, 1, (3,4)).round(2)
x

array([[0.97, 0.18, 0.87, 0.09],
       [0.52, 0.92, 0.12, 0.46],
       [0.48, 0.09, 0.85, 0.4 ]])

x.argmin(axis=0)

array([2, 2, 1, 0])

x.argmin(axis=1)

array([3, 2, 1])

x.argmin()

3

np.argmax(x, axis=0)

array([0, 1, 0, 1])

np.argmax(x, axis=1)

array([0, 1, 2])

np.argmax(x)

0
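
Without an axis argument, argmin and argmax return an index into the flattened array. One way to turn that index into a (row, column) pair is np.unravel_index, which the original examples do not use. The sketch below hard-codes the sample values shown above so that it is reproducible.

```python
import numpy as np

# the same values as the random sample shown above, hard-coded for reproducibility
x = np.array([
    [0.97, 0.18, 0.87, 0.09],
    [0.52, 0.92, 0.12, 0.46],
    [0.48, 0.09, 0.85, 0.40],
])

flat_min = x.argmin()    # position in the flattened (row-major) array
flat_max = np.argmax(x)

# convert the flat positions back to (row, column) coordinates
min_pos = np.unravel_index(flat_min, x.shape)
max_pos = np.unravel_index(flat_max, x.shape)

print(x[min_pos], x[max_pos])
```

When the extreme value occurs more than once (0.09 appears twice here), the index of the first occurrence is returned.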

2 Selection using boolean arrays

To illustrate the concepts of boolean arrays and how to use them for selection, let’s consider an example.

Suppose we have the grades of five students in three subjects:

Name        Index   Math    CS   Biology
Jack            0     90    80        75
Jill            1     93    95        87
Joe             2     67    98        88
Jason           3     77    89        80
Jennifer        4     93    97        95

grades = np.array([
    [90, 80, 75],
    [93, 95, 87],
    [67, 98, 88],
    [77, 89, 80],
    [93, 97, 95],
])

names = np.array([
    'Jack',
    'Jill',
    'Joe',
    'Jason',
    'Jennifer',
])

A boolean array can be obtained using Python's comparison and logical operators, which NumPy overloads to work element-wise:

==: equality
<, >, <=, >=: ordering comparisons
np.logical_not: element-wise negation
& and |: element-wise AND and OR
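
Here is a small sketch of these operators on a throwaway array (the array a is hypothetical and not part of the grades example). Each comparison is applied element by element and yields a boolean array. The parentheses matter because & and | bind more tightly than the comparison operators. For boolean arrays, ~ can serve as a shorthand for np.logical_not.

```python
import numpy as np

a = np.array([1, 5, 8, 12])  # hypothetical sample values

is_five = a == 5                  # [False  True False False]
in_range = (a > 2) & (a < 10)     # [False  True  True False]
outside = (a <= 2) | (a >= 10)    # [ True False False  True]
not_in_range = np.logical_not(in_range)

# for boolean arrays, ~ gives the same result as np.logical_not
same = np.array_equal(~in_range, not_in_range)
print(same)  # True
```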

# here are the math grades

grades[:, 0]

array([90, 93, 67, 77, 93])

# boolean mask of who got A+ in math

grades[:, 0] >= 90

array([ True,  True, False, False,  True])

# a boolean mask can be used as a selection index

names[grades[:, 0] >= 90]

array(['Jack', 'Jill', 'Jennifer'], dtype='<U8')

We can combine these operators to express more complex selection conditions.

# boolean mask for A+ in math and CS.

(grades[:, 0] >= 90) & (grades[:, 1] >= 90)

array([False,  True, False, False,  True])

names[(grades[:, 0] >= 90) & (grades[:, 1] >= 90)]

array(['Jill', 'Jennifer'], dtype='<U8')

# boolean mask for A+ in math and CS, but not in biology

(grades[:, 0] >= 90) & (grades[:, 1] >= 90) & np.logical_not(grades[:, 2] >= 90)

array([False,  True, False, False, False])

names[
    (grades[:, 0] >= 90) &
    (grades[:, 1] >= 90) &
    np.logical_not(grades[:, 2] >= 90)
]

array(['Jill'], dtype='<U8')
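
Because a boolean mask is itself an array, the aggregations from section 1 apply to it as well. The following sketch builds on that observation, reusing the grades and names arrays defined above and the same 90-point threshold for an A+. The variable names are illustrative.

```python
import numpy as np

grades = np.array([
    [90, 80, 75],
    [93, 95, 87],
    [67, 98, 88],
    [77, 89, 80],
    [93, 97, 95],
])

names = np.array(['Jack', 'Jill', 'Joe', 'Jason', 'Jennifer'])

# 5x3 boolean array: one entry per student and subject
a_plus = grades >= 90

# True counts as 1, so summing a mask counts the matches
per_subject = a_plus.sum(axis=0)         # number of A+ grades in each subject

any_a_plus = names[a_plus.any(axis=1)]   # A+ in at least one subject
all_a_plus = names[a_plus.all(axis=1)]   # A+ in every subject

print(per_subject)
print(any_a_plus)
print(all_a_plus)
```

Comparing the whole grades array at once avoids writing one condition per column when the same threshold applies to every subject.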