{:check ["true"],
 :draft ["true"],
 :rank ["about_series" "Series" "about_dataframe" "DataFrame"]}

Index

Introduction to Series and DataFrame

About Series

Series

  • A Series is a one-dimensional, labeled array provided by Pandas.

In [1]:
import pandas as pd

In [2]:
obj = pd.Series([4, 7, -5, 3])
In [3]:
obj
Out[3]:
0    4
1    7
2   -5
3    3
dtype: int64
In [4]:
obj.values
Out[4]:
array([ 4,  7, -5,  3])
In [5]:
obj.index
Out[5]:
RangeIndex(start=0, stop=4, step=1)

Pandas builds on NumPy and attaches semantic information to the data through labeled indexes.

  • We can create custom index data to indicate the meaning (or semantics) of the values.
  • The index can be specified during construction.
In [6]:
obj2 = pd.Series([4, 7, -5, 3], index=['jack', 'jill', 'joe', 'albert'])
In [7]:
obj2
Out[7]:
jack      4
jill      7
joe      -5
albert    3
dtype: int64
In [9]:
# Extract a value given its index label

obj2['jack']
Out[9]:
4
In [10]:
obj2['albert']
Out[10]:
3
In [11]:
# Passing a list of labels (similar to NumPy's fancy indexing)
# extracts a sub-series

obj2[['jack', 'jill', 'joe']]
Out[11]:
jack    4
jill    7
joe    -5
dtype: int64
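
The bracket syntax above selects by label. As a minimal sketch of the difference between label-based and position-based selection, the example below uses the `.loc` and `.iloc` accessors; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['jack', 'jill', 'joe', 'albert'])

# .loc selects by index label, .iloc selects by integer position.
by_label = obj2.loc[['jack', 'joe']]
by_position = obj2.iloc[[0, 2]]

# A label slice includes both endpoints; a position slice excludes the end.
label_slice = obj2.loc['jack':'joe']
position_slice = obj2.iloc[0:2]
```

Here `label_slice` holds three entries (jack, jill, joe) while `position_slice` holds two (jack, jill).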
In [13]:
# Comparing a Series with a scalar yields a boolean Series (as with NumPy arrays)

obj2 > 0
Out[13]:
jack       True
jill       True
joe       False
albert     True
dtype: bool
In [14]:
# We can use the boolean series to index the series.
# (similar to numpy's indexing with boolean arrays)
obj2[obj2 > 0]
Out[14]:
jack      4
jill      7
albert    3
dtype: int64
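
Boolean Series can also be combined before indexing. The sketch below assumes the same `obj2` as above and uses the element-wise operators `&` and `~`; the result names are hypothetical.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['jack', 'jill', 'joe', 'albert'])

# Parentheses are required: & binds more tightly than the comparisons.
small_positive = obj2[(obj2 > 0) & (obj2 < 5)]

# ~ negates a boolean Series element by element.
not_positive = obj2[~(obj2 > 0)]
```

With this data, `small_positive` keeps jack and albert, and `not_positive` keeps only joe.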

We can construct a Series from a dictionary.

In [15]:
dict_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [16]:
dict_data
Out[16]:
{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [17]:
dict_data.keys()
Out[17]:
dict_keys(['Ohio', 'Texas', 'Oregon', 'Utah'])
In [18]:
obj3 = pd.Series(dict_data)
In [19]:
obj3
Out[19]:
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
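
Because the dictionary keys become the index, a Series built this way behaves much like an ordered dictionary. A small sketch, with illustrative variable names:

```python
import pandas as pd

dict_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(dict_data)

# The `in` operator tests index labels, not values.
has_ohio = 'Ohio' in obj3
has_california = 'California' in obj3

# to_dict() converts the Series back into a plain dictionary.
round_trip = obj3.to_dict()
```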

Note:

  • dict_data has no entry for California.
  • What happens if we insist that California is part of the index?
In [20]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(dict_data, index=states)
obj4
Out[20]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
In [21]:
#
# Check which index entries have missing values
# using pd.isnull(...)
#
pd.isnull(obj4)
Out[21]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
In [23]:
pd.notnull(obj4)
Out[23]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
In [24]:
#
# Extract just the non-missing value entries
#
obj4[pd.notnull(obj4)]
Out[24]:
Ohio      35000.0
Oregon    16000.0
Texas     71000.0
dtype: float64
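
Filtering with `pd.notnull` is one option; the Series methods `dropna` and `fillna` are common alternatives. The sketch below rebuilds `obj4` so it runs on its own. Filling with `0` is an assumption made for illustration, not something the data implies.

```python
import pandas as pd

dict_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(dict_data, index=states)

# dropna() returns a new Series without the missing entries.
present = obj4.dropna()

# fillna() returns a new Series with missing entries replaced by a value.
filled = obj4.fillna(0)
```

Both methods return new objects and leave `obj4` unchanged.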
In [25]:
obj4.isnull()
Out[25]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
In [26]:
obj4.notnull()
Out[26]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
In [27]:
obj3
Out[27]:
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
In [28]:
obj4
Out[28]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
In [29]:
obj3 + obj4
Out[29]:
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
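
The addition above aligns the two Series by index label, so any label without a value on both sides yields NaN. As a hedged sketch of one way to treat a label that is absent from one operand as zero, `Series.add` accepts a `fill_value` argument:

```python
import pandas as pd

dict_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(dict_data)
obj4 = pd.Series(dict_data, index=['California', 'Ohio', 'Oregon', 'Texas'])

# Plain + propagates NaN for any label missing on either side.
plain_sum = obj3 + obj4

# add(..., fill_value=0) substitutes 0 when only one side is missing.
# California stays NaN because it is missing on both sides.
filled_sum = obj3.add(obj4, fill_value=0)
```

Utah, present only in `obj3`, becomes 5000.0 in `filled_sum` instead of NaN.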

Pandas also allows us to replace the index of a Series after it is created.

In [30]:
obj
Out[30]:
0    4
1    7
2   -5
3    3
dtype: int64
In [31]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
In [32]:
obj
Out[32]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
In [35]:
obj.name = 'trading_income'
In [36]:
obj 
Out[36]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
Name: trading_income, dtype: int64
In [37]:
obj.index.name = 'customer_names'
In [38]:
obj
Out[38]:
customer_names
Bob      4
Steve    7
Jeff    -5
Ryan     3
Name: trading_income, dtype: int64
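
Assigning to `obj.index` or `obj.index.name` modifies the Series in place. A minimal sketch of a non-mutating alternative uses `rename` and `rename_axis`, which return new objects; the label `'Robert'` is made up for illustration.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3],
                index=['Bob', 'Steve', 'Jeff', 'Ryan'],
                name='trading_income')

# rename() with a mapping relabels individual index entries.
renamed = obj.rename({'Bob': 'Robert'})

# rename_axis() sets the name of the index itself.
labeled = obj.rename_axis('customer_names')
```

The original `obj` keeps its labels and its unnamed index after both calls.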

About DataFrame

DataFrame

  • DataFrame is a data structure provided by Pandas to store and manipulate tabular data.

In [1]:
import pandas as pd

Construct a DataFrame object from a dictionary of values.

In [2]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
data
Out[2]:
{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
In [3]:
frame = pd.DataFrame(data)
In [4]:
frame
Out[4]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
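
After constructing a DataFrame it is often useful to check its dimensions and column types. The following sketch uses the `shape`, `columns`, and `dtypes` attributes; the variable names are illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

# shape is a (rows, columns) tuple.
n_rows, n_cols = frame.shape

# Columns follow the insertion order of the dictionary keys.
column_names = list(frame.columns)

# dtypes is a Series mapping each column name to its data type.
column_types = frame.dtypes

# Use brackets for 'pop': frame.pop is a DataFrame method, not the column.
populations = frame['pop']
```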
In [6]:
# An individual column can be extracted
# as a Pandas Series

frame['state']
Out[6]:
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object
In [7]:
# Multiple columns can be extracted
# as a Pandas DataFrame
frame[['state', 'year']]
Out[7]:
    state  year
0    Ohio  2000
1    Ohio  2001
2    Ohio  2002
3  Nevada  2001
4  Nevada  2002
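
Rows can be selected in much the same way as Series entries, using a boolean Series built from a column. This is a sketch assuming the same `frame` as above; `.loc` is used to pick rows and columns in one step.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'year': [2000, 2001, 2002, 2001, 2002],
                      'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})

# A boolean Series inside the brackets filters rows.
recent = frame[frame['year'] > 2001]

# .loc takes a row selector and a column selector together.
nevada = frame.loc[frame['state'] == 'Nevada', ['year', 'pop']]
```

With this data, `recent` keeps the rows labeled 2 and 4, and `nevada` keeps the rows labeled 3 and 4 with only the `year` and `pop` columns.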

Another way of constructing a DataFrame is from a collection of rows.

In [8]:
data = [['Ohio', 2000, 1.5],
        ['Ohio', 2001, 1.7],
        ['Ohio', 2002, 3.6],
        ['Nevada', 2001, 2.4],
        ['Nevada', 2002, 2.9]]
In [9]:
data
Out[9]:
[['Ohio', 2000, 1.5],
 ['Ohio', 2001, 1.7],
 ['Ohio', 2002, 3.6],
 ['Nevada', 2001, 2.4],
 ['Nevada', 2002, 2.9]]
In [12]:
frame2 = pd.DataFrame(data, columns=['state', 'year', 'pop'])
frame2
Out[12]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
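
A related row-oriented form is a list of dictionaries, where each dictionary is one row and its keys supply the column names. The sketch below abbreviates the data to three rows and compares the result with the list-of-lists construction; the variable names are illustrative.

```python
import pandas as pd

rows = [['Ohio', 2000, 1.5],
        ['Ohio', 2001, 1.7],
        ['Nevada', 2001, 2.4]]
frame_from_rows = pd.DataFrame(rows, columns=['state', 'year', 'pop'])

# Each dictionary is one row; no separate columns argument is needed.
records = [{'state': 'Ohio', 'year': 2000, 'pop': 1.5},
           {'state': 'Ohio', 'year': 2001, 'pop': 1.7},
           {'state': 'Nevada', 'year': 2001, 'pop': 2.4}]
frame_from_records = pd.DataFrame(records)
```

Both constructions produce the same columns and values here.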
In [ ]:
frame