Lab 5: pandas
Objective
- To learn the basics of the pandas Python library.
Data structures
In pandas, there are two types of data structures:
Structure | Description |
---|---|
Series | 1D labeled homogeneously-typed array |
DataFrame | General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns |
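The two descriptions in the table can be checked with a minimal sketch. The values, labels, and column names below are made up for illustration only.

```python
import pandas as pd

# A Series holds one labeled sequence of values that share a single dtype.
s = pd.Series([1.0, 2.5, 4.0], index=['x', 'y', 'z'])
print(s.dtype)       # one dtype for the whole Series

# A DataFrame holds several labeled columns, each with its own dtype.
df = pd.DataFrame({'count': [1, 2, 3], 'name': ['a', 'b', 'c']},
                  index=['x', 'y', 'z'])
print(df.dtypes)     # one dtype per column

# Size-mutable: a column can be added after creation (aligned on the labels).
df['ratio'] = s
print(df.shape)
```

Each column of the data frame is itself a `Series`, which is why `df['ratio']` can be assigned directly from `s`.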
Instructions
For all sections in this lab except the last, use the IPython console (normally located at the bottom-right corner) to run the code.
Imports
- To import the `pandas` library,

      import pandas as pd

- `NumPy` is a dependency of `pandas` and also a powerful Python library for scientific data processing. We may need to use `NumPy` from time to time. To import `NumPy`,

      import numpy as np

- It's common practice to import `pandas` as `pd` and `numpy` as `np`. You will see this a lot if you search for tutorials or solutions online. However, it is just a convention; other names work too.
Series
- Creation
  - from list,

        s1 = pd.Series([1, 3, 5, np.nan, 6, 8])
        s2 = pd.Series([1, 3, 5, np.nan, 6, 8], index=[1, 2, 3, 4, 5, 'f'])

    What is the difference between `s1` and `s2`?
  - from dict,

        d = {'a': 1, 'b': 2, 'c': 3}
        s3 = pd.Series(d)

  - from scalar value,

        s4 = pd.Series(5, index=['a', 'b', 'c', 'd', 'e'])
- Indexing of `Series`
  - Try the following code to understand how to get and set values in a series with default indexing.

        s = pd.Series(np.random.randn(5))
        s[0]
        s[0] = 1.5
        s

  - If labels are specified for the indices,

        s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
        s['a']
        s[0]         # positional access; recent pandas versions may warn or raise here (s.iloc[0] is the explicit form)
        s['b'] = 1.8
        s
        s[2] = 2     # positional assignment; s.iloc[2] = 2 is the explicit form
        s
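Getting and setting by label or by position can also be written explicitly with `.loc` and `.iloc`. The following is a minimal sketch with fixed, made-up values instead of random ones. It may be useful because recent pandas releases emit a deprecation warning when an integer is used with plain `[]` on a label-indexed series.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, np.nan, 40.0], index=['a', 'b', 'c', 'd'])

by_label = s.loc['b']       # label-based lookup
by_position = s.iloc[1]     # position-based lookup (0 is the first element)

s.loc['a'] = 1.5            # set by label
s.iloc[3] = 2.0             # set by position

print(by_label, by_position)
print(s)
print(s.isna().sum())       # number of missing (NaN) values
```

Both lookups refer to the same element here, since `'b'` is the label at position 1.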
DataFrame
- Creation
  - from a NumPy array

        df = pd.DataFrame(np.random.randn(6, 4))

    Indices and column names can be provided at creation.

        df = pd.DataFrame(np.random.randn(6, 4), index=list('abcdef'), columns=list('ABCD'))

  - from a dict

    Run the following code and work out what each function used does.

        df2 = pd.DataFrame({
            'A': 1,
            'B': pd.Timestamp('20190930'),
            'C': pd.date_range('20190930', periods=4),
            'D': pd.Series(1, index=list(range(4)), dtype='float32'),
            'E': np.array([3] * 4, dtype='int32'),
            'F': pd.Categorical(['test', 'train', 'test', 'train']),
            'G': 'foo'
        })

    The `dtypes` of a `DataFrame` can be viewed using `df2.dtypes`. In IPython, tab completion is enabled for column names and public attributes.
- Data display
  - What do `df.head()` and `df.tail()` do? What happens if I use `df.head(3)` and `df.tail(2)`?
  - `df.index` displays the indices of a data frame.
  - `df.columns` displays the columns of a data frame.
  - `df.describe()` shows a quick statistical summary of each column of the data.
- Direct indexing
  - to get a column,

        df['A']
        df.A

    `df[0]` would not work, because `0` is not a column label.
  - to select multiple columns,

        df[['A', 'B']]

  - to get a slice of rows,

        df[0:4]
        df['a':'d']

    Is the indexing inclusive or exclusive?
- Selection by label

  With the following lines, identify how `.loc[...]` works.

        df.loc['a']
        df.loc['a':'c']
        df.loc[['a', 'c']]
        df.loc[:, 'A']
        df.loc[:, ['A', 'B']]
        df.loc[:, 'A':'C']
        df.loc['a':'c', ['A', 'B']]
        df.loc[['a', 'c'], ['A', 'B']]
        df.loc[['a', 'c'], 'A':'C']
        df.loc['a':'c', 'A':'C']
        df.loc['a', 'A']
        df.loc['a', 'A':'C']

  `df.at['a', 'A']` is equivalent to `df.loc['a', 'A']` (only to get a scalar value).

  The object returned by `.loc` is either a series (1-D), a data frame (2-D), or a scalar (single value).
- Selection by position
  `.iloc` and `.iat` work similarly to `.loc` and `.at`. The only difference is that they use the position of the row/column instead of its label.

  Find the equivalent usage of `.iloc` that provides the same outputs as the previous lines for `.loc`.
- Boolean indexing

  Investigate the differences in the outputs of the following lines:

        df[df.A > 0]
        df[df > 0]

  Rows can be filtered by the values in a column with `.isin`.

        df2 = df.copy()
        df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
        df2[df2['E'].isin(['two', 'four'])]
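The selection styles in this section can be compared side by side on a small data frame. This is a sketch only: the data frame uses fixed, made-up values rather than `np.random.randn`, so that the results are the same on every run.

```python
import numpy as np
import pandas as pd

# Fixed values (an assumption for illustration) instead of random ones.
# Rows: a = [-5, -4, -3, -2], b = [-1, 0, 1, 2], c = [3, 4, 5, 6]
df = pd.DataFrame(
    np.arange(12).reshape(3, 4) - 5,
    index=list('abc'),
    columns=list('ABCD'),
)

# Label slices include the end label; position slices exclude the end position.
by_label = df.loc['a':'b', 'A':'C']
by_position = df.iloc[0:2, 0:3]

# A boolean Series keeps whole rows; a boolean DataFrame masks single cells.
rows_kept = df[df.A > 0]       # only the rows where column A is positive
cells_masked = df[df > 0]      # same shape as df, NaN where the test fails

# .isin builds a boolean Series from a list of accepted values.
picked = df[df['B'].isin([0, 4])]

print(by_label.equals(by_position))
print(rows_kept)
print(cells_masked)
print(picked)
```

Comparing `by_label` with `by_position` shows how a `.loc` call translates into `.iloc`: the end of the slice moves by one because only the label-based slice includes its end point.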
Data manipulation
The `pandas` library provides a lot of functions to manipulate data.
Go to UCI datasets to download `iris.data` and `iris.names` from the Data Folder.
- Load `iris.data` as a data frame. (Hint: `iris.data` is a CSV file.)
- Update the column names based on `iris.names`.
- Calculate the mean, min, max, and standard deviation of each column.
- Create a new column called `class value` using the following code:

      df['class value'] = pd.factorize(df['class'])[0]

  Investigate the output of `pd.factorize`.
- Group the data according to the class. (Hint: `.groupby`)
- Identify the function to extract each group using the name of the class.
- Calculate the mean, min, max, and standard deviation of each column in each group.
- Produce a scatter plot for any two columns using the `matplotlib` library.
- Identify the methods (at least 2) to loop through a data frame row by row.
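One possible shape for the non-plotting steps above is sketched below. It is not the only solution. It reads a few sample rows from an in-memory string in the same header-less, comma-separated layout as `iris.data`, so it runs without the download. The column names are an assumption based on the attribute list in `iris.names`.

```python
import io
import pandas as pd

# A few sample rows in the layout of iris.data (no header row).
# With the downloaded file, pass its path instead of the StringIO object.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.4,3.2,4.5,1.5,Iris-versicolor\n"
)

# Assumed column names, taken from the attribute list in iris.names.
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
df = pd.read_csv(sample, header=None, names=names)

# factorize returns (integer codes, unique values); [0] keeps only the codes.
codes, uniques = pd.factorize(df['class'])
df['class value'] = codes

# Per-class statistics for the four measurement columns.
measurements = names[:4]
stats = df.groupby('class')[measurements].agg(['mean', 'min', 'max', 'std'])
print(stats)

# Extract one group by the name of its class.
setosa = df.groupby('class').get_group('Iris-setosa')
print(setosa)

# Two ways to loop through the data frame row by row.
for label, row in df.iterrows():           # row is a Series
    print(label, row['class'])
for row in df.itertuples(index=False):     # row is a namedtuple
    print(row[0], row[-1])
```

`iterrows` and `itertuples` are convenient for inspection, but column-wise operations such as the `groupby` call are usually the faster choice for actual calculations.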