Lab 5: pandas
Objective
- To learn the basics of the pandas Python library.
Data structures
In pandas, there are two types of data structures:
| Structure | Description |
|---|---|
| Series | 1D labeled homogeneously-typed array |
| DataFrame | General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns |
Instruction
For all sections in this lab other than the last, use the IPython console (normally located in the bottom-right corner) to run the code.
Imports
- To import the `pandas` library,

      import pandas as pd

- `NumPy` is a dependency of `pandas` and also a powerful Python library for scientific data processing. We may need to use `NumPy` from time to time. To import `NumPy`,

      import numpy as np

- It's common practice to import `pandas` as `pd` and `numpy` as `np`. You will see this a lot if you search for tutorials or solutions online. However, it is only a convention; other names work just as well.
Series
- Creation
  - from list,

        s1 = pd.Series([1, 3, 5, np.nan, 6, 8])
        s2 = pd.Series([1, 3, 5, np.nan, 6, 8], index=[1, 2, 3, 4, 5, 'f'])

    What is the difference between `s1` and `s2`?
  - from dict,

        d = {'a': 1, 'b': 2, 'c': 3}
        s3 = pd.Series(d)

  - from scalar value,

        s4 = pd.Series(5, index=['a', 'b', 'c', 'd', 'e'])

- Indexing of `Series`
  - try the following code to understand the getting and setting of a series with default indexing.

        s = pd.Series(np.random.randn(5))
        s[0]
        s[0] = 1.5
        s

  - if the labels for the indices are specified,

        s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
        s['a']
        s[0]
        s['b'] = 1.8
        s
        s[2] = 2
        s
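The two indexing exercises above mix label keys (`s['a']`) and integer keys (`s[0]`) on the same series. Newer pandas releases may warn about, or reject, a bare integer key on a label-indexed series, so the sketch below uses the explicit `.loc` (label) and `.iloc` (position) accessors instead. The values are toy numbers chosen for illustration and are not part of the lab.

```python
import pandas as pd

# Toy values chosen only for illustration (not from the lab).
s = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

# .loc always selects by label, .iloc always selects by position.
by_label = s.loc['b']
by_position = s.iloc[1]
print(by_label, by_position)

# Setting values works through the same two accessors.
s.loc['a'] = 1.5
s.iloc[2] = 2.0
print(s)

# A Series built from a dict takes its index from the dict keys.
s3 = pd.Series({'a': 1, 'b': 2, 'c': 3})
print(s3.index.tolist())
```

If a line from the exercise raises a warning or an error in your console, rewriting it with `.loc` or `.iloc` is usually the intended fix.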
DataFrame
- Creation
  - from NumPy array

        df = pd.DataFrame(np.random.randn(6,4))

    Indices and column names can be provided at creation.

        df = pd.DataFrame(np.random.randn(6,4), index=list('abcdef'), columns=list('ABCD'))

  - from a dict

    Run the following code and understand the functions used.

        df2 = pd.DataFrame({
            'A': 1,
            'B': pd.Timestamp('20190930'),
            'C': pd.date_range('20190930', periods=4),
            'D': pd.Series(1, index=list(range(4)), dtype='float32'),
            'E': np.array([3]*4, dtype='int32'),
            'F': pd.Categorical(['test', 'train', 'test', 'train']),
            'G': 'foo'
        })

    The `dtypes` of a `DataFrame` can be viewed using `df2.dtypes`. In IPython, tab completion is enabled for column names and public attributes.
- Data display
  - What do `df.head(0)` and `df.tail()` do? What happens if I use `df.head(3)` and `df.tail(2)`?
  - `df.index` displays the indices of a data frame.
  - `df.columns` displays the columns of a data frame.
  - `df.describe()` shows a quick statistical summary of each column of the data.
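The display calls listed above can be tried together in one pass. The sketch below assumes a seeded random generator (`np.random.default_rng(0)`, not used elsewhere in this lab) so that reruns print the same numbers; otherwise it builds the same kind of 6×4 frame as the creation step.

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption for reproducibility, not part of the lab).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)),
                  index=list('abcdef'), columns=list('ABCD'))

print(df.head(3))    # compare with df.head() and df.head(0)
print(df.tail(2))    # compare with df.tail()
print(df.index)      # row labels
print(df.columns)    # column labels
print(df.dtypes)     # one dtype per column

summary = df.describe()  # per-column statistics, returned as a DataFrame
print(summary)
```

Because `describe()` returns a data frame, individual statistics can be picked out of `summary` with the indexing techniques covered in the following parts.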
- Direct indexing
  - to get a column,

        df['A']
        df.A

    `df[0]` would not work.
  - to select multiple columns,

        df[['A', 'B']]

  - to get a slice of rows

        df[0:4]
        df['a':'d']

    Is the indexing inclusive or exclusive?
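One way to explore the direct-indexing forms above is to store each result and inspect its type and length. The sketch below uses a small integer-valued frame (an assumption, so that rows are easy to recognise) and leaves the inclusive/exclusive question for you to read off the printed output.

```python
import numpy as np
import pandas as pd

# Integer values 0..23 so every row is easy to recognise (illustrative only).
df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=list('abcdef'), columns=list('ABCD'))

col = df['A']            # a single column name
cols = df[['A', 'B']]    # a list of column names
pos_rows = df[0:4]       # slice of rows by integer position
lab_rows = df['a':'d']   # slice of rows by index label

print(type(col).__name__, type(cols).__name__)
print(pos_rows)
print(lab_rows)

# df[0] looks for a column labelled 0, which this frame does not have.
try:
    df[0]
except KeyError as err:
    print('KeyError:', err)
```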
- Selection by label

  With the following lines, identify how the `.loc[...]` indexer works

      df.loc['a']
      df.loc['a':'c']
      df.loc[['a', 'c']]
      df.loc[:, 'A']
      df.loc[:, ['A', 'B']]
      df.loc[:, 'A':'C']
      df.loc['a':'c', ['A', 'B']]
      df.loc[['a', 'c'], ['A', 'B']]
      df.loc[['a', 'c'], 'A':'C']
      df.loc['a':'c', 'A':'C']
      df.loc['a', 'A']
      df.loc['a', 'A':'C']

  `df.at['a','A']` is equivalent to `df.loc['a','A']` (only to get a scalar value)

  The object returned by `.loc` is either a series (1-D), a data frame (2-D), or a scalar (single value).
- Selection by position

  `.iloc` and `.iat` work similarly to `.loc` and `.at`. The only difference is that, instead of the label of the row/column, we use the position of the row/column.

  Find the equivalent usage of `.iloc` that provides the same outputs as the previous lines for `.loc`.
- Boolean indexing

  Investigate the differences in the outputs of the following lines:

      df[df.A > 0]
      df[df > 0]

  Filtering of a column can be done with `.isin`.

      df2 = df.copy()
      df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
      df2[df2['E'].isin(['two', 'four'])]
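A convenient way to check a proposed `.iloc` equivalent is `DataFrame.equals`, which compares two results for identical labels and values. The sketch below checks a single `.loc`/`.iloc` pair and then repeats the boolean-indexing lines on a frame with known values; the integer data is an assumption made so that the sign of each entry is predictable.

```python
import numpy as np
import pandas as pd

# Values -12..11 so that some entries are negative and some positive.
df = pd.DataFrame(np.arange(-12, 12).reshape(6, 4),
                  index=list('abcdef'), columns=list('ABCD'))

# One label-based selection and a candidate position-based equivalent.
by_label = df.loc['a':'c', ['A', 'B']]
by_position = df.iloc[0:3, [0, 1]]
print(by_label.equals(by_position))

# A boolean Series filters rows; a boolean DataFrame masks single cells.
row_mask = df['A'] > 0
print(df[row_mask])
print(df[df > 0])

# .isin builds a boolean Series from a list of accepted values.
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
picked = df2[df2['E'].isin(['two', 'four'])]
print(picked.index.tolist())
```

The same `equals` check can be reused for each of the remaining `.loc` lines while working through the exercise.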
Data manipulation
The pandas library provides many functions for manipulating data.
Go to UCI datasets to download iris.data and iris.names from the Data Folder.
- Load `iris.data` as a data frame (Hint: `iris.data` is a CSV file)
- Update the column names based on `iris.names`.
- Calculate the mean, min, max, and standard deviation of each column.
- Create a new column called `class value` using the following code:

      df['class value'] = pd.factorize(df['class'])[0]

  Investigate the output of `pd.factorize`.
- Group the data according to the class. (Hint: `.groupby`)
- Identify the function to extract each group using the name of the class.
- Calculate the mean, min, max, and standard deviation of each column in each group.
- Produce a scatter plot for any two columns using the `matplotlib` library.
- Identify the methods (at least 2) to loop through a data frame row by row.