Pandas Intro¶

Pandas stands for python annotated n-dimensional arrays, which describes its functionality pretty well. Pandas is a library for reading, changing, creating and analysing data. Thanks to Pandas being built on numpy it also has very high performance.

In [36]:

Copied!

import numpy as np
import pandas as pd
import numpy as np
import pandas as pd

Pandas consists of mainly 2 data structures, Series and Dataframes.

Series¶

A Series is a 1D labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.) however it can only contain one data type at a time. A Series also has an index, a list of axis labels or identifiers, which you can see on the left.

In simple words you can imagine a Series as a column.

From List¶

In [37]:

Copied!

series_list = pd.Series([1, 2, 3, 4, 5, 6])
series_list
series_list = pd.Series([1, 2, 3, 4, 5, 6])
series_list

Out[37]:

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

In [38]:

Copied!

type(series_list)
type(series_list)

Out[38]:

pandas.core.series.Series

Like a numpy array, a pandas Series has a dtype. This is often a numpy dtype but pandas has also added a few of its own.

In [39]:

Copied!

series_list.dtype
series_list.dtype

Out[39]:

dtype('int64')

Also just like a numpy array, you can look at the shape i.e the dimensions of a Series/DataFrame.

In [40]:

Copied!

series_string = pd.Series(["Bob", "John", "Joe"])
series_string.shape
series_string = pd.Series(["Bob", "John", "Joe"])
series_string.shape

Out[40]:

(3,)

In [41]:

Copied!

series_np = pd.Series(np.array([10, 20, 30, 40, 50, 60]))
series_np
series_np = pd.Series(np.array([10, 20, 30, 40, 50, 60]))
series_np

Out[41]:

0    10
1    20
2    30
3    40
4    50
5    60
dtype: int32

From Dictionary¶

In [42]:

Copied!

series_dict = pd.Series({"a": 0.0, "b": 1.0, "c": 2.0})
series_dict
series_dict = pd.Series({"a": 0.0, "b": 1.0, "c": 2.0})
series_dict

Out[42]:

a    0.0
b    1.0
c    2.0
dtype: float64

Custom indexes¶

Pandas generates by default an index for us which is a sequence of integers but this can be changed.

In [43]:

Copied!





indexes = np.arange(0, 5, 1)
series_index = pd.Series(
    [2 ** x for x in indexes],
    index=indexes
)
series_index
indexes = np.arange(0, 5, 1)
series_index = pd.Series(
    [2 ** x for x in indexes],
    index=indexes
)
series_index

Out[43]:

0     1
1     2
2     4
3     8
4    16
dtype: int32

In [44]:

Copied!

series_string_index = pd.Series(np.arange(1, 6, 1), index=["a", "b", "c", "d", "e"])
series_string_index
series_string_index = pd.Series(np.arange(1, 6, 1), index=["a", "b", "c", "d", "e"])
series_string_index

Out[44]:

a    1
b    2
c    3
d    4
e    5
dtype: int32

Iterating over Series¶

Just like most other data structures in python, it is possible to iterate over them using simple for loops

In [45]:

Copied!

for value in series_string_index:
    print(value)
for value in series_string_index:
    print(value)

In [46]:

Copied!

for key in series_string_index.keys():  # key being the row index
    print(key)
for key in series_string_index.keys():  # key being the row index
    print(key)

a
b
c
d
e

In [47]:

Copied!

for index in series_string_index.index:
    print(index)
for index in series_string_index.index:
    print(index)

a
b
c
d
e

DataFrame¶

A DataFrame is a 2D labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series/List objects.

If we consider column names as keys and lists of items under that column as values, we can easily use a python dict to represent a DataFrame. Just like with the Series if we haven’t provided a row index it automatically generates a sequence of integers.

From dict of lists/numpy arrays¶

In [48]:

Copied!





my_dict = {
    "name": ["a", "b", "c", "d", "e", "f", "g"],
    "age": [20, 27, 35, 55, 18, 21, 35],
    "designation": ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}

df_dict = pd.DataFrame(my_dict)
df_dict
my_dict = {
    "name": ["a", "b", "c", "d", "e", "f", "g"],
    "age": [20, 27, 35, 55, 18, 21, 35],
    "designation": ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}

df_dict = pd.DataFrame(my_dict)
df_dict

Out[48]:

	name	age	designation
0	a	20	VP
1	b	27	CEO
2	c	35	CFO
3	d	55	VP
4	e	18	VP
5	f	21	CEO
6	g	35	MD

In [49]:

Copied!





df_index = pd.DataFrame(
    my_dict,
    index=["First", "Second", "Third", "Fourth", "Fifth", "Sixth", "Seventh"]
)
df_index
df_index = pd.DataFrame(
    my_dict,
    index=["First", "Second", "Third", "Fourth", "Fifth", "Sixth", "Seventh"]
)
df_index

Out[49]:

	name	age	designation
First	a	20	VP
Second	b	27	CEO
Third	c	35	CFO
Fourth	d	55	VP
Fifth	e	18	VP
Sixth	f	21	CEO
Seventh	g	35	MD

From CSV¶

In [50]:

Copied!

df_cars = pd.read_csv("data/cars.csv", sep=";", skiprows=[1])
df_cars
df_cars = pd.read_csv("data/cars.csv", sep=";", skiprows=[1])
df_cars

Out[50]:

	Car	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
0	Chevrolet Chevelle Malibu	18.0	8	307.0	130.0	3504.0	12.0	70	US
1	Buick Skylark 320	15.0	8	350.0	165.0	3693.0	11.5	70	US
2	Plymouth Satellite	18.0	8	318.0	150.0	3436.0	11.0	70	US
3	AMC Rebel SST	16.0	8	304.0	150.0	3433.0	12.0	70	US
4	Ford Torino	17.0	8	302.0	140.0	3449.0	10.5	70	US
...	...	...	...	...	...	...	...	...	...
401	Ford Mustang GL	27.0	4	140.0	86.0	2790.0	15.6	82	US
402	Volkswagen Pickup	44.0	4	97.0	52.0	2130.0	24.6	82	Europe
403	Dodge Rampage	32.0	4	135.0	84.0	2295.0	11.6	82	US
404	Ford Ranger	28.0	4	120.0	79.0	2625.0	18.6	82	US
405	Chevy S-10	31.0	4	119.0	82.0	2720.0	19.4	82	US

406 rows × 9 columns

Meta Information¶

We can also look at some basic information of our DataFrame.

In [51]:

Copied!

df_cars.shape
df_cars.shape

Out[51]:

(406, 9)

In [52]:

Copied!

df_cars.info()
df_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Car           406 non-null    object 
 1   MPG           406 non-null    float64
 2   Cylinders     406 non-null    int64  
 3   Displacement  406 non-null    float64
 4   Horsepower    406 non-null    float64
 5   Weight        406 non-null    float64
 6   Acceleration  406 non-null    float64
 7   Model         406 non-null    int64  
 8   Origin        406 non-null    object 
dtypes: float64(5), int64(2), object(2)
memory usage: 28.7+ KB

If the dtype is object then it is in most cases a string. We can also look at the first or last data entries

In [53]:

Copied!

df_cars.head()  # First 10
df_cars.head(5)  # First 5
df_cars.head()  # First 10
df_cars.head(5)  # First 5

Out[53]:

	Car	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
0	Chevrolet Chevelle Malibu	18.0	8	307.0	130.0	3504.0	12.0	70	US
1	Buick Skylark 320	15.0	8	350.0	165.0	3693.0	11.5	70	US
2	Plymouth Satellite	18.0	8	318.0	150.0	3436.0	11.0	70	US
3	AMC Rebel SST	16.0	8	304.0	150.0	3433.0	12.0	70	US
4	Ford Torino	17.0	8	302.0	140.0	3449.0	10.5	70	US

In [54]:

Copied!

df_cars.tail()  # Last 10
df_cars.tail(5)  # Last 5
df_cars.tail()  # Last 10
df_cars.tail(5)  # Last 5

Out[54]:

	Car	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
401	Ford Mustang GL	27.0	4	140.0	86.0	2790.0	15.6	82	US
402	Volkswagen Pickup	44.0	4	97.0	52.0	2130.0	24.6	82	Europe
403	Dodge Rampage	32.0	4	135.0	84.0	2295.0	11.6	82	US
404	Ford Ranger	28.0	4	120.0	79.0	2625.0	18.6	82	US
405	Chevy S-10	31.0	4	119.0	82.0	2720.0	19.4	82	US

Accesing Data¶

In [55]:

Copied!

df_cars.columns
df_cars.columns

Out[55]:

Index(['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
       'Acceleration', 'Model', 'Origin'],
      dtype='object')

In [56]:

Copied!

df_cars["Car"]
df_cars["Car"]

Out[56]:

0      Chevrolet Chevelle Malibu
1              Buick Skylark 320
2             Plymouth Satellite
3                  AMC Rebel SST
4                    Ford Torino
                 ...            
401              Ford Mustang GL
402            Volkswagen Pickup
403                Dodge Rampage
404                  Ford Ranger
405                   Chevy S-10
Name: Car, Length: 406, dtype: object

This is also works but other is preferred as this can cause clashes with existing DataFrame attributes for example count

In [57]:

Copied!

df_cars.Car
df_cars.Car

Out[57]:

0      Chevrolet Chevelle Malibu
1              Buick Skylark 320
2             Plymouth Satellite
3                  AMC Rebel SST
4                    Ford Torino
                 ...            
401              Ford Mustang GL
402            Volkswagen Pickup
403                Dodge Rampage
404                  Ford Ranger
405                   Chevy S-10
Name: Car, Length: 406, dtype: object

You can also get Multiple columns.

In [58]:

Copied!

df_cars[["Car", "MPG", "Cylinders"]]
df_cars[["Car", "MPG", "Cylinders"]]

Out[58]:

	Car	MPG	Cylinders
0	Chevrolet Chevelle Malibu	18.0	8
1	Buick Skylark 320	15.0	8
2	Plymouth Satellite	18.0	8
3	AMC Rebel SST	16.0	8
4	Ford Torino	17.0	8
...	...	...	...
401	Ford Mustang GL	27.0	4
402	Volkswagen Pickup	44.0	4
403	Dodge Rampage	32.0	4
404	Ford Ranger	28.0	4
405	Chevy S-10	31.0	4

406 rows × 3 columns

The iloc function stands for integer location and allows us to easly access data by using there indexes.

In [59]:

Copied!

df_cars.iloc[0]  # row with index 0
df_cars.iloc[0]  # row with index 0

Out[59]:

Car             Chevrolet Chevelle Malibu
MPG                                  18.0
Cylinders                               8
Displacement                        307.0
Horsepower                          130.0
Weight                             3504.0
Acceleration                         12.0
Model                                  70
Origin                                 US
Name: 0, dtype: object

We can also multiple rows and specify certain columns but as we are using iloc it has to be the indexes of the columns.

In [60]:

Copied!

df_cars.iloc[[0, 1], [0, 2, 3]]  # row 0 and 1 with 3 columns
df_cars.iloc[[0, 1], [0, 2, 3]]  # row 0 and 1 with 3 columns

Out[60]:

	Car	Cylinders	Displacement
0	Chevrolet Chevelle Malibu	8	307.0
1	Buick Skylark 320	8	350.0

We can also us slicing just like with python lists.

In [61]:

Copied!

df_cars.iloc[0:5, 0:3]
df_cars.iloc[0:5, 0:3]

Out[61]:

	Car	MPG	Cylinders
0	Chevrolet Chevelle Malibu	18.0	8
1	Buick Skylark 320	15.0	8
2	Plymouth Satellite	18.0	8
3	AMC Rebel SST	16.0	8
4	Ford Torino	17.0	8

To access data there is also the loc function which works the same way as iloc but uses the labels not the indexes.

In [62]:

Copied!

df_cars.loc[[0, 1], ["Car", "Cylinders"]]
df_cars.loc[[0, 1], ["Car", "Cylinders"]]

Out[62]:

	Car	Cylinders
0	Chevrolet Chevelle Malibu	8
1	Buick Skylark 320	8

In [63]:

Copied!

df_index.loc[["First", "Second"]]
df_index.loc[["First", "Second"]]

Out[63]:

	name	age	designation
First	a	20	VP
Second	b	27	CEO

You can also use slices with loc important here is however that the end values is inclusive!

In [64]:

Copied!

df_cars.loc[0:5, "Car":"Cylinders"]
df_cars.loc[0:5, "Car":"Cylinders"]

Out[64]:

	Car	MPG	Cylinders
0	Chevrolet Chevelle Malibu	18.0	8
1	Buick Skylark 320	15.0	8
2	Plymouth Satellite	18.0	8
3	AMC Rebel SST	16.0	8
4	Ford Torino	17.0	8
5	Ford Galaxie 500	15.0	8

In [65]:

Copied!

df_cars["Cylinders"].value_counts()
df_cars["Cylinders"].value_counts()

Out[65]:

4    207
8    108
6     84
3      4
5      3
Name: Cylinders, dtype: int64

Indexes¶

As we have seen so far indexes are identifiers for rows. By default an index for a DataFrame or Series is just an integer sequence but as we have seen it can also be something different for example we can set a certain column as the index.

In [66]:

Copied!

df_cars.set_index("Car")
df_cars.set_index("Car")

Out[66]:

	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
Car
Chevrolet Chevelle Malibu	18.0	8	307.0	130.0	3504.0	12.0	70	US
Buick Skylark 320	15.0	8	350.0	165.0	3693.0	11.5	70	US
Plymouth Satellite	18.0	8	318.0	150.0	3436.0	11.0	70	US
AMC Rebel SST	16.0	8	304.0	150.0	3433.0	12.0	70	US
Ford Torino	17.0	8	302.0	140.0	3449.0	10.5	70	US
...	...	...	...	...	...	...	...	...
Ford Mustang GL	27.0	4	140.0	86.0	2790.0	15.6	82	US
Volkswagen Pickup	44.0	4	97.0	52.0	2130.0	24.6	82	Europe
Dodge Rampage	32.0	4	135.0	84.0	2295.0	11.6	82	US
Ford Ranger	28.0	4	120.0	79.0	2625.0	18.6	82	US
Chevy S-10	31.0	4	119.0	82.0	2720.0	19.4	82	US

406 rows × 8 columns

We can however see that this change is not permenant. This is often the case in pandas when operating on Series or DataFrames. If we wanted these changes to be permenant we could call the method with the inplace parameter as True.

In [67]:

Copied!

df_cars
df_cars

Out[67]:

	Car	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
0	Chevrolet Chevelle Malibu	18.0	8	307.0	130.0	3504.0	12.0	70	US
1	Buick Skylark 320	15.0	8	350.0	165.0	3693.0	11.5	70	US
2	Plymouth Satellite	18.0	8	318.0	150.0	3436.0	11.0	70	US
3	AMC Rebel SST	16.0	8	304.0	150.0	3433.0	12.0	70	US
4	Ford Torino	17.0	8	302.0	140.0	3449.0	10.5	70	US
...	...	...	...	...	...	...	...	...	...
401	Ford Mustang GL	27.0	4	140.0	86.0	2790.0	15.6	82	US
402	Volkswagen Pickup	44.0	4	97.0	52.0	2130.0	24.6	82	Europe
403	Dodge Rampage	32.0	4	135.0	84.0	2295.0	11.6	82	US
404	Ford Ranger	28.0	4	120.0	79.0	2625.0	18.6	82	US
405	Chevy S-10	31.0	4	119.0	82.0	2720.0	19.4	82	US

406 rows × 9 columns

In [68]:

Copied!

df_cars.set_index("Car", inplace=True)
df_cars
df_cars.set_index("Car", inplace=True)
df_cars

Out[68]:

	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
Car
Chevrolet Chevelle Malibu	18.0	8	307.0	130.0	3504.0	12.0	70	US
Buick Skylark 320	15.0	8	350.0	165.0	3693.0	11.5	70	US
Plymouth Satellite	18.0	8	318.0	150.0	3436.0	11.0	70	US
AMC Rebel SST	16.0	8	304.0	150.0	3433.0	12.0	70	US
Ford Torino	17.0	8	302.0	140.0	3449.0	10.5	70	US
...	...	...	...	...	...	...	...	...
Ford Mustang GL	27.0	4	140.0	86.0	2790.0	15.6	82	US
Volkswagen Pickup	44.0	4	97.0	52.0	2130.0	24.6	82	Europe
Dodge Rampage	32.0	4	135.0	84.0	2295.0	11.6	82	US
Ford Ranger	28.0	4	120.0	79.0	2625.0	18.6	82	US
Chevy S-10	31.0	4	119.0	82.0	2720.0	19.4	82	US

406 rows × 8 columns

You can also sort the rows by using the index

In [69]:

Copied!

df_cars.sort_index()
df_cars.sort_index()

Out[69]:

	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
Car
AMC Ambassador Brougham	13.0	8	360.0	175.0	3821.0	11.0	73	US
AMC Ambassador DPL	15.0	8	390.0	190.0	3850.0	8.5	70	US
AMC Ambassador SST	17.0	8	304.0	150.0	3672.0	11.5	72	US
AMC Concord	19.4	6	232.0	90.0	3210.0	17.2	78	US
AMC Concord	24.3	4	151.0	90.0	3003.0	20.1	80	US
...	...	...	...	...	...	...	...	...
Volvo 145e (sw)	18.0	4	121.0	112.0	2933.0	14.5	72	Europe
Volvo 244DL	22.0	4	121.0	98.0	2945.0	14.5	75	Europe
Volvo 245	20.0	4	130.0	102.0	3150.0	15.7	76	Europe
Volvo 264gl	17.0	6	163.0	125.0	3140.0	13.6	78	Europe
Volvo Diesel	30.7	6	145.0	76.0	3160.0	19.6	81	Europe

406 rows × 8 columns

In [70]:

Copied!

df_cars.reset_index(inplace=True)
df_cars
df_cars.reset_index(inplace=True)
df_cars

Out[70]:

	Car	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
0	Chevrolet Chevelle Malibu	18.0	8	307.0	130.0	3504.0	12.0	70	US
1	Buick Skylark 320	15.0	8	350.0	165.0	3693.0	11.5	70	US
2	Plymouth Satellite	18.0	8	318.0	150.0	3436.0	11.0	70	US
3	AMC Rebel SST	16.0	8	304.0	150.0	3433.0	12.0	70	US
4	Ford Torino	17.0	8	302.0	140.0	3449.0	10.5	70	US
...	...	...	...	...	...	...	...	...	...
401	Ford Mustang GL	27.0	4	140.0	86.0	2790.0	15.6	82	US
402	Volkswagen Pickup	44.0	4	97.0	52.0	2130.0	24.6	82	Europe
403	Dodge Rampage	32.0	4	135.0	84.0	2295.0	11.6	82	US
404	Ford Ranger	28.0	4	120.0	79.0	2625.0	18.6	82	US
405	Chevy S-10	31.0	4	119.0	82.0	2720.0	19.4	82	US

406 rows × 9 columns