Pandas Intro¶
Pandas stands for python annotated n-dimensional arrays, which describes its functionality pretty well. Pandas is a library for reading, changing, creating and analysing data. Thanks to Pandas being built on numpy it also has very high performance.
import numpy as np
import pandas as pd
Pandas consists of mainly 2 data structures, Series and Dataframes.
Series¶
A Series is a 1D labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.) however it can only contain one data type at a time. A Series also has an index, a list of axis labels or identifiers, which you can see on the left.
In simple words you can imagine a Series as a column.
From List¶
series_list = pd.Series([1, 2, 3, 4, 5, 6])
series_list
0 1 1 2 2 3 3 4 4 5 5 6 dtype: int64
type(series_list)
pandas.core.series.Series
Like a numpy array, a pandas Series has a dtype. This is often a numpy dtype but pandas has also added a few of its own.
series_list.dtype
dtype('int64')
Also just like a numpy array, you can look at the shape i.e the dimensions of a Series/DataFrame.
series_string = pd.Series(["Bob", "John", "Joe"])
series_string.shape
(3,)
series_np = pd.Series(np.array([10, 20, 30, 40, 50, 60]))
series_np
0 10 1 20 2 30 3 40 4 50 5 60 dtype: int32
From Dictionary¶
series_dict = pd.Series({"a": 0.0, "b": 1.0, "c": 2.0})
series_dict
a 0.0 b 1.0 c 2.0 dtype: float64
Custom indexes¶
Pandas generates by default an index for us which is a sequence of integers but this can be changed.
indexes = np.arange(0, 5, 1)
series_index = pd.Series(
[2 ** x for x in indexes],
index=indexes
)
series_index
0 1 1 2 2 4 3 8 4 16 dtype: int32
series_string_index = pd.Series(np.arange(1, 6, 1), index=["a", "b", "c", "d", "e"])
series_string_index
a 1 b 2 c 3 d 4 e 5 dtype: int32
Iterating over Series¶
Just like most other data structures in python, it is possible to iterate over them using simple for loops
for value in series_string_index:
print(value)
1 2 3 4 5
for key in series_string_index.keys(): # key being the row index
print(key)
a b c d e
for index in series_string_index.index:
print(index)
a b c d e
DataFrame¶
A DataFrame is a 2D labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series/List objects.
If we consider column names as keys and lists of items under that column as values, we can easily use a python dict to represent a DataFrame. Just like with the Series if we haven’t provided a row index it automatically generates a sequence of integers.
From dict of lists/numpy arrays¶
my_dict = {
"name": ["a", "b", "c", "d", "e", "f", "g"],
"age": [20, 27, 35, 55, 18, 21, 35],
"designation": ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}
df_dict = pd.DataFrame(my_dict)
df_dict
name | age | designation | |
---|---|---|---|
0 | a | 20 | VP |
1 | b | 27 | CEO |
2 | c | 35 | CFO |
3 | d | 55 | VP |
4 | e | 18 | VP |
5 | f | 21 | CEO |
6 | g | 35 | MD |
df_index = pd.DataFrame(
my_dict,
index=["First", "Second", "Third", "Fourth", "Fifth", "Sixth", "Seventh"]
)
df_index
name | age | designation | |
---|---|---|---|
First | a | 20 | VP |
Second | b | 27 | CEO |
Third | c | 35 | CFO |
Fourth | d | 55 | VP |
Fifth | e | 18 | VP |
Sixth | f | 21 | CEO |
Seventh | g | 35 | MD |
From CSV¶
df_cars = pd.read_csv("data/cars.csv", sep=";", skiprows=[1])
df_cars
Car | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model | Origin | |
---|---|---|---|---|---|---|---|---|---|
0 | Chevrolet Chevelle Malibu | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | US |
1 | Buick Skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | US |
2 | Plymouth Satellite | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | US |
3 | AMC Rebel SST | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | US |
4 | Ford Torino | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | US |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
401 | Ford Mustang GL | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | US |
402 | Volkswagen Pickup | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | Europe |
403 | Dodge Rampage | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | US |
404 | Ford Ranger | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | US |
405 | Chevy S-10 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | US |
406 rows × 9 columns
Meta Information¶
We can also look at some basic information of our DataFrame.
df_cars.shape
(406, 9)
df_cars.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 406 entries, 0 to 405 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Car 406 non-null object 1 MPG 406 non-null float64 2 Cylinders 406 non-null int64 3 Displacement 406 non-null float64 4 Horsepower 406 non-null float64 5 Weight 406 non-null float64 6 Acceleration 406 non-null float64 7 Model 406 non-null int64 8 Origin 406 non-null object dtypes: float64(5), int64(2), object(2) memory usage: 28.7+ KB
If the dtype is object then it is in most cases a string. We can also look at the first or last data entries
df_cars.head() # First 10
df_cars.head(5) # First 5
Car | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model | Origin | |
---|---|---|---|---|---|---|---|---|---|
0 | Chevrolet Chevelle Malibu | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | US |
1 | Buick Skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | US |
2 | Plymouth Satellite | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | US |
3 | AMC Rebel SST | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | US |
4 | Ford Torino | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | US |
df_cars.tail() # Last 10
df_cars.tail(5) # Last 5
Car | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model | Origin | |
---|---|---|---|---|---|---|---|---|---|
401 | Ford Mustang GL | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | US |
402 | Volkswagen Pickup | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | Europe |
403 | Dodge Rampage | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | US |
404 | Ford Ranger | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | US |
405 | Chevy S-10 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | US |
Accesing Data¶
df_cars.columns
Index(['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model', 'Origin'], dtype='object')
df_cars["Car"]
0 Chevrolet Chevelle Malibu 1 Buick Skylark 320 2 Plymouth Satellite 3 AMC Rebel SST 4 Ford Torino ... 401 Ford Mustang GL 402 Volkswagen Pickup 403 Dodge Rampage 404 Ford Ranger 405 Chevy S-10 Name: Car, Length: 406, dtype: object
This is also works but other is preferred as this can cause clashes with existing DataFrame attributes for example count
df_cars.Car
0 Chevrolet Chevelle Malibu 1 Buick Skylark 320 2 Plymouth Satellite 3 AMC Rebel SST 4 Ford Torino ... 401 Ford Mustang GL 402 Volkswagen Pickup 403 Dodge Rampage 404 Ford Ranger 405 Chevy S-10 Name: Car, Length: 406, dtype: object
You can also get Multiple columns.
df_cars[["Car", "MPG", "Cylinders"]]
Car | MPG | Cylinders | |
---|---|---|---|
0 | Chevrolet Chevelle Malibu | 18.0 | 8 |
1 | Buick Skylark 320 | 15.0 | 8 |
2 | Plymouth Satellite | 18.0 | 8 |
3 | AMC Rebel SST | 16.0 | 8 |
4 | Ford Torino | 17.0 | 8 |
... | ... | ... | ... |
401 | Ford Mustang GL | 27.0 | 4 |
402 | Volkswagen Pickup | 44.0 | 4 |
403 | Dodge Rampage | 32.0 | 4 |
404 | Ford Ranger | 28.0 | 4 |
405 | Chevy S-10 | 31.0 | 4 |
406 rows × 3 columns
The iloc function stands for integer location and allows us to easly access data by using there indexes.
df_cars.iloc[0] # row with index 0
Car Chevrolet Chevelle Malibu MPG 18.0 Cylinders 8 Displacement 307.0 Horsepower 130.0 Weight 3504.0 Acceleration 12.0 Model 70 Origin US Name: 0, dtype: object
We can also multiple rows and specify certain columns but as we are using iloc it has to be the indexes of the columns.
df_cars.iloc[[0, 1], [0, 2, 3]] # row 0 and 1 with 3 columns
Car | Cylinders | Displacement | |
---|---|---|---|
0 | Chevrolet Chevelle Malibu | 8 | 307.0 |
1 | Buick Skylark 320 | 8 | 350.0 |
We can also us slicing just like with python lists.
df_cars.iloc[0:5, 0:3]
Car | MPG | Cylinders | |
---|---|---|---|
0 | Chevrolet Chevelle Malibu | 18.0 | 8 |
1 | Buick Skylark 320 | 15.0 | 8 |
2 | Plymouth Satellite | 18.0 | 8 |
3 | AMC Rebel SST | 16.0 | 8 |
4 | Ford Torino | 17.0 | 8 |
To access data there is also the loc function which works the same way as iloc but uses the labels not the indexes.
df_cars.loc[[0, 1], ["Car", "Cylinders"]]
Car | Cylinders | |
---|---|---|
0 | Chevrolet Chevelle Malibu | 8 |
1 | Buick Skylark 320 | 8 |
df_index.loc[["First", "Second"]]
name | age | designation | |
---|---|---|---|
First | a | 20 | VP |
Second | b | 27 | CEO |
You can also use slices with loc important here is however that the end values is inclusive!
df_cars.loc[0:5, "Car":"Cylinders"]
Car | MPG | Cylinders | |
---|---|---|---|
0 | Chevrolet Chevelle Malibu | 18.0 | 8 |
1 | Buick Skylark 320 | 15.0 | 8 |
2 | Plymouth Satellite | 18.0 | 8 |
3 | AMC Rebel SST | 16.0 | 8 |
4 | Ford Torino | 17.0 | 8 |
5 | Ford Galaxie 500 | 15.0 | 8 |
df_cars["Cylinders"].value_counts()
4 207 8 108 6 84 3 4 5 3 Name: Cylinders, dtype: int64
Indexes¶
As we have seen so far indexes are identifiers for rows. By default an index for a DataFrame or Series is just an integer sequence but as we have seen it can also be something different for example we can set a certain column as the index.
df_cars.set_index("Car")
MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model | Origin | |
---|---|---|---|---|---|---|---|---|
Car | ||||||||
Chevrolet Chevelle Malibu | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | US |
Buick Skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | US |
Plymouth Satellite | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | US |
AMC Rebel SST | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | US |
Ford Torino | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | US |
... | ... | ... | ... | ... | ... | ... | ... | ... |
Ford Mustang GL | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | US |
Volkswagen Pickup | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | Europe |
Dodge Rampage | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | US |
Ford Ranger | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | US |
Chevy S-10 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | US |
406 rows × 8 columns
We can however see that this change is not permenant. This is often the case in pandas when operating on Series or DataFrames. If we wanted these changes to be permenant we could call the method with the inplace parameter as True.
df_cars
Car | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model | Origin | |
---|---|---|---|---|---|---|---|---|---|
0 | Chevrolet Chevelle Malibu | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | US |
1 | Buick Skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | US |
2 | Plymouth Satellite | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | US |
3 | AMC Rebel SST | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | US |
4 | Ford Torino | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | US |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
401 | Ford Mustang GL | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | US |
402 | Volkswagen Pickup | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | Europe |
403 | Dodge Rampage | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | US |
404 | Ford Ranger | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | US |
405 | Chevy S-10 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | US |
406 rows × 9 columns
df_cars.set_index("Car", inplace=True)
df_cars
MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model | Origin | |
---|---|---|---|---|---|---|---|---|
Car | ||||||||
Chevrolet Chevelle Malibu | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | US |
Buick Skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | US |
Plymouth Satellite | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | US |
AMC Rebel SST | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | US |
Ford Torino | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | US |
... | ... | ... | ... | ... | ... | ... | ... | ... |
Ford Mustang GL | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | US |
Volkswagen Pickup | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | Europe |
Dodge Rampage | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | US |
Ford Ranger | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | US |
Chevy S-10 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | US |
406 rows × 8 columns
You can also sort the rows by using the index
df_cars.sort_index()
MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model | Origin | |
---|---|---|---|---|---|---|---|---|
Car | ||||||||
AMC Ambassador Brougham | 13.0 | 8 | 360.0 | 175.0 | 3821.0 | 11.0 | 73 | US |
AMC Ambassador DPL | 15.0 | 8 | 390.0 | 190.0 | 3850.0 | 8.5 | 70 | US |
AMC Ambassador SST | 17.0 | 8 | 304.0 | 150.0 | 3672.0 | 11.5 | 72 | US |
AMC Concord | 19.4 | 6 | 232.0 | 90.0 | 3210.0 | 17.2 | 78 | US |
AMC Concord | 24.3 | 4 | 151.0 | 90.0 | 3003.0 | 20.1 | 80 | US |
... | ... | ... | ... | ... | ... | ... | ... | ... |
Volvo 145e (sw) | 18.0 | 4 | 121.0 | 112.0 | 2933.0 | 14.5 | 72 | Europe |
Volvo 244DL | 22.0 | 4 | 121.0 | 98.0 | 2945.0 | 14.5 | 75 | Europe |
Volvo 245 | 20.0 | 4 | 130.0 | 102.0 | 3150.0 | 15.7 | 76 | Europe |
Volvo 264gl | 17.0 | 6 | 163.0 | 125.0 | 3140.0 | 13.6 | 78 | Europe |
Volvo Diesel | 30.7 | 6 | 145.0 | 76.0 | 3160.0 | 19.6 | 81 | Europe |
406 rows × 8 columns
df_cars.reset_index(inplace=True)
df_cars
Car | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model | Origin | |
---|---|---|---|---|---|---|---|---|---|
0 | Chevrolet Chevelle Malibu | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | US |
1 | Buick Skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | US |
2 | Plymouth Satellite | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | US |
3 | AMC Rebel SST | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | US |
4 | Ford Torino | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | US |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
401 | Ford Mustang GL | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | US |
402 | Volkswagen Pickup | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | Europe |
403 | Dodge Rampage | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | US |
404 | Ford Ranger | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | US |
405 | Chevy S-10 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | US |
406 rows × 9 columns