Basic Data Exploration in Pandas Dataframe
In today's exercise, we will perform some simple data exploration using pandas in Python. We will use a dataset that has information about various car models. The data is in a CSV file, mtcars.csv.
The notebook for this tutorial along with the dataset can be found here.
We can start by importing pandas and loading the data into a dataframe.
import pandas as pd
data = pd.read_csv('mtcars.csv')
Now that we have our data in a dataframe, we can take a peek at the data.
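One common way to peek at a dataframe is head(), which returns the first rows (five by default). The sketch below is only an illustration: it uses a small hand-typed frame named sample as a stand-in for mtcars.csv so that it runs without the file. With the tutorial's dataframe you would call data.head().

```python
import pandas as pd

# Hypothetical stand-in for mtcars.csv: a few rows typed in by hand
sample = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Hornet 4 Drive", "Valiant", "Toyota Corolla"],
    "mpg": [21.0, 22.8, 21.4, 18.1, 33.9],
    "wt": [2.620, 2.320, 3.215, 3.460, 1.835],
})

# head(n) returns the first n rows; without an argument it returns five
first_rows = sample.head(3)
print(first_rows)
```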
We can also quickly get some statistics on the data by using the describe function.
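As a minimal sketch of describe, again on a hand-typed stand-in frame named sample rather than the real file: the call returns count, mean, standard deviation, minimum, quartiles and maximum for each numeric column, and leaves out text columns such as model by default.

```python
import pandas as pd

# Hypothetical stand-in for mtcars.csv
sample = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Hornet 4 Drive", "Valiant", "Toyota Corolla"],
    "mpg": [21.0, 22.8, 21.4, 18.1, 33.9],
    "wt": [2.620, 2.320, 3.215, 3.460, 1.835],
})

# describe() summarises the numeric columns only (by default)
summary = sample.describe()
print(summary)
```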
We can also get information about the columns and datatypes of each column and the count of non-null values.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   model   32 non-null     object
 1   mpg     32 non-null     float64
 2   cyl     32 non-null     int64
 3   disp    32 non-null     float64
 4   hp      32 non-null     int64
 5   drat    32 non-null     float64
 6   wt      32 non-null     float64
 7   qsec    32 non-null     float64
 8   vs      32 non-null     int64
 9   am      32 non-null     int64
 10  gear    32 non-null     int64
 11  carb    32 non-null     int64
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB
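A summary in this format is what DataFrame.info() prints. The sketch below assumes that is the call in question and runs it on a small stand-in frame named sample. Passing a buffer through the buf argument is optional; it is used here only so that the text can be kept in a variable as well as printed.

```python
import io

import pandas as pd

# Hypothetical stand-in for mtcars.csv
sample = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Toyota Corolla"],
    "mpg": [21.0, 22.8, 33.9],
    "cyl": [6, 4, 4],
})

# info() writes to stdout by default; buf redirects the text into a buffer
buffer = io.StringIO()
sample.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```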
We can also look for null values in the dataframe.
model    0
mpg      0
cyl      0
disp     0
hp       0
drat     0
wt       0
qsec     0
vs       0
am       0
gear     0
carb     0
dtype: int64
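Per-column counts like these can be produced by chaining isnull() and sum(). In the sketch below, the stand-in frame sample deliberately contains one missing mpg value (the real dataset has none, as the output above shows) so that the count is not all zeros.

```python
import pandas as pd

# Hypothetical stand-in frame; one mpg value is deliberately left missing
sample = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Toyota Corolla"],
    "mpg": [21.0, None, 33.9],
})

# isnull() marks each missing cell as True; sum() counts them per column
null_counts = sample.isnull().sum()
print(null_counts)
```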
Now let's say we want to see which model has the maximum MPG. We can do that by finding the row in which the mpg column has its highest value, using idxmax to get the index label of that row.
model    Toyota Corolla
mpg                33.9
cyl                   4
disp               71.1
hp                   65
drat               4.22
wt                1.835
qsec               19.9
vs                    1
am                    1
gear                  4
carb                  1
Name: 19, dtype: object
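A row like the one above can be retrieved by combining idxmax with loc. This is a sketch on a hand-typed stand-in frame named sample; with the tutorial's dataframe the equivalent expression would be data.loc[data['mpg'].idxmax()].

```python
import pandas as pd

# Hypothetical stand-in for mtcars.csv
sample = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Hornet 4 Drive", "Valiant", "Toyota Corolla"],
    "mpg": [21.0, 22.8, 21.4, 18.1, 33.9],
    "wt": [2.620, 2.320, 3.215, 3.460, 1.835],
})

# idxmax() returns the index label of the largest mpg value
best_index = sample["mpg"].idxmax()

# loc[] then selects the whole row with that label
best_row = sample.loc[best_index]
print(best_row)
```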
As we can see, Toyota Corolla has the highest MPG in our dataset. If we are only interested in the name of the model, we can modify the code above by adding the name of the desired column.
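Selecting a single column amounts to passing the column name as the second argument to loc. A minimal sketch on a stand-in frame named sample:

```python
import pandas as pd

# Hypothetical stand-in for mtcars.csv
sample = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Valiant", "Toyota Corolla"],
    "mpg": [21.0, 22.8, 18.1, 33.9],
})

# Row label from idxmax(), column name "model": the result is a single value
best_model = sample.loc[sample["mpg"].idxmax(), "model"]
print(best_model)
```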
Similarly, if we want values from more than one column to be displayed, we can do that by passing the names of the columns as a list.
model    Toyota Corolla
wt                1.835
qsec               19.9
Name: 19, dtype: object
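The list form can be sketched as follows, once more on a hand-typed stand-in frame named sample. The column names match those in the output above.

```python
import pandas as pd

# Hypothetical stand-in for mtcars.csv
sample = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Valiant", "Toyota Corolla"],
    "mpg": [21.0, 22.8, 18.1, 33.9],
    "wt": [2.620, 2.320, 3.460, 1.835],
    "qsec": [16.46, 18.61, 20.22, 19.90],
})

# A list of column names returns those fields of the selected row as a Series
subset = sample.loc[sample["mpg"].idxmax(), ["model", "wt", "qsec"]]
print(subset)
```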
The opposite of idxmax is idxmin, so to get the minimum value in a column, you can use the above code but replace idxmax with idxmin.
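For illustration, here is the same lookup with idxmin on a small stand-in frame named sample. In this hand-typed sample the lowest mpg belongs to the Valiant; the full dataset may give a different answer.

```python
import pandas as pd

# Hypothetical stand-in for mtcars.csv
sample = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Valiant", "Toyota Corolla"],
    "mpg": [21.0, 22.8, 18.1, 33.9],
})

# idxmin() returns the index label of the smallest mpg value
worst_model = sample.loc[sample["mpg"].idxmin(), "model"]
print(worst_model)
```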
Another step in data exploration is looking at the correlation between variables. Pandas makes that very easy. We can have it compute a correlation matrix to give us a broad sense of the correlation between different variables.
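A correlation matrix comes from DataFrame.corr(). The sketch below runs on a hand-typed stand-in frame named sample. It passes numeric_only=True, an argument available from pandas 1.5, because recent pandas versions raise an error when the frame contains a text column such as model.

```python
import pandas as pd

# Hypothetical stand-in for mtcars.csv
sample = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Hornet 4 Drive", "Valiant", "Toyota Corolla"],
    "mpg": [21.0, 22.8, 21.4, 18.1, 33.9],
    "wt": [2.620, 2.320, 3.215, 3.460, 1.835],
    "qsec": [16.46, 18.61, 19.44, 20.22, 19.90],
})

# Pairwise Pearson correlations between the numeric columns
corr_matrix = sample.corr(numeric_only=True)
print(corr_matrix)
```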
If you are interested in the correlation between only two variables, for example between mpg and wt, you can calculate it as follows.
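For two columns, Series.corr takes the other column as its argument and returns a single number. A minimal sketch on a stand-in frame named sample; with the tutorial's dataframe the call would be data['mpg'].corr(data['wt']).

```python
import pandas as pd

# Hypothetical stand-in for mtcars.csv
sample = pd.DataFrame({
    "mpg": [21.0, 22.8, 21.4, 18.1, 33.9],
    "wt": [2.620, 2.320, 3.215, 3.460, 1.835],
})

# Pearson correlation between the two columns
mpg_wt_corr = sample["mpg"].corr(sample["wt"])
print(mpg_wt_corr)
```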
Now what if we want to look at the correlations of only one variable with all of the other variables? This is simple. First we compute the correlation matrix as mentioned above, assign it to a variable to store it, and then retrieve the correlations of any column with the others.
matrix = data.corr(numeric_only=True)  # numeric_only (pandas 1.5+) skips the text column 'model'
matrix['mpg']
mpg     1.000000
cyl    -0.852162
disp   -0.847551
hp     -0.776168
drat    0.681172
wt     -0.867659
qsec    0.418684
vs      0.664039
am      0.599832
gear    0.480285
carb   -0.550925
Name: mpg, dtype: float64
We can also sort these correlation values in descending order, so that the variables most positively correlated with mpg come first and the most negatively correlated come last.
matrix['mpg'].sort_values(ascending=False)
mpg     1.000000
drat    0.681172
vs      0.664039
am      0.599832
gear    0.480285
qsec    0.418684
carb   -0.550925
hp     -0.776168
disp   -0.847551
cyl    -0.852162
wt     -0.867659
Name: mpg, dtype: float64
By default, the sort_values function displays values in ascending order, so we set ascending to False to get the values in descending order.
It would not be fair to talk about correlation without talking about covariance.
Covariance measures how two variables vary together. It can be positive or negative. Positive covariance means that the variables tend to change in the same direction. If it is negative, the variables tend to move in opposite directions.
The syntax for covariance is similar to that for correlation, but with cov in place of corr. So to get a covariance matrix, you can simply call data.cov(numeric_only=True), where numeric_only (pandas 1.5 and later) skips the text column model.
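To make the link between the two measures concrete, the sketch below computes a covariance matrix on a hand-typed stand-in frame named sample, then divides the covariance of mpg and wt by the product of their standard deviations, which should reproduce their Pearson correlation.

```python
import pandas as pd

# Hypothetical stand-in for mtcars.csv
sample = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Hornet 4 Drive", "Valiant", "Toyota Corolla"],
    "mpg": [21.0, 22.8, 21.4, 18.1, 33.9],
    "wt": [2.620, 2.320, 3.215, 3.460, 1.835],
})

# Pairwise sample covariances between the numeric columns
cov_matrix = sample.cov(numeric_only=True)
print(cov_matrix)

# Correlation is covariance rescaled by the two standard deviations
rescaled = cov_matrix.loc["mpg", "wt"] / (sample["mpg"].std() * sample["wt"].std())
print(rescaled, sample["mpg"].corr(sample["wt"]))
```

Unlike correlation, covariance depends on the units of the columns, so its magnitude is harder to compare across pairs of variables.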
All the other tasks that we did with correlation can be done for covariance as well.