Basic Data Exploration in Pandas Dataframe

In this exercise, we will perform some simple data exploration using pandas in Python. We will use a dataset with information about various car models, stored in a CSV file, mtcars.csv.

The notebook for this tutorial along with the dataset can be found here.

We can start by importing pandas and loading the data into a dataframe.

import pandas as pd
data = pd.read_csv('mtcars.csv')
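If mtcars.csv is not at hand, the same call can be tried on an in-memory CSV. The sketch below assumes a tiny three-row sample whose values are made up for illustration (they are not taken from mtcars.csv); the names sample_csv and sample are likewise only illustrative.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration
# and are not taken from mtcars.csv.
sample_csv = io.StringIO(
    "model,mpg,cyl,wt\n"
    "Car A,21.0,6,2.6\n"
    "Car B,30.5,4,1.9\n"
    "Car C,15.2,8,3.8\n"
)

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(sample_csv)

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # dtype inferred for each column
```

Checking shape and dtypes right after loading is a quick way to confirm that the file was parsed the way you expected.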

Now that we have our data in a dataframe, we can take a peek at the data.

data.head()

We can also quickly get summary statistics for the numeric columns by using the describe method.

data.describe()

We can also use the info method to list each column with its data type and its count of non-null values.

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   model   32 non-null     object 
 1   mpg     32 non-null     float64
 2   cyl     32 non-null     int64  
 3   disp    32 non-null     float64
 4   hp      32 non-null     int64  
 5   drat    32 non-null     float64
 6   wt      32 non-null     float64
 7   qsec    32 non-null     float64
 8   vs      32 non-null     int64  
 9   am      32 non-null     int64  
 10  gear    32 non-null     int64  
 11  carb    32 non-null     int64  
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB

We can also look for null values in the dataframe.

data.isnull().sum()

model    0
mpg      0
cyl      0
disp     0
hp       0
drat     0
wt       0
qsec     0
vs       0
am       0
gear     0
carb     0
dtype: int64
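Our dataset has no missing values, so the output above is all zeros. As a rough sketch of what the same check looks like when values are missing, the example below uses a small made-up dataframe (the sample data and variable names are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing values, made up for illustration.
sample = pd.DataFrame({
    "model": ["Car A", "Car B", "Car C"],
    "mpg": [21.0, np.nan, 15.2],
    "hp": [110.0, 65.0, np.nan],
})

# Count of missing values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = sample[sample.isnull().any(axis=1)]
print(rows_with_missing)
```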

Now let's say we want to see which model has the maximum MPG. We can do that by finding the row in which the mpg column has its highest value.

data.loc[data['mpg'].idxmax()]

model    Toyota Corolla
mpg                33.9
cyl                   4
disp               71.1
hp                   65
drat               4.22
wt                1.835
qsec               19.9
vs                    1
am                    1
gear                  4
carb                  1
Name: 19, dtype: object

As we can see, Toyota Corolla has the highest MPG in our dataset. If we are only interested in the name of the model, we can modify the code above by adding the name of the desired column.

data.loc[data['mpg'].idxmax()]['model']

'Toyota Corolla'

Similarly, if we want values from more than one column to be displayed, we can do that by passing the names of the columns as a list.

data.loc[data['mpg'].idxmax()][['model','wt','qsec']]

model    Toyota Corolla
wt                1.835
qsec               19.9
Name: 19, dtype: object

The opposite of idxmax is idxmin, so to get the row with the minimum value in a column, you can use the code above but replace idxmax with idxmin.
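As a minimal sketch, here is idxmin on a small made-up dataframe (the values and the names sample and lowest are assumptions for illustration). It also shows an alternative way to select columns: passing the row label and the column label to .loc in a single call instead of indexing twice.

```python
import pandas as pd

# Hypothetical sample, made up for illustration.
sample = pd.DataFrame({
    "model": ["Car A", "Car B", "Car C"],
    "mpg": [21.0, 30.5, 15.2],
    "wt": [2.6, 1.9, 3.8],
})

# idxmin returns the index label of the row holding the smallest value.
lowest = sample["mpg"].idxmin()

# Row label and column label(s) passed to .loc in one call.
print(sample.loc[lowest, "model"])
print(sample.loc[lowest, ["model", "wt"]])
```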

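Another step in data exploration is looking at correlation between variables. Pandas makes that very easy. We can have it compute a correlation matrix to give us a broad sense of the correlation between different variables. On pandas 2.0 and later, corr no longer skips non-numeric columns by default, so we pass numeric_only=True (available since pandas 1.5) to leave out the text column model.

data.corr(numeric_only=True)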

If we are interested in the correlation between only two variables, for example between mpg and wt, we can calculate it as follows.

data.mpg.corr(data.wt)

-0.8676593765172279
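By default, corr computes the Pearson correlation coefficient. To make the number less of a black box, the sketch below computes it by hand on a small made-up sample and compares the result with pandas (the sample values and variable names are assumptions for illustration, not rows from mtcars.csv).

```python
import numpy as np
import pandas as pd

# Hypothetical sample, made up for illustration.
sample = pd.DataFrame({
    "mpg": [21.0, 30.5, 15.2, 24.4],
    "wt": [2.6, 1.9, 3.8, 3.2],
})

# Deviation of each value from its column mean.
dx = sample["mpg"] - sample["mpg"].mean()
dy = sample["wt"] - sample["wt"].mean()

# Pearson's r: sum of the products of deviations, scaled by their magnitudes.
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

pandas_r = sample["mpg"].corr(sample["wt"])
print(manual_r, pandas_r)
```

The two printed values should agree up to floating-point rounding.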

Now what if we want to look at the correlations of only one variable with all of the other variables? This is simple. First we compute the correlation matrix, assign it to a variable, and then retrieve the correlations of any one column with the others. Passing numeric_only=True skips the text column model, which pandas 2.0 and later would otherwise reject.

matrix = data.corr(numeric_only=True)

matrix['mpg']

mpg     1.000000
cyl    -0.852162
disp   -0.847551
hp     -0.776168
drat    0.681172
wt     -0.867659
qsec    0.418684
vs      0.664039
am      0.599832
gear    0.480285
carb   -0.550925
Name: mpg, dtype: float64

We can also sort these correlation values in descending order. Values near 1 or -1 indicate the strongest linear relationships with mpg, so the strongest negative correlations appear at the bottom of the list.

matrix['mpg'].sort_values(ascending = False)

mpg     1.000000
drat    0.681172
vs      0.664039
am      0.599832
gear    0.480285
qsec    0.418684
carb   -0.550925
hp     -0.776168
disp   -0.847551
cyl    -0.852162
wt     -0.867659
Name: mpg, dtype: float64

By default, the sort_values function returns values in ascending order, so we set ascending to False to get them in descending order.
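Sorting the signed values places strong negative correlations at the end. If the goal is to rank variables by strength of relationship regardless of sign, one option is to sort by absolute value. The sketch below uses a made-up Series of correlations (the labels and numbers are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical correlations with a target variable, made up for illustration.
corr_with_target = pd.Series({"a": 0.68, "b": -0.87, "c": 0.42, "d": -0.55})

# Order the labels by magnitude, then reindex the original Series
# so that the signs are kept in the result.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```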

It would not be fair to talk about correlation without talking about covariance.

Covariance measures how two variables vary together. It can be positive or negative. Positive covariance means that the variables tend to change in the same direction. If it is negative, the variables tend to move in opposite directions. Unlike correlation, its magnitude depends on the units of the variables.

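The syntax for covariance is similar to correlation, but replace corr with cov. So to get a covariance matrix, you can simply use the following (numeric_only=True leaves out the text column model, which pandas 2.0 and later would otherwise reject):

data.cov(numeric_only=True)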

All the other tasks that we did with correlation can be done for covariance as well.
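Covariance and correlation are closely linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. A minimal sketch on a small made-up sample (values and variable names are assumptions for illustration) checks this relationship.

```python
import pandas as pd

# Hypothetical sample, made up for illustration.
sample = pd.DataFrame({
    "mpg": [21.0, 30.5, 15.2, 24.4],
    "wt": [2.6, 1.9, 3.8, 3.2],
})

# Sample covariance between the two columns.
cov_xy = sample["mpg"].cov(sample["wt"])

# Dividing by both sample standard deviations removes the units
# and yields the correlation coefficient.
r_from_cov = cov_xy / (sample["mpg"].std() * sample["wt"].std())

print(cov_xy)
print(r_from_cov, sample["mpg"].corr(sample["wt"]))
```

This is why correlation always falls between -1 and 1, while covariance can take any value depending on the scale of the data.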
