top of page
  • doctorsmonsters

Pandas Data Frame 101: Filtering Data, loc & iloc


In today’s tutorial, we will learn how to select data from a data frame. Let’s download stocks data from Yahoo Finance by clicking the download link on the page, which lets you download the data as a CSV file. This is historic data of S&P 500 stocks. I renamed the file to stocks.csv. Let’s start by importing pandas and load data from the CSV file. You can download the CVS file and the notebook containing this code here.

import pandas as pddata=pd.read_csv('stocks.csv')

Let’s look at the columns we have in our DataFrame.

data.columns

We get the following output:

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

Using the ‘loc’ property:

loc can be used to retrieve data based on labels. By default, the index column of the data frame is considered a label. So in our data, if we want to retrieve data in all columns in row 3, we will use:

data.loc[3]

Output:

Date         2020-01-15
Open            3282.27
High            3298.66
Low             3280.69
Close           3289.29
Adj Close       3289.29
Volume       3716840000
Name: 3, dtype: object

However, for the ease of readability, it might be best to assign more reasonable labels. So in our data, since it can be considered time series data, we can assign the Date columns as the index.

data.index=data['Date']

Now that we are using dates as labels, it will be very easy to retrieve data. Say we want to look at data from the 4th of January.

data.loc['2021-01-04']

Output:

Date         2021-01-04
Open            3764.61
High            3769.99
Low             3662.71
Close           3700.65
Adj Close       3700.65
Volume       5006680000
Name: 2021-01-04, dtype: object

Now let’s say we are interested in a specific column, say the closing price. We can do that by passing the column name.

data.loc['2021-01-04', 'Close']

Output:

3700.649902

To retrieve data from more than one column, you pass on the names of the columns as a list. Similarly, if you need data from more than one rows, you can pass the labels as a list too.

print(data.loc['2021-01-04', ['Open', 'Close']])# Open and close for Janurary 4 and 5.print(data.loc[['2021-01-04','2021-01-05'], ['Open','Close']])Open     3764.61
Close    3700.65
Name: 2021-01-04, dtype: object
                   Open        Close
Date                                
2021-01-04  3764.610107  3700.649902
2021-01-05  3698.020020  3726.860107

Now if you want to view certain columns for the whole DataFrame, you can do it as follow:

​print(data.loc[:, ['Open','Close']])Open        Close
0    3281.810059  3265.350098
1    3271.129883  3288.129883
2    3285.350098  3283.149902
3    3282.270020  3289.290039
4    3302.969971  3316.810059
..           ...          ...
247  3764.610107  3700.649902
248  3698.020020  3726.860107
249  3712.199951  3748.139893
250  3764.709961  3803.790039
251  3815.050049  3824.679932[252 rows x 2 columns]

Using “iloc”:

iloc enables you to retrieve data based on imaginary numeric index. Notice that the numeric index is imaginary, means that it is the actual index column. For example, in our data, the Date column was assigned as an index. But using iloc, you can retrieve data based on index numbers starting from 0. So for example, if you want to retrieve the row at index 0, you can do it as follow:

data.iloc[0]

Output:

Date         2020-01-10
Open            3281.81
High            3282.99
Low             3260.86
Close           3265.35
Adj Close       3265.35
Volume       3212970000
Name: 0, dtype: object

In a similar fashion, you can use iloc to retrieve data from a specific row and column by providing indices for both. Say we want to retrieve data from the 2nd row and 3rd column.

data.iloc[2,3]

Output:

3277.189941

One may wonder what could be the use of this. In our data, imaging you want to retrieve the last 3 rows. You can do that using iloc.

print(data.iloc[-1])Date         2021-01-08
Open            3815.05
High            3826.69
Low              3783.6
Close           3824.68
Adj Close       3824.68
Volume       4764180000
Name: 251, dtype: object

So just like a list, you can use indices to retrieve date or range of data from the beginning or end, as well as use steps. For example, you want to pick every 3rd row, you can do so.

data.iloc[::3]

Using iloc, we can actually set the value of cells. For example, in our data, let’s say we want to set the Volume to 0 for every other row.

In [81]:

data.iloc[::2,6]=0

Selecting data conditionally:

Pandas also provide us to pick and drop rows conditionally. For example, we want to display data only for dates on which the closing price was above the mean closing price. To do so, first we define the condition, assign it to an object and then pass that object to the loc property.

high_close=data.Close>data.Close.mean()data.loc[high_close]

Another example would be to pick a date with a specific price. So we want to know data from the day where the closing price was 3700.649902.

condition = data.Close == 3700.649902data.loc[condition]

That’s it for now guys. Pandas truly have some superpowers and efficient data filtering is definitely one of them.

Comments


Post: Blog2_Post
bottom of page