Pandas Data Frame 101: Filtering Data, loc & iloc
In today’s tutorial, we will learn how to select data from a data frame. Let’s download stocks data from Yahoo Finance by clicking the download link on the page, which lets you download the data as a CSV file. This is historic data of S&P 500 stocks. I renamed the file to stocks.csv. Let’s start by importing pandas and load data from the CSV file. You can download the CVS file and the notebook containing this code here.
import pandas as pddata=pd.read_csv('stocks.csv')
Let’s look at the columns we have in our DataFrame.
We get the following output:
Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
Using the ‘loc’ property:
loc can be used to retrieve data based on labels. By default, the index column of the data frame is considered a label. So in our data, if we want to retrieve data in all columns in row 3, we will use:
Date 2020-01-15 Open 3282.27 High 3298.66 Low 3280.69 Close 3289.29 Adj Close 3289.29 Volume 3716840000 Name: 3, dtype: object
However, for the ease of readability, it might be best to assign more reasonable labels. So in our data, since it can be considered time series data, we can assign the Date columns as the index.
Now that we are using dates as labels, it will be very easy to retrieve data. Say we want to look at data from the 4th of January.
Date 2021-01-04 Open 3764.61 High 3769.99 Low 3662.71 Close 3700.65 Adj Close 3700.65 Volume 5006680000 Name: 2021-01-04, dtype: object
Now let’s say we are interested in a specific column, say the closing price. We can do that by passing the column name.
To retrieve data from more than one column, you pass on the names of the columns as a list. Similarly, if you need data from more than one rows, you can pass the labels as a list too.
print(data.loc['2021-01-04', ['Open', 'Close']])# Open and close for Janurary 4 and 5.print(data.loc[['2021-01-04','2021-01-05'], ['Open','Close']])Open 3764.61 Close 3700.65 Name: 2021-01-04, dtype: object Open Close Date 2021-01-04 3764.610107 3700.649902 2021-01-05 3698.020020 3726.860107
Now if you want to view certain columns for the whole DataFrame, you can do it as follow:
print(data.loc[:, ['Open','Close']])Open Close 0 3281.810059 3265.350098 1 3271.129883 3288.129883 2 3285.350098 3283.149902 3 3282.270020 3289.290039 4 3302.969971 3316.810059 .. ... ... 247 3764.610107 3700.649902 248 3698.020020 3726.860107 249 3712.199951 3748.139893 250 3764.709961 3803.790039 251 3815.050049 3824.679932[252 rows x 2 columns]
iloc enables you to retrieve data based on imaginary numeric index. Notice that the numeric index is imaginary, means that it is the actual index column. For example, in our data, the Date column was assigned as an index. But using iloc, you can retrieve data based on index numbers starting from 0. So for example, if you want to retrieve the row at index 0, you can do it as follow:
Date 2020-01-10 Open 3281.81 High 3282.99 Low 3260.86 Close 3265.35 Adj Close 3265.35 Volume 3212970000 Name: 0, dtype: object
In a similar fashion, you can use iloc to retrieve data from a specific row and column by providing indices for both. Say we want to retrieve data from the 2nd row and 3rd column.
One may wonder what could be the use of this. In our data, imaging you want to retrieve the last 3 rows. You can do that using iloc.
print(data.iloc[-1])Date 2021-01-08 Open 3815.05 High 3826.69 Low 3783.6 Close 3824.68 Adj Close 3824.68 Volume 4764180000 Name: 251, dtype: object
So just like a list, you can use indices to retrieve date or range of data from the beginning or end, as well as use steps. For example, you want to pick every 3rd row, you can do so.
Using iloc, we can actually set the value of cells. For example, in our data, let’s say we want to set the Volume to 0 for every other row.
Selecting data conditionally:
Pandas also provide us to pick and drop rows conditionally. For example, we want to display data only for dates on which the closing price was above the mean closing price. To do so, first we define the condition, assign it to an object and then pass that object to the loc property.
Another example would be to pick a date with a specific price. So we want to know data from the day where the closing price was 3700.649902.
condition = data.Close == 3700.649902data.loc[condition]
That’s it for now guys. Pandas truly have some superpowers and efficient data filtering is definitely one of them.