top of page
  • doctorsmonsters

Pandas Data Frame 101: Creating Data Frame & Loading Data



What is pandas Data Frame?

The simplest way of describing Data Frame is to think of it as a spreadsheet or table, with rows and columns. Columns are labeled by column names and row have an index column. However, Data Frame is very efficient for large amount of data. It makes applying functions to the data very efficient, both in terms of coding as well as runtime. It is probably the most commonly used pandas data object.

First of all we will import pandas as pd:


import pandas as pd

Creating a Data Frame object:

There are a few ways of creating Data Frame. Let’s start by using dictionaries. We can define our data in the form of a dictionary where each column is represented by an item in the dictionary. The key is the column name and the value contains a list of the data items. Let’s look at the following example. We have a dictionary containing our data.


data = {‘Name’:[‘John’, ‘Mike’, ‘Samantha’], ‘Score’:[6,3,9], ‘Outcome’:[‘Pass’, ’Fail’, ’Pass’]}

Now we can convert it to a pandas Data Frame with one line:


scores = pd.DataFrame(data)

We can look into the contents of our Data Frame by printing it out.


print(scores)

The output will look like bellow:


Name  Score Outcome
0      John      6    Pass
1      Mike      3    Fail
2  Samantha      9    Pass

In the previous example, we used a dictionary to make a Data Frame. Now we will see how we can make a Data Frame from a list of dictionaries. For such a list, think of each item, which is a dictionary, representing each row, with the key holding the name of the column and the value holding the value for that row.


# Make list of dictionaries
data = [{‘Name’: ‘John’, ‘Score’: 6, ‘Outcome’:’Pass’ },
 {‘Name’: ‘Mike’, ‘Score’: 3, ‘Outcome’:’Fail’ },
 {‘Name’: ‘Samantha’, ‘Score’: 9, ‘Outcome’:’Pass’ },
 ]# Create DataFrame from the list
scores = pd.DataFrame(data=data)
print(scores)

The final method of creating Data Frame from provided data that we will discuss is from a list of lists. Here, you have a list containing column names, and another list of lists with each item list containing data for each row in the Data Frame. It is important that the items remain in the same order as the list containing column names.


# Create a list of lists
data = [[‘John’, 6, ‘Pass’],
 [‘Mike’, 3, ‘Fail’],
 [‘Samantha’,9, ‘Pass’]]# Define the column names
columns = [‘Name’, ‘Score’, ‘Outcome’]# Create a DataFrame with the data and column names
scores = pd.DataFrame(data=data, columns=columns)
print(scores)

Loading data from files:

One of the best things about pandas is its ease of implementation. In case of creating a Data Frame from data in a file, pandas does all the heavy lifting for you. Imagine you have all the scores of your students in a csv file named scores.csv, here’s how you will be loading it into a Data Frame:


# Read the data
scores = pd.read_csv(‘scores.csv’)

Another common file format that you have is Excel spreadsheets. As you may know, one Excel file may contain many spreadsheets, we can specify which spreadsheet the data should be loaded from. For example, say we have a file, scores.xlsx, with a spreadsheet representing a term exam and we are specifically interested in the sheet for midterm.


scores = pd.read_excel(‘scores.xlsx’, sheet_name= ‘midterm’, na_values = ‘None’)

The na_values tells the code to replace any empty field with “None”. You can run the code without it but applying it tells the code how to handle an empty field and reduces the chances of error.

Now what if you had to load data from all the sheets into the Data Frame. To do so, we create and Excel file object, a list of all the sheets in the file, load data from each sheet in a list that is appended to a list of lists with each item/list representing each sheet. All the lists are then concatenated into a Data Frame. I am aware I did not do a good job of describing it but it will make sense when we do this step by step.


# Create pd.ExcelFile() objectxls = pd.ExcelFile(‘scores.xlsx’)# Extract sheet names and store in exchangesexams = xls.sheet_names# Create an empty list: exam_sheets
exam_sheets=[]# Now we iterate over the list containing names of the sheets (exams), loading data into a Data Frame and then appending that to exam_sheetsfor exam in exams:
    #load data into a Data Frame
    exam_sheet = pd.read_excel(xls, sheetname=exam, na_values=’n/a’)    #Add a column containing name of the sheet
    exam_sheet[‘Exam’] = exam    #append the Data Frame to the list
    exam_sheets.append(exam_sheet)# Concatenate the listings: listing_data
scores = pd.concat(exam_sheets)

There you go! This was a basic tutorial to show how to create Data Frame and fill it with data.

16 views0 comments

Comments


Post: Blog2_Post
bottom of page