DataFrame from Excel Files – Mastering Python

Setting the Stage: Necessary Libraries

Before we begin, ensure you have the necessary libraries installed. Pandas is the core library for DataFrame manipulation, while openpyxl (for .xlsx files) or xlrd (for .xls files) are needed to read Excel files. You can install them using pip:

pip install pandas openpyxl xlrd

Reading Excel Files into Pandas DataFrames

Pandas provides the read_excel() function to effortlessly import data. Let’s explore its usage with several examples.

Example 1: Reading a Simple Excel File

Let’s assume you have an Excel file named data.xlsx with a single sheet named “Sheet1”. The following code reads this sheet into a DataFrame:

import pandas as pd

df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

print(df)

This code snippet uses the default behavior of read_excel(), reading the first sheet by default if sheet_name is omitted.

Example 2: Specifying the Sheet Name

If your Excel file has multiple sheets, you must specify the sheet you want to read using the sheet_name parameter:

df_sales = pd.read_excel("data.xlsx", sheet_name="Sales Data")
print(df_sales)

#Reading multiple sheets
xls = pd.ExcelFile("data.xlsx")
df_sheet1 = pd.read_excel(xls, 'Sheet1')
df_sheet2 = pd.read_excel(xls, 'Sheet2')
print(df_sheet1)
print(df_sheet2)

Example 3: Handling Different Excel File Formats

read_excel handles both .xlsx (the newer format) and .xls (the older format). The library automatically detects the file type. However, for very large xls files, consider using other libraries like xlrd which is highly optimized for them.

#Reading xls file
df_xls = pd.read_excel("data.xls") # assuming you have a data.xls file
print(df_xls)

Example 4: Selecting Specific Columns

You can import only specific columns from the Excel file to reduce memory usage and improve performance:

df_selected = pd.read_excel("data.xlsx", usecols=['Name', 'Age'])
print(df_selected)

Example 5: Skipping Rows and Handling Headers

Sometimes, your Excel file might have header rows you want to skip, or you might need to specify a different row as the header. read_excel allows for this:

df_skipped = pd.read_excel("data.xlsx", skiprows=1)
print(df_skipped)

#Specify header row
df_header = pd.read_excel("data.xlsx", header=2) #Assuming header is in the 3rd row
print(df_header)

Example 6: Handling Different Data Types

By default pandas will try to infer data types from your excel file. However, you can specify the data type for each column explicitly using the dtype argument. This might be necessary for better performance or to avoid data type errors.

df_typed = pd.read_excel("data.xlsx", dtype={'Age': 'Int64', 'Sales': 'float64'})
print(df_typed)