DataFrame in Pandas

pandas
Published

February 22, 2023

Creating DataFrames

There are several ways to create a DataFrame. The most common are from dictionaries and lists.

From a Dictionary:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}

df = pd.DataFrame(data)
print(df)

This code snippet creates a DataFrame from a dictionary where keys become column names and values become column data.

From a List of Lists:

data = [['Alice', 25, 'New York'],
        ['Bob', 30, 'London'],
        ['Charlie', 28, 'Paris']]

df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

Here, a list of lists is used, requiring explicit column name specification.

Accessing Data

Retrieving data from a DataFrame is straightforward. You can access columns by name:

print(df['Name'])  # Accesses the 'Name' column
print(df[['Name', 'Age']]) # Accesses multiple columns

Individual rows can be accessed using .loc (label-based indexing) or .iloc (integer-based indexing):

print(df.loc[0])  # Accesses the first row by label (index 0)
print(df.iloc[1]) # Accesses the second row by integer location

Data Manipulation

Pandas DataFrames offer a rich set of functionalities for data manipulation. Here are a few examples:

Adding a New Column:

df['Country'] = ['USA', 'UK', 'France']
print(df)

Filtering Data:

filtered_df = df[df['Age'] > 28]
print(filtered_df)

This filters the DataFrame to include only rows where the ‘Age’ is greater than 28.

Sorting Data:

sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)

This sorts the DataFrame by the ‘Age’ column in descending order.

Handling Missing Data

Missing data is a common problem. Pandas handles this gracefully using NaN (Not a Number) values.

df['Salary'] = [50000, 60000, float('NaN')]
print(df)
print(df.dropna()) # Removes rows with missing values

dropna() removes rows with missing values. Other methods like fillna() allow you to replace missing values with a specific value or calculated statistic.

Working with CSV Files

DataFrames excel at importing and exporting data from various sources. CSV (Comma Separated Values) files are particularly common:

Reading from a CSV:

df_csv = pd.read_csv("data.csv") # Assumes 'data.csv' is in your working directory.
print(df_csv)

Writing to a CSV:

df.to_csv("output.csv", index=False) # index=False prevents writing the index to the file.