Sorting Data in DataFrame

pandas
Published

November 26, 2023

Understanding DataFrame Sorting

Before diving into the code, let’s establish the fundamentals. Pandas DataFrames allow sorting by one or more columns, in ascending or descending order. The sort_values() method is your primary tool for this task.

Sorting by a Single Column

Let’s start with the simplest scenario: sorting a DataFrame by a single column. We’ll use a sample DataFrame for demonstration:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

sorted_df_age_asc = df.sort_values('Age')
print("\nSorted by Age (ascending):\n", sorted_df_age_asc)

sorted_df_age_desc = df.sort_values('Age', ascending=False)
print("\nSorted by Age (descending):\n", sorted_df_age_desc)

This code snippet first creates a sample DataFrame. Then, it demonstrates sorting by the ‘Age’ column, first in ascending order (the default) and then in descending order using the ascending parameter.

Sorting by Multiple Columns

Sorting by multiple columns involves specifying the columns in a list and optionally setting the ascending parameter for each column individually.

sorted_df_city_age = df.sort_values(['City', 'Age'])
print("\nSorted by City then Age:\n", sorted_df_city_age)

sorted_df_city_age_mixed = df.sort_values(['City', 'Age'], ascending=[True, False])
print("\nSorted by City (asc) then Age (desc):\n", sorted_df_city_age_mixed)

Here, we sort first by ‘City’ and then by ‘Age’ within each city. The second example shows how to specify different sorting orders for each column.

In-place Sorting

To modify the DataFrame directly without creating a new one, use the inplace parameter:

df.sort_values('Age', inplace=True)
print("\nDataFrame sorted in-place:\n", df)

The inplace=True argument modifies the original DataFrame instead of returning a sorted copy. Use this with caution, as it alters the original data.

Sorting with NaNs

Handling missing values (NaNs) during sorting requires careful consideration. By default, NaNs are placed at the end. You can control this behavior using the na_position parameter:

df_with_nan = pd.DataFrame({'A': [1, 2, None, 4]})

sorted_df_nan_end = df_with_nan.sort_values('A')
print("\nNaNs at the end:\n", sorted_df_nan_end)

sorted_df_nan_begin = df_with_nan.sort_values('A', na_position='first')
print("\nNaNs at the beginning:\n", sorted_df_nan_begin)

This shows how to position NaNs either at the beginning or end of the sorted DataFrame.

Leveraging sort_index()

For sorting by the DataFrame’s index, use the sort_index() method:

#Sort by Index
df.sort_index(inplace=True)
print("\nDataFrame sorted by index:\n", df)

This provides another way to organize your data based on the index values rather than column values.