Filtering and Joining Data in Spark SQL to Display Full Names

When working with data in Spark SQL, it’s not uncommon to encounter missing or null values. In this article, we’ll explore a common challenge: how to display full names for individuals who have logged in and those who haven’t. We’ll delve into filtering, joining, and selecting data to achieve this goal.

Problem Description

The problem at hand involves a table with an ID column, which uniquely identifies each person. However, when the same person appears multiple times in different categories, their name may be null if they haven’t logged in. The task is to show all values with a certain ID along with their corresponding names.

Let’s consider the example provided:

ID	First Name	Last Name	Login	Date
1245	Matt	Carter	Yes	12-03-2022
2344	Emily	Seuss	Yes	12-01-2022
1245	NULL	NULL	No	11-04-2022
4266	Drew	Bob	Yes	10-03-2022

We want to display the full names for each person, including those who haven’t logged in.

Approach Overview

To solve this problem, we’ll use a combination of filtering, joining, and selecting data. Here’s an overview of our approach:

Filter out rows with missing or null values in the FirstName or LastName columns.
Join the filtered dataframe with the original table on the ID column to include the full names.
Select only the desired columns from the joined dataframe.

Step 1: Filtering Out Missing Values

We’ll start by filtering out rows with missing or null values in the FirstName or LastName columns. This will help us focus on individuals who have logged in and provide their full names.

// Create a new dataframe with only non-null values
df = spark.createDataFrame(data, ["ID", "FirstName", "LastName", "Login", "Date"])
.filter("FirstName is not null or LastName is not null")

Step 2: Joining the Filtered Dataframe

Next, we’ll join the filtered dataframe with the original table on the ID column. This will allow us to include the full names for each person.

// Create a new dataframe with the full names
nameDf = df.selectExpr("id as nameDf_ID", "FirstName as nameDf_FirstName", "LastName as nameDf_LastName")
.distinct()

// Join the filtered dataframe with the original table on the ID column
df_joined = df.join(nameDf, df.ID == nameDf.nameDf_ID)

Step 3: Selecting Desired Columns

Finally, we’ll select only the desired columns from the joined dataframe. We want to display the full names along with the ID, Login, and Date columns.

// Select only the desired columns from the joined dataframe
df_final = df_joined.selectExpr("ID", "nameDf_FirstName as FirstName", "nameDf_LastName as LastName", "Login", "Date")

Example Use Case

Let’s go back to our example data:

ID	First Name	Last Name	Login	Date
1245	Matt	Carter	Yes	12-03-2022
2344	Emily	Seuss	Yes	12-01-2022
1245	NULL	NULL	No	11-04-2022
4266	Drew	Bob	Yes	10-03-2022

By following the steps outlined above, we can display the full names for each person, including those who haven’t logged in:

+—-+———+——–+—–+———-+ | ID|FirstName|LastName|Login| Date| +—-+———+——–+—–+———-+ |1245| Matt| Carter| Yes|12-03-2022| |1245| Matt| Carter| No|11-04-2022| |2344| Emily| Seuss| Yes|12-01-2022| |4266| Drew| Bob| Yes|10-03-2022| +—-+———+——–+—–+———-+

Conclusion

In this article, we explored a common challenge in Spark SQL: how to display full names for individuals who have logged in and those who haven’t. By using a combination of filtering, joining, and selecting data, we were able to achieve our goal.

We hope that this tutorial has provided you with a better understanding of Spark SQL’s capabilities and how to tackle complex data manipulation tasks. If you have any questions or need further clarification on any of the steps outlined in this article, please don’t hesitate to reach out.

Last modified on 2023-05-09