Combining CTEs with UNION ALL to Handle Data Type Mismatches and Null Values in BigQuery

Unnest Array and Union All to Other Tables with String Value

Introduction

As a data engineer or analyst, working with complex data structures is an essential part of our job. In this blog post, we’ll explore how to handle mismatches between data types when combining arrays using UNNEST and UNION ALL. We’ll also discuss ways to cast null values to overcome these issues.

Understanding CTEs

CTEs (Common Table Expressions) are temporary views that allow us to simplify complex SQL queries. In this example, we have two CTEs: unsubscribe_logs and unsubscribe_logs_part_two.

unsubscribe_logs

This CTE is used to retrieve regular records from the dl_salesforce_marketingcloud_uk.behaviour_log_salesforce_daily_unsubscribe_v1 table.

WITH unsubscribe_logs AS (
    SELECT Subscriber_Key AS cust_sf_id,
           'Email Unsubscribe' AS engagement_type,
           Email_Att1 AS campaign_code,
           PARSE_TIMESTAMP('%m/%d/%Y %I:%M:%S %p', Insert_Local_DT) AS engagement_datetime,
           Email_Name AS details,
           CAST(NULL as STRING) AS url,
           CAST(NULL as STRING) AS link_name,
           CAST(NULL as STRING) AS domain,
           unsubscribe_type AS sub_engagement_type,
           JobID AS job_id,
           CAST(NULL as INT64) AS list_id,
           CAST(NULL as INT64) AS batch_id,
           source_timestamp AS source_timestamp,
           ingestion_timestamp AS ingestion_timestamp
    FROM `dl_salesforce_marketingcloud_uk.behaviour_log_salesforce_daily_unsubscribe_v1`
    WHERE 1=1
        AND CAST(source_timestamp AS date) >= IFNULL(filter_date,"1990-01-01")
)

unsubscribe_logs_part_two

This CTE is used to retrieve records with an array of values from the pah_andrew_test.unsubscribe_testing table.

WITH unsubscribe_logs_part_two AS (
    SELECT Subscriber_Key AS cust_sf_id,
           'Email Unsubscribe' AS engagement_type,
           Email_Att1 AS campaign_code,
           PARSE_TIMESTAMP('%m/%d/%Y %I:%M:%S %p', Insert_Local_DT) AS engagement_datetime,
           Email_Name AS details,
           CAST(NULL as STRING) AS url,
           CAST(NULL as STRING) AS link_name,
           CAST(NULL as STRING) AS domain,
           JobID AS job_id,
           CAST(NULL as INT64) AS list_id,
           CAST(NULL as INT64) AS batch_id,
           source_timestamp AS source_timestamp,
           ingestion_timestamp AS ingestion_timestamp,
           split(unsubscribe_type, " and ") AS unsub_type_parts
    FROM `pah_andrew_test.unsubscribe_testing`
    WHERE 1=1
        AND unsubscribe_type != 'Marketing' AND unsubscribe_type != 'Reminder' AND unsubscribe_type != ''
        AND CAST(source_timestamp AS date) >= IFNULL(filter_date,"1990-01-01")
)

Combining CTEs using UNION ALL

To combine the two CTEs, we use the UNION ALL operator.

all_salesforce_engagement as (
    SELECT NULLIF(cust_sf_id, '') AS cust_sf_id,
           engagement_type,
           NULLIF(campaign_code, '') AS campaign_code,
           engagement_datetime,
           NULLIF(details, '') AS details,
           NULLIF(url, '') AS url,
           NULLIF(link_name, '') AS link_name,
           NULLIF(domain, '') AS domain,
           NULLIF(TRIM(sub_engagement_type), '') AS sub_engagement_type,
           CAST(job_id AS STRING) AS job_id,
           CAST(list_id AS STRING) AS list_id,
           CAST(batch_id AS STRING) AS batch_id,
           source_timestamp AS source_timestamp,
           ingestion_timestamp AS ingestion_timestamp

    FROM (
        SELECT * FROM unsubscribe_logs
        UNION ALL
        SELECT * EXCEPT (unsub_type_parts) FROM unsubscribe_logs_part_two LEFT JOIN unnest(unsub_type_parts) AS sub_engagement_type)
)

Mismatches and Null Values

When combining the two CTEs, we encounter a mismatch between data types. The unsubscribe_logs CTE has null values in the unsub_type_parts column, while the unsubscribe_logs_part_two CTE returns an array of strings.

To cast the null values to a string type, we can use the COALESCE function or the NULLIF function.

all_salesforce_engagement as (
    SELECT NULLIF(cust_sf_id, '') AS cust_sf_id,
           engagement_type,
           NULLIF(campaign_code, '') AS campaign_code,
           engagement_datetime,
           COALESCE(details, '') AS details,
           COALESCE(url, '') AS url,
           COALESCE(link_name, '') AS link_name,
           COALESCE(domain, '') AS domain,
           NULLIF(TRIM(sub_engagement_type), '') AS sub_engagement_type,
           CAST(job_id AS STRING) AS job_id,
           CAST(list_id AS STRING) AS list_id,
           CAST(batch_id AS STRING) AS batch_id,
           source_timestamp AS source_timestamp,
           ingestion_timestamp AS ingestion_timestamp

    FROM (
        SELECT * FROM unsubscribe_logs
        UNION ALL
        SELECT * EXCEPT (unsub_type_parts) FROM unsubscribe_logs_part_two LEFT JOIN unnest(unsub_type_parts) AS sub_engagement_type)
)

In the above code, we use COALESCE to return an empty string instead of null values.

Another approach is to use the NULLIF function to cast null values to a string type. However, this might lead to unexpected results if you’re not careful.

all_salesforce_engagement as (
    SELECT NULLIF(cust_sf_id, '') AS cust_sf_id,
           engagement_type,
           NULLIF(campaign_code, '') AS campaign_code,
           engagement_datetime,
           NULLIF(details, '') AS details,
           NULLIF(url, '') AS url,
           NULLIF(link_name, '') AS link_name,
           NULLIF(domain, '') AS domain,
           NULLIF(TRIM(sub_engagement_type), '') AS sub_engagement_type,
           CAST(job_id AS STRING) AS job_id,
           CAST(list_id AS STRING) AS list_id,
           CAST(batch_id AS STRING) AS batch_id,
           source_timestamp AS source_timestamp,
           ingestion_timestamp AS ingestion_timestamp

    FROM (
        SELECT * FROM unsubscribe_logs
        UNION ALL
        SELECT * EXCEPT (unsub_type_parts) FROM unsubscribe_logs_part_two LEFT JOIN unnest(unsub_type_parts) AS sub_engagement_type)
)

In this example, we use NULLIF to cast null values to an empty string. However, if the original value is not a number, using NULLIF might lead to unexpected results.

Conclusion

Combining CTEs using UNNEST and UNION ALL can be challenging when dealing with data type mismatches. By using COALESCE or NULLIF, we can cast null values to a string type and overcome these issues. However, it’s essential to carefully consider the implications of using these functions in your specific use case.

In addition to casting null values, you can also modify your query to avoid the mismatch altogether. For example, you could join the two tables instead of using UNION ALL, or you could transform the data before combining it.

By understanding how to handle mismatches between data types, you can write more robust and efficient SQL queries that accurately combine complex data structures.


Last modified on 2025-03-03