Optimizing R Script for Processing Raw Transaction Data

The code provided is an R script that processes and aggregates data from raw transaction files. Its main goal is to filter the data by date range, aggregate the sales by customer ID, quarter, and year, and save the final table to an output file.

Here are some key points about the code:

Filtering of Data: The script first filters the filenames based on the specified date range. It then reads only those files into a data frame (temptable), filters out rows outside the specified date range, and aggregates the sales.
Data Aggregation: The data is aggregated by both CUST_ID and QTR. This means that for each customer ID, the total sales are calculated for each quarter within the specified date range.
Output: The final aggregated table (mastertable) is saved to an output file named “OUTPUT.csv” located in the directory “YTD Master”.
Timing: The script prints the execution time by subtracting the start time from the end time.
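
The filter-then-aggregate step described above can be illustrated on a few rows of made-up data. This is only a sketch: the toy table and the date_range values are hypothetical, and the column names (CUST_ID, INVOICE_DT, Ext_Sale) are the ones used elsewhere in this article. It assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy transactions; column names follow the ones used in this article
toy <- data.table(
  CUST_ID    = c("A", "A", "A", "B", "B"),
  INVOICE_DT = as.Date(c("2017-01-05", "2017-02-10", "2017-04-03",
                         "2017-01-20", "2016-12-31")),
  Ext_Sale   = c(100, 50, 25, 10, 999)
)

# Hypothetical date range (both end points are included)
date_range <- as.Date(c("2017-01-01", "2017-06-30"))

# Keep rows inside the range, then total the sales per customer and quarter
toy_totals <- toy[
  INVOICE_DT %between% date_range,
  .(Ext_Sale = sum(Ext_Sale)),
  by = .(CUST_ID, QTR = quarter(INVOICE_DT))
]

print(toy_totals)
```

The 2016-12-31 row falls outside the range and is dropped before the sums are computed, so customer B ends up with a single first-quarter total.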

Improvements and Suggestions

Date Range Handling: INVOICE_DT %between% date_range can also be written as INVOICE_DT >= date_range[1] & INVOICE_DT <= date_range[2]. Both forms include the end points and select the same rows, so the choice is mainly about readability: the explicit comparison spells out the bounds and does not depend on data.table.
Error Handling: Check whether the data frame (temptable) contains missing values before aggregating, because sum() returns NA for any group that contains a missing value. The is.na() function detects missing values.
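
Both suggestions can be tried on toy data before touching the real files. The sketch below is illustrative only: the table contents and the date_range values are hypothetical, and it assumes the data.table package is installed. It compares the two ways of writing the date filter and counts missing values per column.

```r
library(data.table)

# Hypothetical toy table; column names follow the ones used in this article
temptable <- data.table(
  CUST_ID    = c("A", "B", "C", "D", "E"),
  INVOICE_DT = as.Date(c("2016-12-31", "2017-01-01", "2017-01-20",
                         "2017-02-14", "2017-02-15")),
  Ext_Sale   = c(100, 50, NA, 25, 10)
)
date_range <- as.Date(c("2017-01-01", "2017-02-14"))

# 1. Date range: data.table's operator versus the explicit comparison
in_range_between  <- temptable$INVOICE_DT %between% date_range
in_range_explicit <- temptable$INVOICE_DT >= date_range[1] &
                     temptable$INVOICE_DT <= date_range[2]
print(all(in_range_between == in_range_explicit))  # do both flag the same rows?

# 2. Missing values: count them per column before deciding how to react
na_counts <- colSums(is.na(temptable))
print(na_counts)

# Rows that are complete in every column
complete_rows <- temptable[complete.cases(temptable)]
print(nrow(complete_rows))
```

Counting missing values per column first shows where the problem sits. Whether to stop the script, drop the incomplete rows, or pass na.rm = TRUE to sum() depends on what a missing value means in your data.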

Here’s a slightly refactored version of your code that incorporates these suggestions:

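library(data.table)

# Record the start time so the elapsed time can be reported at the end
start_time <- Sys.time()

# Define parameters
in_directory <- "Raw Data"
out_directory <- "YTD Master"
filename_pattern <- "\\.txt$"  # regular expression, as expected by list.files()

# Define date range
start_date <- as.Date("2017-01-01")
end_date   <- as.Date("2017-02-14")

# Filter filenames based on the specified date range.
# Assumption: each file name contains its month as YYYY-MM (e.g. "2017-01.txt");
# adjust the matching below if your files are named differently.
all_filenames <- list.files(in_directory, pattern = filename_pattern, full.names = TRUE)
first_of_month <- as.Date(format(start_date, "%Y-%m-01"))
months_in_range <- format(seq(first_of_month, end_date, by = "1 month"), "%Y-%m")
selected_filenames <- all_filenames[
  grepl(paste(months_in_range, collapse = "|"), basename(all_filenames))
]
if (length(selected_filenames) == 0) {
  stop("No input files match the specified date range.")
}

# Read only the selected files into a single data.table (temptable).
# fread() detects the separator and header; set sep = or header = if detection fails.
temptable <- rbindlist(lapply(selected_filenames, fread))

# Make sure INVOICE_DT is a Date so it can be compared with start_date and end_date
# (assumes the files store dates as YYYY-MM-DD)
temptable[, INVOICE_DT := as.Date(INVOICE_DT)]

# Check for missing values before proceeding with further calculations
if (any(is.na(temptable))) {
  stop("Missing values found in the data.")
}

# Filter out rows outside the specified date range and aggregate sales
mastertable <- temptable[
  INVOICE_DT >= start_date & INVOICE_DT <= end_date,  # keep rows inside the range
  .(Ext_Sale = sum(Ext_Sale)),                        # total sales per group
  by = .(CUST_ID, QTR = quarter(INVOICE_DT))          # group by customer ID and quarter
]

# Save the final table to an output file
fwrite(mastertable, file.path(out_directory, "OUTPUT.csv"))

# Print execution time
print(Sys.time() - start_time)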

This refactored version states the date range bounds explicitly and checks for missing values before aggregating.

Last modified on 2024-04-24