Selecting Last Observations of Each Item and Count the Results in R

In this article, we will explore how to select the last observation for each company with a specific value in the costat variable and count the number of times each value in the dlrsn column appears per year. We will use the dplyr package for data manipulation.

Introduction

The provided data consists of companies with information about each observation for one year. The company identifier is given by the gvkey variable, and the variable costat can take two forms: “A” or “I”. We need to select the last observation for each company where costat equals “I”, count the number of times each value in the dlrsn column appears per year, and output the results.

Data Preparation

First, let’s load the required libraries and prepare the data.

library(dplyr)
library(stringr)
library(tidyr)

# Load the example dataset
example <- read.table("example", header = TRUE)

The read.table() function loads the dataset from a file named “example” with the specified header.

Filtering Rows

We will filter the rows where costat equals “I” to select only the observations with this value.

# Filter rows where costat equals "I"
example %>% 
  filter(costat == "I")

This code creates a new data frame containing only the rows where costat is equal to “I”.

Grouping by Company

Next, we will group the filtered data by the gvkey variable to select one observation per company.

# Group by gvkey and slice the last row (n = 1)
example %>% 
  filter(costat == "I") %>% 
  group_by(gvkey) %>>%
  slice_tail(n = 1)

This code groups the data by gvkey and selects the last row for each company.

Converting DLRSN to Integer

To count the number of times each value in the dlrsn column appears per year, we need to convert this column to an integer index. We can do this by matching the values with unique values or using the factor() function with specified levels and converting to an integer.

# Convert dlrsn to integer index (matching unique values)
example %>% 
  filter(costat == "I") %>% 
  group_by(gvkey) %>>%
  slice_tail(n = 1) %>>%
  ungroup() %>>%
  mutate(dlrsn = as.integer(factor(dlrsn, levels = unique(dlrsn))))

Alternatively, we can use the factor() function with specified levels and convert to an integer.

# Convert dlrsn to integer index (using factor() with levels)
example %>% 
  filter(costat == "I") %>% 
  group_by(gvkey) %>>%
  slice_tail(n = 1) %>>%
  ungroup() %>>%
  mutate(dlrsn = as.integer(factor(dlrsn, levels = c(9, 7))))

Reshaping the Data

Finally, we will reshape the data from long format to wide format using the pivot_wider() function.

# Pivot the data to return a wide format (dlrsn column)
example %>% 
  filter(costat == "I") %>% 
  group_by(gvkey) %>>%
  slice_tail(n = 1) %>>%
  ungroup() %>>%
  mutate(dlrsn = as.integer(factor(dlrsn, levels = unique(dlrsn)))) %>>%
  pivot_wider(id_cols = c(fyear), names_from = dlrsn, 
             values_from = dlrsn, values_fn = length, values_fill = 0, 
             names_prefix = "dlrsn")

This code creates a new data frame containing the desired output.

Output

The final output will be a tibble with two rows and three columns: fyear, dlrsn1, and dlrsn2.

# Final output
example %>% 
  filter(costat == "I") %>% 
  group_by(gvkey) %>>%
  slice_tail(n = 1) %>>%
  ungroup() %>>%
  mutate(dlrsn = as.integer(factor(dlrsn, levels = unique(dlrsn)))) %>>%
  pivot_wider(id_cols = c(fyear), names_from = dlrsn, 
             values_from = dlrsn, values_fn = length, values_fill = 0, 
             names_prefix = "dlrsn")

Output:

# A tibble: 2 × 3
  fyear dlrsn1 dlrsn2
  <int>  <int>  <int>
1  1977      1      0
2  1989      0      1

Last modified on 2024-06-09