R Container for Multiple data.frame with a Brief Description of the Content of the data.frame

Question 1

For a project I have a large dataset with a multitude of variables from different questionnaires. Not all variables are required for all analyses.

So I created a preprocessing script in which subsets of variables (with and without abbreviations) are created. However, it gets confusing pretty fast. For convenience, I decided to create an index_list that holds all data.frames, as well as a data.frame called index_df that holds the name of each data.frame along with a brief description of each subversion of the dataset.

######################## Preparation / Loading #####################
# Clean Out Global Environment
rm(list=ls())
# Detach all unnecessary packages
pacman::p_unload()
# Load Required Libraries
pacman::p_load(dplyr, tidyr, gridExtra, conflicted)
# Load Data
#source("00_Preprocess.R")
#create simulation data instead
sub_data <- data.frame(x=c(2,3,5,1,6),y=c(20,30,10,302,5))
uv_newScale <- data.frame(item1=c(2,3,5,1,6),item2=c(3,5,1,3,2))
# Resolving conflicted namespaces
conflict_prefer("filter", "dplyr")
# Creating an Index 
index_list <- list("sub_data"=sub_data,
 "uv_newScale"=uv_newScale
 )
index_df <- data.frame("Data.Frame"=c("sub_data",
 "uv_newScale"),
 "Description"=c("Contains all sumscales + sociodemographics, names abbreviated",
 "Only sum scores for the UV Scale"))

I am wondering if there is a more efficient way to do this, such as saving the data.frames together with their descriptions in one container.

Answer

One approach

If the description only needs to serve as metadata, R lets you attach any number of metadata items to an object as attributes. You can add the description as an attribute of the data.frame object.

attr(sub_data, "Description") <- "Contains all sumscales + sociodemographics, names abbreviated"
# Description now shows in the data frame structure 
str(sub_data)
# 'data.frame': 5 obs. of 2 variables:
# $ x: num 2 3 5 1 6
# $ y: num 20 30 10 302 5
# - attr(*, "Description")= chr "Contains all sumscales + sociodemographics, names abbreviated"
# You can access the Description on its own
attributes(sub_data)$Description
# [1] "Contains all sumscales + sociodemographics, names abbreviated"
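To keep the overview table in sync with the data, one option is to store the description on each data frame and derive the table from those attributes. The following is a minimal sketch under that assumption; the `descriptions` vector and the `vapply()` call are illustrative choices, not part of the original script.

```r
# Simulated data, as in the question
sub_data <- data.frame(x = c(2, 3, 5, 1, 6), y = c(20, 30, 10, 302, 5))
uv_newScale <- data.frame(item1 = c(2, 3, 5, 1, 6), item2 = c(3, 5, 1, 3, 2))
index_list <- list(sub_data = sub_data, uv_newScale = uv_newScale)

# One description per data frame (this named vector is an assumption)
descriptions <- c(
  sub_data    = "Contains all sumscales + sociodemographics, names abbreviated",
  uv_newScale = "Only sum scores for the UV Scale"
)

# Attach each description to the matching data frame inside the list
for (nm in names(descriptions)) {
  attr(index_list[[nm]], "Description") <- descriptions[[nm]]
}

# Derive the overview table from the attributes instead of typing it twice
index_df <- data.frame(
  Data.Frame  = names(index_list),
  Description = vapply(index_list,
                       function(d) attr(d, "Description", exact = TRUE),
                       character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(index_df)
```

With this layout the list is the single source of truth, and `index_df` can be regenerated whenever a data frame is added or a description changes.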

However, custom attributes come with limitations. They do not persist through certain operations on an object, such as subsetting. Here is an example using the same object with the Description attribute we just added. If we subset the data, the custom attribute is lost.

sub2_data <- sub_data[,"x", drop = FALSE]
attributes(sub2_data)$Description
# NULL
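If the attribute route is still preferred, one possible workaround is to copy the description over by hand after each subsetting step. The sketch below assumes a small helper, here called `keep_description()`, which is a hypothetical name and not a base R function.

```r
sub_data <- data.frame(x = c(2, 3, 5, 1, 6), y = c(20, 30, 10, 302, 5))
attr(sub_data, "Description") <- "Contains all sumscales + sociodemographics, names abbreviated"

# Hypothetical helper: copy the Description attribute from `old` onto `new`
keep_description <- function(new, old) {
  attr(new, "Description") <- attr(old, "Description", exact = TRUE)
  new
}

# Subset first, then restore the description that subsetting dropped
sub2_data <- keep_description(sub_data[, "x", drop = FALSE], sub_data)
attr(sub2_data, "Description")
```

This keeps the metadata, but only as long as every subsetting call is wrapped, which is easy to forget in a longer script.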

Alternative approach

You can reuse the idea of a container within a container. However, instead of keeping a separate list of data frames, you can store that list as a column of the index data.frame, so each row holds a name, a description, and the data itself. This makes the collection easier to access and manipulate. You reach an inner data frame by adding a second $.

# Assigning new column `data` to hold data frames 
index_df$data <- index_list
# We can access Description 
index_df$Description
# [1] "Contains all sumscales + sociodemographics, names abbreviated"
# [2] "Only sum scores for the UV Scale" 
# Accessing Data 
index_df$data$sub_data
# x y
#1 2 20
#2 3 30
#3 5 10
#4 1 302
#5 6 5
str(index_df)
# 'data.frame': 2 obs. of 3 variables:
# $ Data.Frame : chr "sub_data" "uv_newScale"
# $ Description: chr "Contains all sumscales + sociodemographics, names abbreviated" "Only sum scores for the UV Scale"
# $ data :List of 2
# ..$ sub_data :'data.frame': 5 obs. of 2 variables:
# .. ..$ x: num 2 3 5 1 6
# .. ..$ y: num 20 30 10 302 5
# ..$ uv_newScale:'data.frame': 5 obs. of 2 variables:
# .. ..$ item1: num 2 3 5 1 6
# .. ..$ item2: num 3 5 1 3 2
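Because names, descriptions, and data now sit in the same rows, lookups can go through the index itself. The following sketch assumes a hypothetical helper, `get_dataset()`, and a plain-text search over the descriptions; neither is part of the original answer.

```r
# Rebuild the index from the question's simulated data
sub_data <- data.frame(x = c(2, 3, 5, 1, 6), y = c(20, 30, 10, 302, 5))
uv_newScale <- data.frame(item1 = c(2, 3, 5, 1, 6), item2 = c(3, 5, 1, 3, 2))
index_df <- data.frame(
  Data.Frame  = c("sub_data", "uv_newScale"),
  Description = c("Contains all sumscales + sociodemographics, names abbreviated",
                  "Only sum scores for the UV Scale"),
  stringsAsFactors = FALSE
)
index_df$data <- list(sub_data = sub_data, uv_newScale = uv_newScale)

# Hypothetical helper: return the data frame registered under `name`
get_dataset <- function(index, name) {
  hit <- which(index$Data.Frame == name)
  if (length(hit) != 1L) {
    stop("expected exactly one entry named '", name, "'")
  }
  index$data[[hit]]
}

uv <- get_dataset(index_df, "uv_newScale")

# Find data frames whose description mentions a phrase
matches <- index_df$Data.Frame[grepl("sum scores", index_df$Description, fixed = TRUE)]
```

The helper fails loudly on a missing or duplicated name, which is a design choice for this sketch; returning NULL instead would also be reasonable.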

Efficiency

The first approach has a smaller memory footprint than the second. Here is a comparison of the object sizes, in bytes, for a single data frame stored each way.

# Creating one data frame within one data frame for comparison
list_df <- list(sub_data=sub_data)
dfs <- data.frame("Data.Frame"="sub_data",
 "Description"="Contains all sumscales + sociodemographics, names abbreviated")
dfs$data <- list_df 
# Adding attributes
attr(sub_data, "Description") <- "Contains all sumscales + sociodemographics, names abbreviated"
 
# Memory Size in bytes
object.size(dfs)
# 2376 bytes
object.size(sub_data)
# 1224 bytes
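The absolute numbers depend on the R version and platform, so it can be more useful to compute the difference directly. This is a small sketch of that comparison; the variable names `with_attr`, `size_attr`, `size_nested`, and `overhead` are introduced here for illustration.

```r
sub_data <- data.frame(x = c(2, 3, 5, 1, 6), y = c(20, 30, 10, 302, 5))
description <- "Contains all sumscales + sociodemographics, names abbreviated"

# Variant 1: description stored as an attribute
with_attr <- sub_data
attr(with_attr, "Description") <- description

# Variant 2: data frame stored in a list column next to its description
dfs <- data.frame(Data.Frame = "sub_data", Description = description,
                  stringsAsFactors = FALSE)
dfs$data <- list(sub_data = sub_data)

size_attr   <- object.size(with_attr)
size_nested <- object.size(dfs)

# Extra bytes paid for the wrapper data frame and its two text columns
overhead <- as.numeric(size_nested) - as.numeric(size_attr)

cat("attribute variant:  ", format(size_attr, units = "auto"), "\n")
cat("list-column variant:", format(size_nested, units = "auto"), "\n")
cat("overhead in bytes:  ", overhead, "\n")
```

For real questionnaire data the overhead is roughly constant per entry, so it matters less as the data frames themselves grow.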

Hussain Alsalman · Accepted Answer · 2021-04-16 10:29:04Z

Perfect. Thanks, this was exactly what I was looking for.

SysRIP · Commented Apr 20, 2021 at 7:43
