For a project I have a large dataset with a multitude of variables from different questionnaires. Not all variables are required for all analyses.
So I created a preprocessing script in which subsets of variables (with and without abbreviations) are created. However, it gets confusing pretty fast.
For convenience I decided to create an index_list, which holds all data.frames, as well as a data.frame called index_df, which holds the name of each data.frame and a brief description of each subversion of the dataset.
######################## Preparation / Loading #####################
# Clean Out Global Environment
rm(list=ls())
# Detach all unnecessary packages
pacman::p_unload()
# Load Required Libraries
pacman::p_load(dplyr, tidyr, gridExtra, conflicted)
# Load Data
#source("00_Preprocess.R")
# Create simulated data instead
sub_data <- data.frame(x=c(2,3,5,1,6),y=c(20,30,10,302,5))
uv_newScale <- data.frame(item1=c(2,3,5,1,6),item2=c(3,5,1,3,2))
# Resolving conflicted namespaces
conflict_prefer("filter", "dplyr")
# Creating an Index
index_list <- list("sub_data"=sub_data,
"uv_newScale"=uv_newScale
)
index_df <- data.frame("Data.Frame"=c("sub_data",
"uv_newScale"),
"Description"=c("Contains all sumscales + sociodemographics, names abbreviated",
"Only sum scores for the UV Scale"))
I am wondering if there is a more efficient way to do this, such as saving the data.frames together with their descriptions in one container?
1 Answer
One approach
If the description is to serve only as metadata, R allows you to attach any amount of metadata to an object through attributes. You can add the description as an attribute of the data.frame object.
attr(sub_data, "Description") <- "Contains all sumscales + sociodemographics, names abbreviated"
# Description now shows in the data frame structure
str(sub_data)
# 'data.frame': 5 obs. of 2 variables:
# $ x: num 2 3 5 1 6
# $ y: num 20 30 10 302 5
# - attr(*, "Description")= chr "Contains all sumscales + sociodemographics, names abbreviated"
# You can access only the Description
attributes(sub_data)$Description
# [1] "Contains all sumscales + sociodemographics, names abbreviated"
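If several data frames need descriptions, the same attribute idea can be applied to the whole index_list at once. The following is only a sketch: the `descriptions` vector and the `described` result are names introduced here for illustration, and the data are the simulated data frames from the question.

```r
# Simulated data, mirroring the question
sub_data <- data.frame(x = c(2, 3, 5, 1, 6), y = c(20, 30, 10, 302, 5))
uv_newScale <- data.frame(item1 = c(2, 3, 5, 1, 6), item2 = c(3, 5, 1, 3, 2))
index_list <- list(sub_data = sub_data, uv_newScale = uv_newScale)

# Hypothetical named vector: one description per data frame
descriptions <- c(
  sub_data = "Contains all sumscales + sociodemographics, names abbreviated",
  uv_newScale = "Only sum scores for the UV Scale"
)

# Attach each description to its data frame, matching by name
index_list <- Map(function(df, desc) {
  attr(df, "Description") <- desc
  df
}, index_list, descriptions[names(index_list)])

# Collect all descriptions into a named character vector
described <- vapply(index_list, attr, character(1), which = "Description")
described
```

With this layout the list itself acts as the index, so a separate index_df is only needed if you want a tabular overview.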
However, custom attributes come with limitations. They do not persist through certain operations on objects, such as subsetting. Here is an example using the same object with the Description attribute we just added. If we subset the data, the custom attribute is lost.
sub2_data <- sub_data[,"x", drop = FALSE]
attributes(sub2_data)$Description
# NULL
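One possible workaround is to copy the attribute over explicitly whenever you subset. The helper below, `select_keep_description`, is a hypothetical name introduced for this sketch and not part of base R; it assumes the description is stored under the attribute name "Description".

```r
sub_data <- data.frame(x = c(2, 3, 5, 1, 6), y = c(20, 30, 10, 302, 5))
attr(sub_data, "Description") <- "Contains all sumscales + sociodemographics, names abbreviated"

# Hypothetical helper: select columns and carry the Description attribute over
select_keep_description <- function(df, cols) {
  out <- df[, cols, drop = FALSE]
  attr(out, "Description") <- attr(df, "Description", exact = TRUE)
  out
}

sub2_data <- select_keep_description(sub_data, "x")
attr(sub2_data, "Description")
```

This keeps the metadata attached, at the cost of having to route every subsetting step through the helper.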
Alternative approach
You can use the same idea of a container within a container. However, instead of creating a separate list that contains the data frames, you can store them inside the index data.frame itself, as a list column. This makes them easier to access and manipulate. You can access an inner data frame by adding a second $.
# Assigning new column `data` to hold data frames
index_df$data <- index_list
# We can access Description
index_df$Description
# [1] "Contains all sumscales + sociodemographics, names abbreviated"
# [2] "Only sum scores for the UV Scale"
# Accessing Data
index_df$data$sub_data
# x y
#1 2 20
#2 3 30
#3 5 10
#4 1 302
#5 6 5
str(index_df)
# 'data.frame': 2 obs. of 3 variables:
# $ Data.Frame : chr "sub_data" "uv_newScale"
# $ Description: chr "Contains all sumscales + sociodemographics, names abbreviated" "Only sum scores for the UV Scale"
# $ data :List of 2
# ..$ sub_data :'data.frame': 5 obs. of 2 variables:
# .. ..$ x: num 2 3 5 1 6
# .. ..$ y: num 20 30 10 302 5
# ..$ uv_newScale:'data.frame': 5 obs. of 2 variables:
# .. ..$ item1: num 2 3 5 1 6
# .. ..$ item2: num 3 5 1 3 2
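Because the names live in the Data.Frame column, a data frame can also be looked up by name rather than through the second $. This is a minimal sketch: `get_data` is a hypothetical helper introduced here, and the index is rebuilt from the question's simulated data so the example runs on its own.

```r
sub_data <- data.frame(x = c(2, 3, 5, 1, 6), y = c(20, 30, 10, 302, 5))
uv_newScale <- data.frame(item1 = c(2, 3, 5, 1, 6), item2 = c(3, 5, 1, 3, 2))

index_df <- data.frame(
  Data.Frame = c("sub_data", "uv_newScale"),
  Description = c("Contains all sumscales + sociodemographics, names abbreviated",
                  "Only sum scores for the UV Scale")
)
# List column holding the actual data frames
index_df$data <- list(sub_data = sub_data, uv_newScale = uv_newScale)

# Hypothetical helper: fetch a data frame from the index by its name
get_data <- function(index, name) {
  row <- match(name, index$Data.Frame)
  if (is.na(row)) stop("No data frame named '", name, "' in the index")
  index$data[[row]]
}

scores <- get_data(index_df, "uv_newScale")

# Overview of what the index contains, without printing the data
index_df[, c("Data.Frame", "Description")]
```

Looking up by the Data.Frame column means the lookup still works if the list column's own names are missing or out of date.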
Efficiency
The first approach is more efficient than the second in terms of memory footprint. Here is a comparison of the object sizes in bytes, using a single data frame stored each way.
# Creating one data frame within one data frame for comparison
list_df <- list(sub_data=sub_data)
dfs <- data.frame("Data.Frame"="sub_data",
"Description"="Contains all sumscales + sociodemographics, names abbreviated")
dfs$data <- list_df
# Adding attributes
attr(sub_data, "Description") <- "Contains all sumscales + sociodemographics, names abbreviated"
# Memory Size in bytes
object.size(dfs)
# 2376 bytes
object.size(sub_data)
# 1224 bytes
Perfect. Thanks, this was exactly what I was looking for. – SysRIP, Apr 20, 2021 at 7:43