\$\begingroup\$

For a project I have a large dataset with a multitude of variables from different questionnaires. Not all variables are required for all analyses.

So I created a preprocessing script in which subsets of variables (with and without abbreviations) are created. However, it gets confusing pretty fast. For convenience, I decided to create an index_list, which holds all data.frames, as well as a data.frame called index_df, which holds the name of each data.frame together with a brief description of each subversion of the dataset.

######################## Preparation / Loading #####################
# Clean Out Global Environment
rm(list=ls())
# Detach all unnecessary packages
pacman::p_unload()
# Load Required Libraries
pacman::p_load(dplyr, tidyr, gridExtra, conflicted)
# Load Data
#source("00_Preprocess.R")
# Create simulated data instead
sub_data <- data.frame(x=c(2,3,5,1,6),y=c(20,30,10,302,5))
uv_newScale <- data.frame(item1=c(2,3,5,1,6),item2=c(3,5,1,3,2))
# Resolving conflicted namespaces
conflict_prefer("filter", "dplyr")
# Creating an Index 
index_list <- list("sub_data"=sub_data,
 "uv_newScale"=uv_newScale
 )
index_df <- data.frame("Data.Frame"=c("sub_data",
 "uv_newScale"),
 "Description"=c("Contains all sumscales + sociodemographics, names abbreviated",
 "Only sum scores for the UV Scale"))

I am wondering whether there is a more efficient way to do this, for example by saving the data.frames together with their descriptions in one container?

asked Mar 31, 2021 at 11:41
\$\endgroup\$

1 Answer

\$\begingroup\$

One approach

If the description serves only as metadata, R lets you attach any number of metadata items to an object through its attributes. You can add the description as an attribute of the data.frame object.

attr(sub_data, "Description") <- "Contains all sumscales + sociodemographics, names abbreviated"
# Description now shows in the data frame structure 
str(sub_data)
# 'data.frame': 5 obs. of 2 variables:
# $ x: num 2 3 5 1 6
# $ y: num 20 30 10 302 5
# - attr(*, "Description")= chr "Contains all sumscales + sociodemographics, names abbreviated"
# You can also access only the Description
attributes(sub_data)$Description
# [1] "Contains all sumscales + sociodemographics, names abbreviated"
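With several data frames, the same idea can be applied to the whole list at once, and the overview table can then be rebuilt from the attributes instead of being maintained by hand. The following is only a minimal sketch under that assumption: the `descriptions` vector is a name introduced here for illustration, and the data mirrors the simulated data from the question.

```r
# Simulated data, as in the question
sub_data <- data.frame(x = c(2, 3, 5, 1, 6), y = c(20, 30, 10, 302, 5))
uv_newScale <- data.frame(item1 = c(2, 3, 5, 1, 6), item2 = c(3, 5, 1, 3, 2))

# Hypothetical lookup vector: one description per data frame name
descriptions <- c(
  sub_data = "Contains all sumscales + sociodemographics, names abbreviated",
  uv_newScale = "Only sum scores for the UV Scale"
)

index_list <- list(sub_data = sub_data, uv_newScale = uv_newScale)

# Attach each description to its data frame as a "Description" attribute
index_list <- Map(function(df, desc) {
  attr(df, "Description") <- desc
  df
}, index_list, descriptions[names(index_list)])

# Rebuild the overview table from the attributes
index_df <- data.frame(
  Data.Frame = names(index_list),
  Description = unname(vapply(index_list, attr, character(1),
                              which = "Description")),
  stringsAsFactors = FALSE
)
index_df
```

This keeps a single source of truth for the descriptions, at the cost of relying on attributes, which have limitations of their own.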

However, custom attributes come with limitations. They do not persist through certain operations on an object, such as subsetting. Here is an example using the same object with the Description attribute we just added: if we subset the data, the custom attribute is lost.

sub2_data <- sub_data[,"x", drop = FALSE]
attributes(sub2_data)$Description
# NULL
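One possible workaround is to copy the attribute back onto the result after subsetting. The sketch below assumes that doing this by hand is acceptable in your workflow; `copy_description` is a hypothetical helper written for this example, not a base R function.

```r
sub_data <- data.frame(x = c(2, 3, 5, 1, 6), y = c(20, 30, 10, 302, 5))
attr(sub_data, "Description") <- "Contains all sumscales + sociodemographics, names abbreviated"

# Hypothetical helper: copy the "Description" attribute from one object to another
copy_description <- function(from, to) {
  attr(to, "Description") <- attr(from, "Description", exact = TRUE)
  to
}

# Subset first, then restore the description on the result
sub2_data <- copy_description(sub_data, sub_data[, "x", drop = FALSE])
attr(sub2_data, "Description")
```

The helper has to be called after every operation that drops the attribute, so it only helps if such operations happen in a few well-defined places.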

Alternative approach

You can reuse the idea of a container within a container. However, instead of keeping a list of data frames next to a separate index, you can store the data frames inside the index data.frame itself, as a list column. This keeps each description alongside its data and makes both easier to access and manipulate. You access an inner data frame by adding a second $.

# Assigning new column `data` to hold data frames 
index_df$data <- index_list
# We can access Description 
index_df$Description
# [1] "Contains all sumscales + sociodemographics, names abbreviated"
# [2] "Only sum scores for the UV Scale" 
# Accessing Data 
index_df$data$sub_data
# x y
#1 2 20
#2 3 30
#3 5 10
#4 1 302
#5 6 5
str(index_df)
# 'data.frame': 2 obs. of 3 variables:
# $ Data.Frame : chr "sub_data" "uv_newScale"
# $ Description: chr "Contains all sumscales + sociodemographics, names abbreviated" "Only sum scores for the UV Scale"
# $ data :List of 2
# ..$ sub_data :'data.frame': 5 obs. of 2 variables:
# .. ..$ x: num 2 3 5 1 6
# .. ..$ y: num 20 30 10 302 5
# ..$ uv_newScale:'data.frame': 5 obs. of 2 variables:
# .. ..$ item1: num 2 3 5 1 6
# .. ..$ item2: num 3 5 1 3 2
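Because the name, description and data sit in the same row, a small lookup function can return a data frame together with its description. This is a rough sketch only: `get_entry` is a hypothetical helper introduced for this example, and the index is rebuilt here so that the snippet runs on its own.

```r
sub_data <- data.frame(x = c(2, 3, 5, 1, 6), y = c(20, 30, 10, 302, 5))
uv_newScale <- data.frame(item1 = c(2, 3, 5, 1, 6), item2 = c(3, 5, 1, 3, 2))

index_df <- data.frame(
  Data.Frame = c("sub_data", "uv_newScale"),
  Description = c("Contains all sumscales + sociodemographics, names abbreviated",
                  "Only sum scores for the UV Scale"),
  stringsAsFactors = FALSE
)
# List column holding one data frame per row
index_df$data <- list(sub_data = sub_data, uv_newScale = uv_newScale)

# Hypothetical helper: look up one entry of the index by name
get_entry <- function(index, name) {
  row <- match(name, index$Data.Frame)
  if (is.na(row)) stop("No data frame named '", name, "' in the index")
  list(description = index$Description[row], data = index$data[[row]])
}

entry <- get_entry(index_df, "uv_newScale")
entry$description
head(entry$data)
```

Looking up by the Data.Frame column rather than by list name means the lookup still works if the list column happens to be unnamed.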

Efficiency

The first approach is more efficient than the second in terms of memory footprint. Here is a comparison of the object sizes in bytes.

# Creating one data frame within one data frame for comparison
list_df <- list(sub_data=sub_data)
dfs <- data.frame("Data.Frame"="sub_data",
 "Description"="Contains all sumscales + sociodemographics, names abbreviated")
dfs$data <- list_df 
# Adding attributes
attr(sub_data, "Description") <- "Contains all sumscales + sociodemographics, names abbreviated"
 
# Memory Size in bytes
object.size(dfs)
# 2376 bytes
object.size(sub_data)
# 1224 bytes
answered Apr 16, 2021 at 10:29
\$\endgroup\$
  • \$\begingroup\$ Perfect. Thanks, this was exactly what I was looking for. \$\endgroup\$ Commented Apr 20, 2021 at 7:43
