unpivotr

Cran Status Cran Downloads codecov R-CMD-check

unpivotr deals with non-tabular data, especially from spreadsheets. Use unpivotr when your source data has any of these ‘features’:

If that list makes your blood boil, you’ll enjoy the function names.

More positive, corrective functions:

Make cells tidy

Unpivotr uses data where each cells is represented by one row in a dataframe. Like this.

Gif of tidyxl converting cells into a tidy representation of one row per cell

What can you do with tidy cells? The best places to start are:

Otherwise the basic idea is:

  1. Read the data with a specialist tool.
    • For spreadsheets, use tidyxl.
    • For plain text files, you might soon be able to use readr, but for now you’ll have to install a pull-request on that package with devtools::install_github("tidyverse/readr#760").
    • For tables in html pages, use unpivotr::tidy_html()
    • For data frames, use unpivotr::as_cells() – this should be a last resort, because by the time the data is in a conventional data frame, it is often too late – formatting has been lost, and most data types have been coerced to strings.
  2. Either behead() straight away, else dplyr::filter() separately for the header cells and the data cells, and then recombine with enhead().
  3. spatter() so that each column has one data type.
 library(unpivotr)
 library(tidyverse)
 #> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
 #> ✔ dplyr 1.1.4 ✔ readr 2.1.5
 #> ✔ forcats 1.0.0 ✔ stringr 1.5.1
 #> ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
 #> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
 #> ✔ purrr 1.0.2 
 #> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 #> ✖ dplyr::filter() masks stats::filter()
 #> ✖ dplyr::lag() masks stats::lag()
 #> ✖ tidyr::pack() masks unpivotr::pack()
 #> ✖ tidyr::unpack() masks unpivotr::unpack()
 #> i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
x <- purpose$`up-left left-up`
x # A pivot table in a conventional data frame. Four levels of headers, in two
 #> X2 X3 X4 X5 X6 X7
 #> 1 <NA> <NA> Female <NA> Male <NA>
 #> 2 <NA> <NA> 0 - 6 7 - 10 0 - 6 7 - 10
 #> 3 Bachelor's degree 15 - 24 7000 27000 <NA> 13000
 #> 4 <NA> 25 - 44 12000 137000 9000 81000
 #> 5 <NA> 45 - 64 10000 64000 7000 66000
 #> 6 <NA> 65+ <NA> 18000 7000 17000
 #> 7 Certificate 15 - 24 29000 161000 30000 190000
 #> 8 <NA> 25 - 44 34000 179000 31000 219000
 #> 9 <NA> 45 - 64 30000 210000 23000 199000
 #> 10 <NA> 65+ 12000 77000 8000 107000
 #> 11 Diploma 15 - 24 <NA> 14000 9000 11000
 #> 12 <NA> 25 - 44 10000 66000 8000 47000
 #> 13 <NA> 45 - 64 6000 68000 5000 58000
 #> 14 <NA> 65+ 5000 41000 1000 34000
 #> 15 No Qualification 15 - 24 10000 43000 12000 37000
 #> 16 <NA> 25 - 44 11000 36000 21000 50000
 #> 17 <NA> 45 - 64 19000 91000 17000 75000
 #> 18 <NA> 65+ 16000 118000 9000 66000
 #> 19 Postgraduate qualification 15 - 24 <NA> 6000 <NA> <NA>
 #> 20 <NA> 25 - 44 5000 86000 7000 60000
 #> 21 <NA> 45 - 64 6000 55000 6000 68000
 #> 22 <NA> 65+ <NA> 13000 <NA> 18000
 # rows and two columns.
 
y <- as_cells(x) # 'Tokenize' or 'melt' the data frame into one row per cell
y
 #> # A tibble: 132 ×ばつ 4
 #> row col data_type chr 
 #> <int> <int> <chr> <chr> 
 #> 1 1 1 chr <NA> 
 #> 2 2 1 chr <NA> 
 #> 3 3 1 chr Bachelor's degree
 #> 4 4 1 chr <NA> 
 #> 5 5 1 chr <NA> 
 #> 6 6 1 chr <NA> 
 #> 7 7 1 chr Certificate 
 #> 8 8 1 chr <NA> 
 #> 9 9 1 chr <NA> 
 #> 10 10 1 chr <NA> 
 #> # i 122 more rows
 
 rectify(y) # useful for reviewing the melted form as though in a spreadsheet
 #> # A tibble: 22 ×ばつ 7
 #> `row/col` `1(A)` `2(B)` `3(C)` `4(D)` `5(E)` `6(F)`
 #> <int> <chr> <chr> <chr> <chr> <chr> <chr> 
 #> 1 1 <NA> <NA> Female <NA> Male <NA> 
 #> 2 2 <NA> <NA> 0 - 6 7 - 10 0 - 6 7 - 10
 #> 3 3 Bachelor's degree 15 - 24 7000 27000 <NA> 13000 
 #> 4 4 <NA> 25 - 44 12000 137000 9000 81000 
 #> 5 5 <NA> 45 - 64 10000 64000 7000 66000 
 #> 6 6 <NA> 65+ <NA> 18000 7000 17000 
 #> 7 7 Certificate 15 - 24 29000 161000 30000 190000
 #> 8 8 <NA> 25 - 44 34000 179000 31000 219000
 #> 9 9 <NA> 45 - 64 30000 210000 23000 199000
 #> 10 10 <NA> 65+ 12000 77000 8000 107000
 #> # i 12 more rows
 
y %>%
 behead("up-left", "sex") %>% # Strip headers
 behead("up", "life-satisfication") %>% # one
 behead("left-up", "qualification") %>% # by
 behead("left", "age-band") %>% # one.
 select(-row, -col, -data_type, count = chr) %>% # cleanup
 mutate(count = as.integer(count))
 #> # A tibble: 80 ×ばつ 5
 #> count sex `life-satisfication` qualification `age-band`
 #> <int> <chr> <chr> <chr> <chr> 
 #> 1 7000 Female 0 - 6 Bachelor's degree 15 - 24 
 #> 2 12000 Female 0 - 6 Bachelor's degree 25 - 44 
 #> 3 10000 Female 0 - 6 Bachelor's degree 45 - 64 
 #> 4 NA Female 0 - 6 Bachelor's degree 65+ 
 #> 5 27000 Female 7 - 10 Bachelor's degree 15 - 24 
 #> 6 137000 Female 7 - 10 Bachelor's degree 25 - 44 
 #> 7 64000 Female 7 - 10 Bachelor's degree 45 - 64 
 #> 8 18000 Female 7 - 10 Bachelor's degree 65+ 
 #> 9 NA Male 0 - 6 Bachelor's degree 15 - 24 
 #> 10 9000 Male 0 - 6 Bachelor's degree 25 - 44 
 #> # i 70 more rows

Note the compass directions in the code above, which hint to behead() where to find the header cell for each data cell.

Installation

 # install.packages("devtools") # If you don't already have devtools
devtools::install_github("nacnudus/unpivotr", build_vignettes = TRUE)

The version 0.4.0 release had somee breaking changes. See NEWS.md for details. The previous version can be installed as follow:

devtools::install_version("unpivotr", version = "0.3.1", repos = "http://cran.us.r-project.org")

Similar projects

unpivotr is inspired by Databaker, a collaboration between the United Kingdom Office of National Statistics and The Sensible Code Company. unpivotr.

jailbreaker attempts to extract non-tabular data from spreadsheets into tabular structures automatically via some clever algorithms. unpivotr differs by being less magic, and equipping you to express what you want to do.

AltStyle によって変換されたページ (->オリジナル) /