Arrange entries in data.frame
Description
Analogous function for arrange in dplyr.
Usage
arrange(.data, ..., cols = NULL, order = 1L)
Arguments
.data
data.frame
...
Columns to arrange by. A minus sign means descending order.
cols
For set_arrange only.
A character vector of column names of .data by which to order.
If present, overrides .... Defaults to NULL.
order
For set_arrange only. An integer vector with only possible
values of 1 and -1, corresponding to ascending and descending order.
Defaults to 1.
Details
Once arranged, the order of entries is changed permanently (the input itself is reordered).
Value
A data.table
Examples
a = as.data.table(iris)
a %>% arrange(Sepal.Length)
a
a %>% arrange(cols = c("Sepal.Width","Petal.Length"))
a
Save a data.frame as a fst table
Description
This function first exports the data.frame to a temporary file, and then parses it back as a fst table (class name is "fst_table").
Usage
as_fst(.data)
Arguments
.data
A data.frame
Value
An object of class fst_table
Examples
iris %>%
as_fst() -> iris_fst
iris_fst
Complete a data frame with missing combinations of data
Description
Turns implicit missing values into explicit missing values.
Analogous function for complete function in tidyr.
Usage
complete(.data, ..., fill = NA)
Arguments
.data
data.frame
...
Specification of columns to expand. The selection of columns is
supported by the flexible select_dt.
To find all unique combinations of the provided columns, including those not found in the data,
supply each variable as a separate argument. The two modes (selecting the
needed columns and filling in outside values) cannot be mixed;
see the examples for details.
fill
Atomic value to fill into the missing cell, default uses NA.
Details
When the provided columns and the additional data are of different lengths, all unique combinations are returned. This operation should be used only on unique entries, and it always returns unique entries.
If you supply fill parameter, these values will also replace existing explicit missing values in the data set.
Value
data.table
Examples
df <- data.table(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
df %>% complete(item_id,item_name)
df %>% complete(item_id,item_name,fill = 0)
df %>% complete("item")
df %>% complete(item_id=1:3)
df %>% complete(item_id=1:3,group=1:2)
df %>% complete(item_id=1:3,group=1:3,item_name=c("a","b","c"))
Count observations by group
Description
Analogous function for count and add_count in dplyr.
Usage
count(.data, ..., sort = FALSE, name = "n")
add_count(.data, ..., name = "n")
Arguments
.data
data.table
...
variables to group by.
sort
logical. If TRUE, the result will be sorted in descending order by the resulting variable.
name
character. Name of resulting variable. Default uses "n".
Value
data.table
Examples
a = as.data.table(mtcars)
count(a,cyl)
count(a,cyl,sort = TRUE)
a
b = as.data.table(iris)
b %>% add_count(Species,name = "N")
b
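In data.table terms, counting by group corresponds to .N combined with by. The sketch below shows a rough equivalent of count(a, cyl, sort = TRUE) written directly in data.table. Whether count is implemented this way internally is an assumption; the names cnt and cnt_sorted are only illustrative.

```r
library(data.table)

a <- as.data.table(mtcars)

# Number of rows per value of cyl, stored in a column called "n"
cnt <- a[, .(n = .N), by = cyl]

# Sort by the count column in descending order
cnt_sorted <- cnt[order(-n)]
print(cnt_sorted)
```

Note that this form returns a new table and leaves a untouched, whereas adding the count as a column with := would modify a in place.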
Cumulative mean
Description
Returns a vector whose elements are the cumulative mean of the elements of the argument.
Usage
cummean(x)
Arguments
x
a numeric or complex object, or an object that can be coerced to one of these.
Value
A numeric vector
Examples
cummean(1:10)
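The cumulative mean at position i is the sum of the first i elements divided by i. A minimal base R sketch of that definition follows. cummean_sketch is a hypothetical name used only for illustration, and it is not necessarily how the package implements cummean.

```r
# Hypothetical helper illustrating the definition of a cumulative mean:
# running sum divided by the running number of elements.
cummean_sketch <- function(x) {
  cumsum(x) / seq_along(x)
}

res <- cummean_sketch(1:10)
print(res)
```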
Select distinct/unique rows in data.table
Description
Analogous function for distinct in dplyr
Usage
distinct(.data, ..., .keep_all = FALSE)
Arguments
.data
data.table
...
Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, will use all variables.
.keep_all
If TRUE, keep all variables in data.table. If a combination of ... is not distinct,
this keeps the first row of values.
Value
data.table
Examples
a = as.data.table(iris)
b = as.data.table(mtcars)
a %>% distinct(Species)
b %>% distinct(cyl,vs,.keep_all = TRUE)
Drop or delete data by rows or columns
Description
drop_na drops entries by specified columns.
delete_na deletes rows or columns with too many NAs.
Usage
drop_na(.data, ...)
delete_na(.data, MARGIN, n)
Arguments
.data
A data.table
...
Columns to be dropped or deleted.
MARGIN
1 or 2. 1 for deleting rows, 2 for deleting columns.
n
If the number (or proportion) of NAs is larger than or equal to "n", the columns/rows are deleted. When "n" is smaller than 1, it is used as a proportion. When "n" is larger than or equal to 1, it is used as a count.
Value
A data.table
Examples
x = data.table(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5),z = rep(NA,4))
x
x %>% delete_na(2,0.75)
x = data.table(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5),z = rep(NA,4))
x %>% delete_na(2,0.5)
x = data.table(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5),z = rep(NA,4))
x %>% delete_na(2,0.24)
x = data.table(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5),z = rep(NA,4))
x %>% delete_na(2,2)
x = data.table(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5),z = rep(NA,4))
x %>% delete_na(1,0.6)
x = data.table(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5),z = rep(NA,4))
x %>% delete_na(1,2)
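The threshold rule for n can be read as follows: values below 1 are proportions of the row or column length, and values of 1 or more are absolute counts. The base R sketch below applies that reading to the columns of the example data. too_many_na is a hypothetical helper, not a function of the package, and the rule it encodes is an interpretation of the argument description above.

```r
# Hypothetical helper: TRUE where a row/column has "too many" NAs.
# n < 1 is read as a proportion of `total`; n >= 1 as an absolute count.
too_many_na <- function(na_count, total, n) {
  threshold <- if (n < 1) n * total else n
  na_count >= threshold
}

x <- data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5), z = rep(NA, 4))
na_per_col <- colSums(is.na(x))  # NA counts per column: x = 1, y = 2, z = 4

# Columns that would survive with n = 0.75 (threshold 3 NAs) and n = 2
keep_075 <- names(x)[!too_many_na(na_per_col, nrow(x), 0.75)]
keep_2   <- names(x)[!too_many_na(na_per_col, nrow(x), 2)]
print(keep_075)
print(keep_2)
```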
Fast creation of dummy variables
Description
Quickly create dummy (binary) columns from character and factor type columns in the inputted data (and numeric columns if specified.) This function is useful for statistical analysis when you want binary columns rather than character columns.
Usage
dummy(.data, ..., longname = TRUE)
Arguments
.data
data.frame
...
Columns you want to create dummy variables from. Very flexible; see the examples.
longname
logical. Should the output columns be labeled with the
original column name? Default uses TRUE.
Details
If no columns provided, will return the original data frame.
This function is inspired by fastDummies package, but provides
simple and precise usage, whereas fastDummies::dummy_cols provides more
features for statistical usage.
Value
data.table
Examples
iris = as.data.table(iris)
iris %>% dummy(Species)
iris %>% dummy(Species,longname = FALSE)
mtcars = as.data.table(mtcars)
mtcars %>% head() %>% dummy(vs,am)
mtcars %>% head() %>% dummy("cyl|gear")
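Conceptually, a dummy column is a 0/1 indicator of whether each entry equals one level of the original column. The base R sketch below builds such indicators for a small factor. The "species_" prefix mimics what longname = TRUE is described to do, but the exact naming used by dummy is an assumption here.

```r
species <- factor(c("setosa", "virginica", "setosa"))

# One integer indicator column per level of the factor
dummies <- sapply(levels(species), function(l) as.integer(species == l))

# Prefix the level with the (assumed) original column name
colnames(dummies) <- paste("species", levels(species), sep = "_")
print(dummies)
```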
Read and write fst files
Description
Wrappers for read_fst and write_fst
from fst, but with different defaults. For data import, always return a data.table.
For data export, always compress the data to the smallest size.
Usage
export_fst(x, path, compress = 100, uniform_encoding = TRUE)
import_fst(
path,
columns = NULL,
from = 1,
to = NULL,
as.data.table = TRUE,
old_format = FALSE
)
Arguments
x
a data frame to write to disk
path
path to fst file
compress
value in the range 0 to 100, indicating the amount of compression to use. Lower values mean larger file sizes. The default here is 100 (maximum compression), whereas fst::write_fst defaults to 50.
uniform_encoding
If 'TRUE', all character vectors will be assumed to have elements with equal encoding. The encoding (latin1, UTF8 or native) of the first non-NA element will be used as encoding for the whole column. This will be a correct assumption for most use cases. If 'uniform_encoding' is set to 'FALSE', no such assumption will be made and all elements will be converted to the same encoding. The latter is a relatively expensive operation and will reduce write performance for character columns.
columns
Column names to read. The default is to read all columns.
from
Read data starting from this row number.
to
Read data up until this row number. The default is to read to the last row of the stored dataset.
as.data.table
If TRUE, the result will be returned as a data.table object. Any keys set on
dataset x before writing will be retained. This allows for storage of sorted datasets. This option
requires data.table package to be installed.
old_format
must be FALSE, the old fst file format is deprecated and can only be read and converted with fst package versions 0.8.0 to 0.8.10.
Value
'import_fst' returns a data.table with the selected columns and rows. 'export_fst' writes 'x' to a 'fst' file and invisibly returns 'x' (so you can use this function in a pipeline).
Examples
export_fst(iris,"iris_fst_test.fst")
iris_dt = import_fst("iris_fst_test.fst")
iris_dt
unlink("iris_fst_test.fst")
Fill in missing values with previous or next value
Description
Fills missing values in selected columns using the next or previous entry.
Usage
fill(.data, ..., direction = "down")
shift_fill(x, direction = "down")
Arguments
.data
A data.table
...
A selection of columns.
direction
Direction in which to fill missing values. Currently either "down" (the default) or "up".
x
A vector.
Details
fill fills the columns of a data.table;
shift_fill fills any vector.
Value
A filled data.table
Examples
df <- data.table(Month = 1:12, Year = c(2000, rep(NA, 10),2001))
df
df %>% fill(Year)
df <- data.table(Month = 1:12, Year = c(2000, rep(NA, 10),2001))
df %>% fill(Year,direction = "up")
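Filling "down" carries the last non-missing value forward, and filling "up" carries the next non-missing value backward. The base R sketch below shows both on a plain vector. fill_down and fill_up are hypothetical helpers written for illustration and are not the package's shift_fill.

```r
# Hypothetical helper: carry the last non-missing value forward.
fill_down <- function(x) {
  seen <- cumsum(!is.na(x))          # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]      # index 1 (NA) is used for leading NAs
}

# Filling up is filling down on the reversed vector.
fill_up <- function(x) rev(fill_down(rev(x)))

year <- c(2000, rep(NA, 10), 2001)
down <- fill_down(year)
up <- fill_up(year)
print(down)
print(up)
```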
Filter entries in data.frame
Description
Analogous function for filter in dplyr.
Usage
filter(.data, ...)
Arguments
.data
data.frame
...
Logical conditions defined in terms of the variables in .data. See examples.
Details
Currently, data.table is not able to delete rows by reference; see References.
Value
A data.table
References
https://github.com/Rdatatable/data.table/issues/635
https://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-data-table
Examples
iris = as.data.table(iris)
iris %>% filter(Sepal.Length > 7)
iris %>% filter(Sepal.Length > 7,Sepal.Width > 3)
iris %>% filter(Sepal.Length > 7 & Sepal.Width > 3)
iris %>% filter(Sepal.Length == max(Sepal.Length))
Parse, inspect and extract data.table from fst file
Description
An API for reading fst file as data.table.
Usage
parse_fst(path)
slice_fst(ft, row_no)
select_fst(ft, ...)
filter_fst(ft, ...)
summary_fst(ft)
Arguments
path
path to fst file
ft
An object of class fst_table, returned by parse_fst
row_no
An integer vector (Positive)
...
For select_fst: the columns to select. For filter_fst: the filter conditions.
Details
summary_fst provides some basic information about
the fst table.
Value
parse_fst returns an object of class fst_table.
select_fst and filter_fst return a data.table.
Examples
# write the file first
path = tempfile(fileext = ".fst")
fst::write_fst(iris,path)
# parse the file but not reading it
parse_fst(path) -> ft
ft
class(ft)
lapply(ft,class)
names(ft)
dim(ft)
summary_fst(ft)
# get the data by query
ft %>% slice_fst(1:3)
ft %>% slice_fst(c(1,3))
ft %>% select_fst(Sepal.Length)
ft %>% select_fst(Sepal.Length,Sepal.Width)
ft %>% select_fst("Sepal.Length")
ft %>% select_fst(1:3)
ft %>% select_fst(1,3)
ft %>% select_fst("Se")
# return a warning with message
ft %>% select_fst("nothing")
ft %>% select_fst("Se|Sp")
ft %>% select_fst(cols = names(iris)[2:3])
ft %>% filter_fst(Sepal.Width > 3)
ft %>% filter_fst(Sepal.Length > 6 , Species == "virginica")
ft %>% filter_fst(Sepal.Length > 6 & Species == "virginica" & Sepal.Width < 3)
Group by one or more variables
Description
Most data operations are done on groups defined by variables.
group_by will group the data.table by selected variables (setting
them as keys), and arrange them in ascending order.
group_exe does computations by group; it receives an object
returned by group_by.
Usage
group_by(.data, ...)
group_exe(.data, ...)
groups(x)
ungroup(x)
Arguments
.data
A data.table
...
For group_by: Variables to group by.
For group_exe: Any data manipulation arguments that
could be implemented on a data.table.
x
A data.table
Details
For mutate and summarise, it is recommended to
use the built-in "by" parameter, which is faster. Once the data.table is
grouped, the order is changed permanently.
groups() returns a character vector of the specified groups.
ungroup() deletes the keys in the data.table.
Value
A data.table with keys
Examples
a = as.data.table(iris)
a
a %>%
group_by(Species) %>%
group_exe(
head(3)
)
groups(a)
ungroup(a)
groups(a)
Join tables
Description
The mutating joins add columns from 'y' to 'x', matching rows based on the keys:
* 'inner_join()': includes all rows in 'x' and 'y'.
* 'left_join()': includes all rows in 'x'.
* 'right_join()': includes all rows in 'y'.
* 'full_join()': includes all rows in 'x' or 'y'.
Filtering joins filter rows from 'x' based on the presence or absence of matches in 'y':
* 'semi_join()' returns all rows from 'x' with a match in 'y'.
* 'anti_join()' returns all rows from 'x' without a match in 'y'.
Usage
inner_join(x, y, by = NULL, on = NULL)
left_join(x, y, by = NULL, on = NULL)
right_join(x, y, by = NULL, on = NULL)
full_join(x, y, by = NULL, on = NULL)
anti_join(x, y, by = NULL, on = NULL)
semi_join(x, y, by = NULL, on = NULL)
Arguments
x
A data.table
y
A data.table
by
(Optional) A character vector of variables to join by.
If 'NULL', the default, '*_join()' will perform a natural join, using all variables in common across 'x' and 'y'. A message lists the variables so that you can check they're correct; suppress the message by supplying 'by' explicitly.
To join by different variables on 'x' and 'y', use a named vector. For example, 'by = c("a" = "b")' will match 'x$a' to 'y$b'.
To join by multiple variables, use a vector with length > 1. For example, 'by = c("a", "b")' will match 'x$a' to 'y$a' and 'x$b' to 'y$b'. Use a named vector to match different variables in 'x' and 'y'. For example, 'by = c("a" = "b", "c" = "d")' will match 'x$a' to 'y$b' and 'x$c' to 'y$d'.
on
(Optional)
Indicate which columns in x should be joined with which columns in y.
Examples included:
1. .by = c("a","b") (this is a must for set_full_join);
2. .by = c(x1="y1", x2="y2");
3. .by = c("x1==y1", "x2==y2");
4. .by = c("a", V2="b");
5. .by = .(a, b);
6. .by = c("x>=a", "y<=b") or .by = .(x>=a, y<=b).
Value
A data.table
Examples
workers = fread("
name company
Nick Acme
John Ajax
Daniela Ajax
")
positions = fread("
name position
John designer
Daniela engineer
Cathie manager
")
workers %>% inner_join(positions)
workers %>% left_join(positions)
workers %>% right_join(positions)
workers %>% full_join(positions)
# filtering joins
workers %>% anti_join(positions)
workers %>% semi_join(positions)
# To suppress the message, supply 'by' argument
workers %>% left_join(positions, by = "name")
# Use a named 'by' if the join variables have different names
positions2 = setNames(positions, c("worker", "position")) # rename first column in 'positions'
workers %>% inner_join(positions2, by = c("name" = "worker"))
# the syntax of 'on' could be a bit different
workers %>% inner_join(positions2,on = "name==worker")
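For readers who know base R, the six join types map onto merge() and %in% as sketched below, using the same workers and positions data. This only illustrates which rows each join keeps. It is not how the package implements the joins, and merge() sorts the result by the key, so row order may differ.

```r
workers <- data.frame(
  name = c("Nick", "John", "Daniela"),
  company = c("Acme", "Ajax", "Ajax")
)
positions <- data.frame(
  name = c("John", "Daniela", "Cathie"),
  position = c("designer", "engineer", "manager")
)

# Mutating joins: which unmatched rows are kept is set by all.x / all.y
inner <- merge(workers, positions, by = "name")
left  <- merge(workers, positions, by = "name", all.x = TRUE)
right <- merge(workers, positions, by = "name", all.y = TRUE)
full  <- merge(workers, positions, by = "name", all = TRUE)

# Filtering joins: keep or drop rows of x, never add columns from y
semi <- workers[workers$name %in% positions$name, ]
anti <- workers[!workers$name %in% positions$name, ]

print(sapply(list(inner = inner, left = left, right = right,
                  full = full, semi = semi, anti = anti), nrow))
```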
Fast lead/lag for vectors
Description
Analogous function for lead and lag in dplyr by
wrapping data.table's shift.
Usage
lead(x, n = 1L, fill = NA)
lag(x, n = 1L, fill = NA)
Arguments
x
A vector
n
a positive integer of length 1, giving the number of positions to lead or lag by. Default uses 1
fill
Value to use for padding when the window goes beyond the input length.
Default uses NA
Value
A vector
Examples
lead(1:5)
lag(1:5)
lead(1:5,2)
lead(1:5,n = 2,fill = 0)
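Leading drops the first n values and pads the end; lagging drops the last n values and pads the start. The base R sketch below spells this out. lead_sketch and lag_sketch are hypothetical helpers (the package itself wraps data.table's shift), and they assume 1 <= n < length(x).

```r
# Hypothetical helpers, assuming 1 <= n < length(x).
lead_sketch <- function(x, n = 1L, fill = NA) {
  c(tail(x, -n), rep(fill, n))   # drop first n values, pad at the end
}
lag_sketch <- function(x, n = 1L, fill = NA) {
  c(rep(fill, n), head(x, -n))   # pad at the start, drop last n values
}

lead_1 <- lead_sketch(1:5)
lag_1  <- lag_sketch(1:5)
lead_2 <- lead_sketch(1:5, n = 2, fill = 0)
print(lead_1)
print(lag_1)
print(lead_2)
```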
Pivot data between long and wide
Description
Fast table pivoting from long to wide and from wide to long.
These functions are supported by dcast.data.table and melt.data.table
from data.table.
Usage
longer(.data, ..., name = "name", value = "value", na.rm = FALSE)
wider(.data, ..., name, value = NULL, fun = NULL, fill = NA)
Arguments
.data
A data.table
...
Columns for unchanged group. Flexible, see examples.
name
Name for the measured variable names column.
value
Name for the data values column(s).
na.rm
If TRUE, NA values will be removed from the molten data.
fun
Should the data be aggregated before casting?
Defaults to NULL, which uses length for aggregation.
If a function is provided, the data will be aggregated by this function.
fill
Value with which to fill missing cells. Default uses NA.
Value
A data.table
Examples
stocks <- data.table(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
stocks %>% longer(time)
stocks %>% longer(-(2:4)) # same
stocks %>% longer(-"X|Y|Z") # same
long_stocks = longer(stocks,"ti") # same as above except for assignment
long_stocks %>% wider(time,name = "name",value = "value")
# the unchanged group could be missed if all the rest will be used
long_stocks %>% wider(name = "name",value = "value")
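Since these functions are described as being supported by melt.data.table and dcast.data.table, the round trip can also be written directly in data.table. The sketch below uses a small deterministic table instead of random numbers. How longer and wider map their arguments onto melt and dcast internally is an assumption.

```r
library(data.table)

stocks <- data.table(
  time = as.Date("2009-01-01") + 0:2,
  X = c(1, 2, 3),
  Y = c(4, 5, 6)
)

# Wide to long: keep "time" as the unchanged group
long <- melt(stocks, id.vars = "time",
             variable.name = "name", value.name = "value")

# Long to wide: one column per distinct entry of "name"
wide <- dcast(long, time ~ name, value.var = "value")
print(long)
print(wide)
```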
Conversion between tidy table and named matrix
Description
Convenient functions to implement conversion between tidy table and named matrix.
Usage
mat_df(m)
df_mat(df, row, col, value)
Arguments
m
A matrix
df
A data.frame with at least 3 columns, one for row name, one for column name, and one for values. The names for column and row should be unique.
row
Unquoted expression of column name for row
col
Unquoted expression of column name for column
value
Unquoted expression of column name for values
Value
For mat_df, a data.frame.
For df_mat, a named matrix.
Examples
mm = matrix(c(1:8,NA),ncol = 3,dimnames = list(letters[1:3],LETTERS[1:3]))
mm
tdf = mat_df(mm)
tdf
mat = df_mat(tdf,row,col,value)
setequal(mm,mat)
tdf %>%
setNames(c("A","B","C")) %>%
df_mat(A,B,C)
Create or transform variables
Description
mutate() adds new variables and preserves existing ones;
transmute() adds new variables and drops existing ones.
Both functions preserve the number of rows of the input.
New variables overwrite existing variables of the same name.
mutate_when integrates mutate and case_when
in dplyr and makes a new tidy verb for data.table. mutate_vars is
a super function to do updates in specific columns according to conditions.
If you mutate a data.table, it is changed permanently.
No copies are made, which is efficient, but should be used with caution.
If you still want to keep the original data.table, use
copy first.
Usage
mutate(.data, ..., by)
transmute(.data, ..., by)
mutate_when(.data, when, ..., by)
mutate_vars(.data, .cols = NULL, .func, ..., by)
Arguments
.data
A data.table
...
Name-value pairs of expressions
by
(Optional) Mutate by what group?
when
An object which can be coerced to logical mode
.cols
Any types that can be accepted by select_dt .
.func
Function to be run within each column, should return a value or vectors with same length.
Value
A data.table
Examples
# Newly created variables are available immediately
a = as.data.table(mtcars)
copy(a) %>% mutate(cyl2 = cyl * 2)
a
# change forever
a %>% mutate(cyl2 = cyl * 2)
a
# You can also use mutate() to remove variables and
# modify existing variables
a %>% mutate(
mpg = NULL,
disp = disp * 0.0163871 # convert to litres
)
a %>% transmute(cyl,one = 1)
a
iris[3:8,] %>%
as.data.table() %>%
mutate_when(Petal.Width == .2,
one = 1,Sepal.Length=2)
iris[3:8,] %>%
as.data.table() %>%
mutate_vars("Pe",scale)
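The "no copies made" behaviour comes from data.table's update-by-reference semantics. The sketch below uses data.table's := operator directly to show the difference between updating a copy and updating the original. That mutate relies on the same mechanism is an assumption based on the description above.

```r
library(data.table)

a <- data.table(cyl = c(4, 6, 8))

# Work on an explicit copy: the original stays as it was
b <- copy(a)
b[, cyl2 := cyl * 2]
a_changed_by_copy <- "cyl2" %in% names(a)

# Update the original by reference: no assignment needed, a itself changes
a[, cyl2 := cyl * 2]
a_changed_in_place <- "cyl2" %in% names(a)

print(c(a_changed_by_copy, a_changed_in_place))
```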
Nest and unnest
Description
Analogous function for nest and unnest in tidyr.
unnest will automatically remove other list-columns except for the
target list-columns (which would be unnested later). Also, squeeze is
designed to merge multiple columns into a list column.
Usage
nest(.data, ..., mcols = NULL, .name = "ndt")
unnest(.data, ...)
squeeze(.data, ..., .name = "ndt")
chop(.data, ...)
unchop(.data, ...)
Arguments
.data
data.table, nested or unnested
...
The variables for nest group (for nest),
columns to be nested (for squeeze and chop),
or column(s) to be unnested (for unnest).
Could receive anything that select_dt could receive.
mcols
Name-variable pairs in the list, form like
list(petal="^Pe",sepal="^Se"), see example.
.name
Character. The nested column name. Defaults to "ndt".
Details
In nest, the data are nested into a column named "ndt" by default,
which is short for nested data.table.
squeeze does not remove the original columns.
unchop is the reverse operation of chop.
These functions are still experimental, especially
unnest. If they do not work in some circumstances, try the tidyr
package.
Value
data.table, nested or unnested
References
https://www.r-bloggers.com/much-faster-unnesting-with-data-table/
https://stackoverflow.com/questions/25430986/create-nested-data-tables-by-collapsing-rows-into-new-data-tables
Examples
mtcars = as.data.table(mtcars)
iris = as.data.table(iris)
# examples for nest
# nest by which columns?
mtcars %>% nest(cyl)
mtcars %>% nest("cyl")
mtcars %>% nest(cyl,vs)
mtcars %>% nest(vs:am)
mtcars %>% nest("cyl|vs")
mtcars %>% nest(c("cyl","vs"))
# nest two columns directly
iris %>% nest(mcols = list(petal="^Pe",sepal="^Se"))
# nest more flexibly
iris %>% nest(mcols = list(ndt1 = 1:3,
ndt2 = "Pe",
ndt3 = Sepal.Length:Sepal.Width))
# examples for unnest
# unnest which column?
mtcars %>% nest("cyl|vs") %>%
unnest(ndt)
mtcars %>% nest("cyl|vs") %>%
unnest("ndt")
df <- data.table(
a = list(c("a", "b"), "c"),
b = list(c(TRUE,TRUE),FALSE),
c = list(3,c(1,2)),
d = c(11, 22)
)
df
df %>% unnest(a)
df %>% unnest(2)
df %>% unnest("c")
df %>% unnest(cols = names(df)[3])
# You can unnest multiple columns simultaneously
df %>% unnest(1:3)
df %>% unnest(a,b,c)
df %>% unnest("a|b|c")
# examples for squeeze
# nest which columns?
iris %>% squeeze(1:2)
iris %>% squeeze("Se")
iris %>% squeeze(Sepal.Length:Petal.Width)
# examples for chop
df <- data.table(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
df %>% chop(y,z)
df %>% chop(y,z) %>% unchop(y,z)
Extract the nth value from a vector
Description
Get the value from a vector with its position.
Usage
nth(v, n = 1)
Arguments
v
A vector
n
A single integer specifying the position. Default uses 1.
Negative integers index from the end
(i.e. -1L will return the last value in the vector).
If a double is supplied, it will be silently truncated.
Value
A single value.
Examples
x = 1:10
nth(x, 1)
nth(x, 5)
nth(x, -2)
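The indexing rule for n (truncate doubles, count negative positions from the end) can be sketched in base R as follows. nth_sketch is a hypothetical helper that mirrors the argument description; it is not the package's implementation and does not handle n = 0 or positions outside the vector.

```r
# Hypothetical helper mirroring the documented rule for n.
nth_sketch <- function(v, n = 1) {
  n <- trunc(n)                           # doubles are truncated
  if (n < 0) n <- length(v) + n + 1       # -1 is the last element
  v[[n]]
}

x <- 1:10
first <- nth_sketch(x, 1)
second_last <- nth_sketch(x, -2)
truncated <- nth_sketch(x, 2.9)
print(c(first, second_last, truncated))
```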
Nice printing of the space allocated for an object
Description
Provides an estimate of the memory that is being used to store an R object. A wrapper of 'object.size', but uses a nicer printing unit.
Usage
object_size(object)
Arguments
object
an R object.
Value
An object of class "object_size"
Examples
iris %>% object_size()
Pull out a single variable
Description
Analogous function for pull in dplyr
Usage
pull(.data, col)
Arguments
.data
data.frame
col
A name of column or index (should be positive).
Value
A vector
Examples
mtcars %>% pull(2)
mtcars %>% pull(cyl)
mtcars %>% pull("cyl")
Convenient file reader
Description
A wrapper of fread in data.table.
Highlighting the encoding.
Usage
read_csv(path, utf8 = FALSE, ...)
Arguments
path
File name in working directory, path to file.
utf8
Should "UTF-8" be used as the encoding? (Defaults to FALSE)
...
Other parameters passed to data.table::fread.
Value
A data.table
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- data.table
as.data.table, CJ, copy, data.table, fcoalesce, fread, fwrite, rbindlist, rleid, rleidv, setDT, setnames, tables, transpose, uniqueN
- stringr
Change column order
Description
Use 'relocate()' to change column positions, using the same syntax as 'select()'. Analogous to 'relocate()' in dplyr.
Usage
relocate(.data, ..., how = "first", where = NULL)
Arguments
.data
A data.table
...
Columns to move
how
The mode of movement, including "first","last","after","before". Default uses "first".
where
Destination of columns selected by ....
Applicable for "after" and "before" mode.
Details
Once you relocate the columns, the order is changed permanently.
Value
A data.table with rearranged columns.
Examples
df <- data.table(a = 1, b = 1, c = 1, d = "a", e = "a", f = "a")
df
df %>% relocate(f)
df %>% relocate(a,how = "last")
df %>% relocate(is.character)
df %>% relocate(is.numeric, how = "last")
df %>% relocate("[aeiou]")
df %>% relocate(a, how = "after",where = f)
df %>% relocate(f, how = "before",where = a)
df %>% relocate(f, how = "before",where = c)
df %>% relocate(f, how = "after",where = c)
df2 <- data.table(a = 1, b = "a", c = 1, d = "a")
df2 %>% relocate(is.numeric,
how = "after",
where = is.character)
df2 %>% relocate(is.numeric,
how="before",
where = is.character)
Fast value replacement in data frame
Description
replace_vars replaces any value(s), or values
that match specific patterns, with another specific value in a data.table.
Usage
replace_vars(.data, ..., from = is.na, to)
Arguments
.data
A data.table
...
Columns to be replaced. If not specified, use all columns.
from
A value, a vector of values or a function that returns a logical value.
Defaults to is.na.
to
A value.
Value
A data.table.
Examples
iris %>% as.data.table() %>%
mutate(Species = as.character(Species))-> new_iris
new_iris %>%
replace_vars(Species, from = "setosa",to = "SS")
new_iris %>%
replace_vars(Species,from = c("setosa","virginica"),to = "sv")
new_iris %>%
replace_vars(Petal.Width, from = .2,to = 2)
new_iris %>%
replace_vars(from = .2,to = NA)
new_iris %>%
replace_vars(is.numeric, from = function(x) x > 3, to = 9999 )
Computation by rows
Description
Compute on a data frame a row-at-a-time. This is most useful when a vectorised function doesn't exist. Only mutate and summarise are supported so far.
Usage
rowwise_mutate(.data, ...)
rowwise_summarise(.data, ...)
Arguments
.data
A data.table
...
Name-value pairs of expressions
Value
A data.table
Examples
# without rowwise
df <- data.table(x = 1:2, y = 3:4, z = 4:5)
df %>% mutate(m = mean(c(x, y, z)))
# with rowwise
df <- data.table(x = 1:2, y = 3:4, z = 4:5)
df %>% rowwise_mutate(m = mean(c(x, y, z)))
# rowwise is also useful when doing simulations
params = fread(" sim n mean sd
1 1 1 1
2 2 2 4
3 3 -1 2")
params %>%
rowwise_summarise(sim,z = rnorm(n,mean,sd))
Select/rename variables by name
Description
Choose or rename variables from a data.table.
select() keeps only the variables you mention;
rename() keeps all variables.
Usage
select(.data, ...)
select_vars(.data, ..., rm.dup = TRUE)
select_dt(.data, ..., cols = NULL, negate = FALSE)
select_mix(.data, ..., rm.dup = TRUE)
rename(.data, ...)
Arguments
.data
A data.table
...
One or more unquoted expressions separated by commas.
Very flexible, same as tidyfst::select_dt and tidyfst::select_mix.
For details, see select_dt.
rm.dup
Should duplicated columns be removed? Defaults to TRUE.
cols
(Optional) A numeric or character vector.
negate
Applicable when regular expression and "cols" is used.
If TRUE, return the non-matched pattern. Default uses FALSE.
Details
No copy is made. Once you select or rename a data.table,
it is changed permanently. select_vars can select across
different data types, names and indexes. See examples.
select_dt and select_mix are the safe mode of
select and select_vars: they keep the original copy but
are not memory-efficient when dealing with large data sets.
Value
A data.table
Examples
a = as.data.table(iris)
a %>% select(1:3)
a
a = as.data.table(iris)
a %>% select_vars(is.factor,"Se")
a
a = as.data.table(iris)
a %>% select("Se") %>%
rename(sl = Sepal.Length,
sw = Sepal.Width)
a
DT = data.table(a=1:2,b=3:4,c=5:6)
DT
DT %>% rename(B=b)
Separate a character column into two columns using a regular expression separator
Description
Given a regular expression as separator,
separate() turns a single character column into two columns.
Analogous to tidyr::separate, but splits into two columns only.
Usage
separate(.data, separated_colname, into, sep = "[^[:alnum:]]+", remove = TRUE)
Arguments
.data
A data frame.
separated_colname
Column name, string only.
into
Character vector of length 2.
sep
Separator between columns.
remove
If TRUE, remove input column from output data frame.
Value
A data.table
Examples
df <- data.table(x = c(NA, "a.b", "a.d", "b.c"))
df %>% separate(x, c("A", "B"))
# equals to
df <- data.table(x = c(NA, "a.b", "a.d", "b.c"))
df %>% separate("x", c("A", "B"))
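The default separator "[^[:alnum:]]+" matches any run of non-alphanumeric characters, which is why "a.b" splits into "a" and "b". The base R sketch below applies the same pattern with strsplit to show the two resulting columns. It is only an illustration of the regular expression, not the package's implementation.

```r
x <- c(NA, "a.b", "a.d", "b.c")

# Split on runs of non-alphanumeric characters; NA stays NA
parts <- strsplit(x, "[^[:alnum:]]+")

# Take the first and second piece of every split
A <- vapply(parts, function(p) p[1], character(1))
B <- vapply(parts, function(p) p[2], character(1))
print(data.frame(A = A, B = B))
```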
Subset rows using their positions
Description
'slice()' lets you index rows by their (integer) locations. It allows you to select, remove, and duplicate rows. It is accompanied by a number of helpers for common use cases:
* 'slice_head()' and 'slice_tail()' select the first or last rows.
* 'slice_sample()' randomly selects rows.
* 'slice_min()' and 'slice_max()' select rows with the lowest or highest values of a variable.
Usage
slice(.data, ...)
slice_head(.data, n)
slice_tail(.data, n)
slice_max(.data, order_by, n, with_ties = TRUE)
slice_min(.data, order_by, n, with_ties = TRUE)
slice_sample(.data, n, replace = FALSE)
Arguments
.data
A data.table
...
Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative.
n
When larger than or equal to 1, the number of rows. When between 0 and 1, the proportion of rows to select.
order_by
Variable or function of variables to order by.
with_ties
Should ties be kept together? The default, 'TRUE', may return more rows than you request. Use 'FALSE' to ignore ties, and return the first 'n' rows.
replace
Should sampling be performed with ('TRUE') or without ('FALSE', the default) replacement.
Value
A data.table
Examples
a = as.data.table(iris)
slice(a,1,2)
slice(a,2:3)
slice_head(a,5)
slice_head(a,0.1)
slice_tail(a,5)
slice_tail(a,0.1)
slice_max(a,Sepal.Length,10)
slice_max(a,Sepal.Length,10,with_ties = FALSE)
slice_min(a,Sepal.Length,10)
slice_min(a,Sepal.Length,10,with_ties = FALSE)
slice_sample(a,10)
slice_sample(a,0.1)
Summarise columns to single values
Description
Create one or more scalar variables summarizing the variables of an existing data.table.
Usage
summarise(.data, ..., by = NULL)
summarise_when(.data, when, ..., by = NULL)
summarise_vars(.data, .cols = NULL, .func, ..., by)
Arguments
.data
A data.table
...
List of variables or name-value pairs of summary/modifications
functions for summarise_dt. Additional parameters to be passed to
parameter '.func' in summarise_vars.
by
Unquoted name of grouping variable or list of unquoted names of grouping variables. For details see data.table
when
An object which can be coerced to logical mode
.cols
Columns to be summarised.
.func
Function to be run within each column, should return a value or vectors with same length.
Value
A data.table
Examples
a = as.data.table(iris)
a %>% summarise(sum = sum(Sepal.Length),avg = mean(Sepal.Length))
a %>%
summarise_when(Sepal.Length > 5, avg = mean(Sepal.Length), by = Species)
a %>%
summarise_vars(is.numeric, min, by = Species)
Convenient print of time taken
Description
Convenient printing of time elapsed. A wrapper of
data.table::timetaken, but showing the results more directly.
Usage
sys_time_print(expr)
Arguments
expr
Valid R expression to be timed.
Value
A character vector of the form HH:MM:SS, or SS.MMMsec if under 60 seconds. See examples.
Examples
sys_time_print(Sys.sleep(1))
a = as.data.table(iris)
sys_time_print({
res = a %>%
mutate(one = 1)
})
res
"Uncount" a data frame
Description
Performs the opposite operation to 'dplyr::count()', duplicating rows according to a weighting variable (or expression). Analogous to 'tidyr::uncount'.
Usage
uncount(.data, wt, .remove = TRUE)
Arguments
.data
A data.frame
wt
A vector of weights.
.remove
Should the column for weights be removed?
Default uses TRUE.
Value
A data.table
Examples
df <- data.table(x = c("a", "b"), n = c(1, 2))
uncount(df, n)
uncount(df,n,FALSE)
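Uncounting amounts to repeating each row index as many times as its weight. A minimal base R sketch of the idea, using the same data, is given below. It is not the package's implementation, and expanded is only an illustrative name.

```r
df <- data.frame(x = c("a", "b"), n = c(1, 2))

# Repeat row i exactly df$n[i] times, then drop the weight column
row_index <- rep(seq_len(nrow(df)), times = df$n)
expanded <- df[row_index, "x", drop = FALSE]
print(expanded)
```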
Unite multiple columns into one by pasting strings together
Description
Convenience function to paste together multiple columns into one.
Analogous to tidyr::unite.
Usage
unite(.data, united_colname, ..., sep = "_", remove = FALSE, na2char = FALSE)
Arguments
.data
A data frame.
united_colname
The name of the new column, string only.
...
A selection of columns. To select all columns, pass "" to the parameter. See example.
sep
Separator to use between values.
remove
If TRUE, remove input columns from output data frame.
na2char
If FALSE, missing values would be merged into NA,
otherwise NA is treated as character "NA". This is different from
tidyr.
Value
A data.table
Examples
df <- CJ(x = c("a", NA), y = c("b", NA))
df
# Treat missing value as NA, default
df %>% unite("z", x:y, remove = FALSE)
# Treat missing value as character "NA"
df %>% unite("z", x:y, na2char = TRUE, remove = FALSE)
# the unite has memory, "z" would not be removed in new operations
# here we remove the original columns ("x" and "y")
df %>% unite("xy", x:y,remove = TRUE)
# Select all columns
iris %>% as.data.table %>% unite("merged_name",".")
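The two settings of na2char can be pictured in base R as follows. With plain paste(), NA is converted to the string "NA"; the alternative is to return NA whenever an input is missing. This sketch reflects one reading of the argument description (any missing input yields NA when na2char = FALSE), which is an assumption about the package's behaviour.

```r
x <- c("a", "a", NA, NA)
y <- c("b", NA, "b", NA)

# Like na2char = TRUE: NA is treated as the character "NA"
as_char <- paste(x, y, sep = "_")

# Like na2char = FALSE (assumed): any missing input gives NA
as_na <- ifelse(is.na(x) | is.na(y), NA_character_, as_char)

print(as_char)
print(as_na)
```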
Use UTF-8 for character encoding in a data frame
Description
fread from data.table may not recognize the encoding
and return text in the correct form, which can be inconvenient for text mining tasks.
utf8_encoding uses "UTF-8" as the encoding, overriding the current
encoding of characters in a data frame.
Usage
utf8_encoding(.data, .cols)
Arguments
.data
A data.frame.
.cols
The columns you want to convert, usually a character column.
Value
A data.table with characters in UTF-8 encoding
Examples
iris %>%
as.data.table() %>%
utf8_encoding(Species) # could also use `is.factor`