This module deals with two opaque structure types, data-set and data-set-field. These are not available to clients directly, although certain accessors are exported by this module. Conceptually, a data-set is a table of data whose columns represent fields. A field is either a feature, which represents a property of an instance, or a classifier (label), which is used to train and match instances.
> (require rml/data)
> (define dataset 'csv (list
(sepal-length sepal-width petal-length petal-width)
(classification)
1
135
(Iris-versicolor Iris-virginica Iris-setosa)
In this code block a training data set is loaded and the columns within the CSV data are described.
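To make the idea of describing CSV columns as fields more concrete, the following is a minimal sketch in plain Racket. It is not the rml/data implementation and does not use its opaque types: the names `field-spec`, `parse-rows`, and `column`, as well as the three sample rows, are assumptions made purely for illustration. It only shows how a column index plus a feature/classifier tag is enough to pull typed columns out of comma-separated text.

```racket
#lang racket/base
;; Illustrative sketch only: NOT the rml/data implementation.
;; `field-spec`, `parse-rows`, and `column` are hypothetical names.
(require racket/string racket/list)

;; A field names a CSV column, records its index, and says whether it
;; is a feature or a classifier.
(struct field-spec (name index kind) #:transparent)

(define fields
  (list (field-spec "sepal-length" 0 'feature)
        (field-spec "sepal-width" 1 'feature)
        (field-spec "petal-length" 2 'feature)
        (field-spec "petal-width" 3 'feature)
        (field-spec "classification" 4 'classifier)))

;; Made-up sample rows in the shape of the Iris CSV data.
(define csv-text
  (string-append "5.1,3.5,1.4,0.2,Iris-setosa\n"
                 "7.0,3.2,4.7,1.4,Iris-versicolor\n"
                 "6.3,3.3,6.0,2.5,Iris-virginica\n"))

;; Split the text into rows of string cells (no quoting support).
(define (parse-rows text)
  (for/list ([line (in-list (string-split text "\n"))])
    (string-split line ",")))

;; Extract one column; feature cells are converted to numbers,
;; classifier cells stay as strings.
(define (column rows f)
  (for/list ([row (in-list rows)])
    (define cell (list-ref row (field-spec-index f)))
    (if (eq? (field-spec-kind f) 'feature)
        (string->number cell)
        cell)))

(define rows (parse-rows csv-text))
;; The distinct label values found in the classifier column.
(define labels (remove-duplicates (column rows (fifth fields))))
```

In this sketch, `(column rows (first fields))` yields the numeric sepal-length values, while `labels` holds the distinct classification strings, loosely analogous to the label list printed in the session above.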
predicate
(data-set-field? a) → boolean?
  a : any
predicate
(partition-id? a) → boolean?
  a : any
procedure
(load-data-set file-name format fields) → data-set?
  file-name : string?
  format : symbol?
value
supported-formats : (listof symbol?)
constructor
(make-feature name [#:index index]) → data-set-field?
  name : string?
  index : integer? = 0
constructor
(make-classifier name [#:index index]) → data-set-field?
  name : string?
  index : integer? = 0
accessor
(classifiers dataset) → (listof string?)
  dataset : data-set?
accessor
(classifier-product dataset) → (listof string?)
  dataset : data-set?
accessor
(data-count dataset) → exact-nonnegative-integer?
  dataset : data-set?
accessor
  dataset : data-set?
  partition-id : exact-nonnegative-integer?
  feature-name : string?
accessor
(partition-count dataset) → exact-nonnegative-integer?
  dataset : data-set?
  dataset : data-set?
  partition-id : exact-nonnegative-integer?
value
default-partition : exact-nonnegative-integer?
value
test-partition : exact-nonnegative-integer?
value
training-partition : exact-nonnegative-integer?
The following procedures perform transformations on one or more data-set structures and return a new data-set. These are typically concerned with partitioning a data set or optimizing the feature vectors.
procedure
  partition-count : exact-positive-integer?
procedure
If specified, the entropy-features list denotes the names of features, or classifiers, that should be randomly spread across partitions.
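The general idea of spreading data randomly across partitions can be sketched in plain Racket as follows. This is an assumption-laden illustration, not the module's algorithm: `split-for-test` and `split-equally` are hypothetical names, rows are modelled as a simple list, and whole rows are shuffled rather than individual named fields.

```racket
#lang racket/base
;; Illustrative sketch only: NOT the rml/data implementation.
;; `split-for-test` and `split-equally` are hypothetical names.
(require racket/list)

;; Shuffle the rows, then reserve `test-percentage` percent (rounded
;; down) for testing. Returns two values: training rows and test rows.
(define (split-for-test rows test-percentage)
  (define shuffled (shuffle rows))
  (define test-size (quotient (* (length rows) test-percentage) 100))
  (split-at shuffled (- (length rows) test-size)))

;; Shuffle the rows, then deal them round-robin into `partition-count`
;; lists so that partition sizes differ by at most one.
(define (split-equally rows partition-count)
  (define buckets (make-vector partition-count '()))
  (for ([row (in-list (shuffle rows))]
        [i (in-naturals)])
    (define slot (remainder i partition-count))
    (vector-set! buckets slot (cons row (vector-ref buckets slot))))
  (vector->list buckets))

;; Example: 135 row identifiers, matching the row count shown earlier.
(define sample (range 135))
(define-values (training-rows test-rows) (split-for-test sample 20))
(define parts (split-equally sample 4))
```

Shuffling before splitting is what keeps any ordering in the source file (for example, rows grouped by label) from ending up concentrated in a single partition.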
parameter
(minimum-partition-data-total) → exact-positive-integer?
  partition-data-count : exact-positive-integer? = 100
parameter
(minimum-partition-data) → exact-positive-integer?
  partition-data-count : exact-positive-integer? = 100
Loading and manipulating data sets from source files may not always be efficient, so the parsed in-memory format can be saved and loaded externally. These saved forms are termed snapshots; they are serialized forms of the data-set structure.
io
(write-snapshot dataset out) → void?
  dataset : data-set?
  out : output-port?
io
(read-snapshot dataset in) → data-set?
  dataset : data-set?
  in : input-port?
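The snapshot idea, writing a parsed structure to a port and reading it back later, can be illustrated with a small round-trip in plain Racket. This sketch does not reproduce the actual snapshot format used by this module; the prefab struct `table` and the helpers `save-table` and `restore-table` are hypothetical stand-ins chosen because prefab structs can be written with `write` and rebuilt with `read`.

```racket
#lang racket/base
;; Illustrative sketch only: NOT the rml/data snapshot format.
;; `table`, `save-table`, and `restore-table` are hypothetical names.

;; A prefab struct can be printed by `write` and rebuilt by `read`.
(struct table (names rows) #:prefab)

;; Serialize the table to an output port.
(define (save-table t out)
  (write t out))

;; Rebuild a table from an input port.
(define (restore-table in)
  (read in))

(define original
  (table '("sepal-length" "classification")
         '((5.1 "Iris-setosa") (7.0 "Iris-versicolor"))))

;; Round-trip through an in-memory port; a file port works the same way.
(define out (open-output-string))
(save-table original out)
(define restored
  (restore-table (open-input-string (get-output-string out))))
```

After the round-trip, `restored` is `equal?` to `original`, which is the property a snapshot needs in order to replace re-parsing the source file.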