I'm pretty new to Clojure myself and haven't studied the collection algorithms very much yet, so this may not address your performance concerns, but I did find a few things that could be improved.
### Potential problem with "real" document input
As I started going through your functions and how they work together, I noticed that your logic assumes the input will always be just words separated by spaces. Perhaps your data set is already preprocessed before it reaches `termdocmatrix`? Unless that is the case, text from actual documents written by humans will contain artifacts like punctuation marks that you should probably account for.
I ran these to illustrate what happens with more "natural" document text:
(def docs-punc ["this is, a cat" "this is a dog." "woof: and a meow" "woof; woof woof! meow? meow words"]) (whitesplit docs-punc) ; => ([this is, a cat] [this is a dog.] [woof: and a meow] [woof; woof woof! meow? meow words]) (termdocmatrix docs-punc) ; => [[:cat :dog. :is :this :woof: :is, :words :meow? :woof! :and :meow :woof; :woof :a] [1 0 0 1 0 1 0 0 0 0 0 0 0 1] [0 1 1 1 0 0 0 0 0 0 0 0 0 1] [0 0 0 0 1 0 0 0 0 1 1 0 0 1] [0 0 0 0 0 0 1 1 1 0 1 1 1 0]]
As you can see, that completely messes up the results. I added a `strip-punc` function at the top (I made `punc-to-remove` its own form for readability, personal preference), along with a helper function that applies it to a vector of strings:
;; assumes [clojure.string :as str] is required in your ns form

(def punc-to-remove
  "capture pattern for the punctuation marks to strip"
  #"[.,;:!?$%&\*()]")

(defn strip-punc
  "removes punctuation matched by `punc-to-remove` from a string, replacing it with the empty string"
  [s]
  (str/replace s punc-to-remove ""))

(defn vec-strip-punc
  "applies strip-punc to a vector of strings"
  [strings]
  (map strip-punc strings))
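A quick check at the REPL, using two of the documents from the example above (note that `map` returns a lazy sequence rather than a vector, which is fine for how it is used below):

(vec-strip-punc ["this is, a cat" "this is a dog."])
; => ("this is a cat" "this is a dog")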
Then change your `bigmap` function accordingly, so it calls this before you split the strings (a fuller sketch follows after the next example):
(let [docs-no-punc (vec-strip-punc docs)
stringvecs (whitesplit docs-no-punc)] ; etc.
Or, alternatively, inline:
(let [stringvecs (whitesplit (vec-strip-punc docs))]
This will take care of pretty much all your general punctuation cases, and you can easily tweak the regex pattern as needed:
(def docs-punc ["this is, a cat%" "this $is a dog." "woof: and [a] meow*" "woof; (woof woof!) meow? meow words"]) (termdocmatrix docs-punc) ; => [[:cat :is :this :words :dog :and :meow :woof :a] [1 1 1 0 0 0 0 0 1] [0 1 1 0 1 0 0 0 1] [0 0 0 0 0 1 1 1 1] [0 0 0 1 0 0 2 3 0]]
### Naming
Your names don't follow the typical Lisp naming convention. According to Wikipedia on naming conventions (programming):

> Common practice in most Lisp dialects is to use dashes to separate words in identifiers, as in `with-open-file` and `make-hash-table`. Global variable names conventionally start and end with asterisks: `*map-walls*`. Constants names are marked by plus signs: `+map-size+`.
Also, since most (if not all) of your functions transform your data structure, I would suggest naming them in a way that reflects that. Using an acronym consistently, say `td` (or even `TD`) for term-document, would keep the names readable without being overly verbose:
termdocmatrix -> TD-matrix-from-docs
terdocmmap -> TD-map-from-docs
tdseqs -> TD-seqs-from-TD-map
I don't think `bigmap` is a good, descriptive name. What is "big" in this context? In truth, it reminds me of a Cartesian product, since each document entry in the `docs` vector returns its own map of all possible words, e.g. `{this 1, is 1, a 1, cat 1, dog 0, woof 0, and 0, meow 0, words 0}`. I would be tempted to call it something like `cartesian-product-map`, or perhaps just `cartesian-map`.

I would also suggest changing `whitesplit` to `space-split`, since that is really what it is doing (it is not splitting on other whitespace like `\r`, `\n`, `\t`). Or, if you want to make it a true whitespace split, change `#" "` to `#"\s"`, the special character class that matches all whitespace. Here is an article on RegexOne about it.
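To see the difference between the two patterns, here is what `clojure.string/split` does on a single string (I'm assuming `whitesplit` uses something like this under the hood with the `#" "` pattern):

(require '[clojure.string :as str])

(str/split "woof\tand a\nmeow" #" ")
; => ["woof\tand" "a\nmeow"]   ; only literal spaces are split on

(str/split "woof\tand a\nmeow" #"\s")
; => ["woof" "and" "a" "meow"] ; tabs and newlines are split on too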