I'm pretty new to Clojure myself and haven't studied the collection algorithms very much yet, so this may not address your performance concerns, but I did find a few things that could be improved.
### Potential problem with "real" document input
As I started going through your functions and how they work together, I noticed that your logic assumes the input will always be just words separated by spaces. Perhaps your data set is already preprocessed before it reaches `termdocmatrix`? Unless that is the case, text from actual documents written by humans will contain artifacts like punctuation marks that you should probably account for.
I ran these to illustrate what happens with more "natural" document text:
(def docs-punc ["this is, a cat" "this is a dog." "woof: and a meow" "woof; woof woof! meow? meow words"]) (whitesplit docs-punc) ; => ([this is, a cat] [this is a dog.] [woof: and a meow] [woof; woof woof! meow? meow words]) (termdocmatrix docs-punc) ; => [[:cat :dog. :is :this :woof: :is, :words :meow? :woof! :and :meow :woof; :woof :a] [1 0 0 1 0 1 0 0 0 0 0 0 0 1] [0 1 1 1 0 0 0 0 0 0 0 0 0 1] [0 0 0 0 1 0 0 0 0 1 1 0 0 1] [0 0 0 0 0 0 1 1 1 0 1 1 1 0]]
As you can see, that completely messes up the results. I added a `strip-punc` function at the top (I made `punc-to-remove` its own form for readability, personal preference), along with a helper function that applies it to a vector of strings:
;; assumes [clojure.string :as str] is required in your ns form

(def punc-to-remove
  "capture pattern for the punctuation marks to strip"
  #"[.,;:!?$%&\*()]")

(defn strip-punc
  "removes punctuation matched by `punc-to-remove` from a string, replacing it with the empty string"
  [s]
  (str/replace s punc-to-remove ""))

(defn vec-strip-punc
  "applies strip-punc to a vector of strings"
  [strings]
  (map strip-punc strings))
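A quick check at the REPL, using two of the documents from the example above (note that `map` returns a lazy sequence rather than a vector, which is fine for how it is used below):

(vec-strip-punc ["this is, a cat" "this is a dog."])
; => ("this is a cat" "this is a dog")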
Then change your `bigmap` function accordingly, so it calls this before you split the strings (a fuller sketch follows after the next example):
(let [docs-no-punc (vec-strip-punc docs)
stringvecs (whitesplit docs-no-punc)] ; etc.
Or, alternatively, inline:
(let [stringvecs (whitesplit (vec-strip-punc docs))]
This will take care of pretty much all your general punctuation cases, and you can easily tweak the regex pattern as needed:
(def docs-punc ["this is, a cat%" "this $is a dog." "woof: and [a] meow*" "woof; (woof woof!) meow? meow words"]) (termdocmatrix docs-punc) ; => [[:cat :is :this :words :dog :and :meow :woof :a] [1 1 1 0 0 0 0 0 1] [0 1 1 0 1 0 0 0 1] [0 0 0 0 0 1 1 1 1] [0 0 0 1 0 0 2 3 0]]
### Naming
Your names don't follow the typical Lisp naming convention. According to Wikipedia on naming conventions (programming):

> Common practice in most Lisp dialects is to use dashes to separate words in identifiers, as in `with-open-file` and `make-hash-table`. Global variable names conventionally start and end with asterisks: `*map-walls*`. Constants names are marked by plus signs: `+map-size+`.
Also, since most (if not all) of your functions transform your data structure, I would suggest naming them in a way that reflects that. Using an acronym consistently, say `td` (or even `TD`) for term-document, would keep the names readable without being overly verbose:
termdocmatrix -> TD-matrix-from-docs
terdocmmap -> TD-map-from-docs
tdseqs -> TD-seqs-from-TD-map
I don't think `bigmap` is a good, descriptive name. What is "big" in this context? In truth, it reminds me of a Cartesian product, since each document entry in the `docs` vector returns its own map of all possible words, e.g. `{this 1, is 1, a 1, cat 1, dog 0, woof 0, and 0, meow 0, words 0}`. I would be tempted to call it something like `cartesian-product-map`, or perhaps just `cartesian-map`.

I would also suggest changing `whitesplit` to `space-split`, since that is really what it is doing (it is not splitting on other whitespace like `\r`, `\n`, `\t`). Or, if you want to make it a true whitespace split, change `#" "` to `#"\s"`, the special character class that matches all whitespace. Here is an article on RegexOne about it.
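To see the difference between the two patterns, here is what `clojure.string/split` does on a single string (I'm assuming `whitesplit` uses something like this under the hood with the `#" "` pattern):

(require '[clojure.string :as str])

(str/split "woof\tand a\nmeow" #" ")
; => ["woof\tand" "a\nmeow"]   ; only literal spaces are split on

(str/split "woof\tand a\nmeow" #"\s")
; => ["woof" "and" "a" "meow"] ; tabs and newlines are split on too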