While fiddling around, I've made a very simple and naive function to retrieve any repeated sub-string within a certain string.
function getRepetitions(str, length){
if(isNaN(length))
{
throw new TypeError('length must be a number');
}
if(length < 1 || length > 2147483647)
{
throw new RangeError('length must be between 1 and 2147483647');
}
var copy = str.toString();
if(length > (copy.length / 2))
{
length = Math.ceil(copy.length / 2);
}
var x = {};
for(var i = str.length - length - 1; i > 1; i--)
{
piece = copy.substr(i, length);
if(!x[piece] && copy.indexOf(piece) !== i)
{
x[piece] = copy.split(piece).length - 1;
}
}
return x;
}
It accepts anything that can be converted into a string, and expects the length of the substring to be found.
The length must be withing a certain range and must be a valid number. It will be converted into a 32-bit signed integer later on.
This will be part of a larger project, and just want to make sure it is as polished as it can be.
Performance:
Performance is quite satisfactory. It takes around 2.5s to search 9-bytes long repetitions, on a 8.58MB string, on my machine, using Google Chrome v47.0.2526.106.
Using IE11, it takes over twice the time (around 5.6s)!
Screenshots: Google Chrome, IE11.
The time it takes is always around the same (with very small variations), which is a good thing.
Running this code on a string that is 10% the size will take 10% of the time. That is, for a string that is 0.86MB it will take around 250ms.
This is the code used to test:
var str = Array(1000001).join('123456789');
console.time('x');
getRepetitions(str, 9);
console.timeEnd('x');
console.log((str.length / 1024 / 1024).toFixed(2) + 'MB');
Note: There are a few edge cases, such as (very rarely) substrings like prototype
and __proto__
going untouched. I am aware of those and I have no idea how to avoid it.
In terms of performance, readability, boundary checking and reliability, what can be improved?
2 Answers 2
Styling and readability
You are using a variable length
. It is unclear what this variable is, or does, without looking at your explanation in the question. It appears that length
is a number which represents the length of the substrings you want to find. I would recommend changing the name of this variable to better represent what it is.
You are using a variable str
. When reading this, I expect this variable to be a string, but from your description this function accepts "anything that can be converted into a string". To avoid confusion I would recommend renaming this variable to something that better represents what it contains, such as source
.
You are using a variable x
which does not describe what it contains.
You check the validity of length
, but you do not do this for str
. You can check the validity of str
with str.toString !== undefined
. Instead of a cryptic TypeError (variable.toString is not a function
) you can give a better error description.
Javascript processes variable declarations in a context before executing the code. This process is called variable hoisting. This means that when you declare variables in the middle of a function, they are moved to the top. To avoid confusion why variables are defined before you declare them, or why local variables are used when you expected global variables to be used, I would recommend declaring all variables at the beginning of the context.
You are dropping the variable piece
into the global scope by forgetting to declaring it with var
. This is another reason to declare them at the top!
Logic errors
You have an if-statement in your code that seems to serve no purpose other than causing the result to be incorrect. If the substring you want to find is larger or equal to the length of the data passed, you cannot find repeated substrings. You should remove this if-clause.
if(length > (copy.length / 2))
{
length = Math.ceil(copy.length / 2);
}
You use the .toString()
method on a variable of an unknown type. The result is a string. At a later stage, you request the length of str
, which is still of an unknown type. This causes the code to be incorrect when str
is not actually a string. This error is likely caused by using a variable name that does not represent the type, which is a good argument why you should use a better name for that variable.
var copy = str.toString();
//...
for(var i = str.length - length - 1; i > 1; i--)
If your string contains reserved words with length length
, your code will fail as you experienced. You can work around this by prepending a character such as #
in keys. You'll need to remove that character again when you use the key. As far as I am aware, no reserved properties beginning with a #
exist in javascript/ES.
Your implementation has two off-by-one errors in the following line. With a length length
and a string s
, you start your loop at length + 1
before the end of the string, causing you to ignore the last character in s
. Your code fails to produce the right result for the string "test.........test"
, with a substring length of 4. You end your loop with i > 1
, and this will cause the code to produce the incorrect result when the substring starting at 0 and the substring starting at 1 are the same. Your code will fail for the string "11111..........."
. You can safely ignore the substring starting at index 0, because indexOf
will always be 0.
for(var i = str.length - length - 1; i > 1; i--)
Note: You use !x[piece]
, which would fail if it contained a falsy value. Since it contains a value that is guaranteed >= 1
this is not an issue in this case.
Note: Your algorithm will find repeating substrings even if the substrings overlap. I am assuming this is intended behaviour. When substrings overlap copy.split(piece).length - 1
will not be equal to the amount of times the substring appears in the string.
Performance
Your performance tests give a skewed result. The data you passed is highly repetitive, causing more than 99% of your code to skip the indexOf
test. If indexOf
was implemented in a very naive way, the worst case would need to do (n * n) / 2
comparisons. indexOf
seems to be implemented slightly better, but only slightly.
var str = Array(10000000).join("1");
var lookup1 = "2222222222222222";
var lookup2 = "1111111111111112";
console.time( "Lookup 1" );
str.indexOf( lookup1 );
console.timeEnd( "Lookup 1" );
console.time( "Lookup 2" );
str.indexOf( lookup2 );
console.timeEnd( "Lookup 2" );
console.log( "Some lookups are handled quite fast" );
If I execute your code on 10 random string of 100.000 alphanumeric characters, finding substrings of 9 characters, it costs about 2.959 ms per string to find all repetitions. Executing your code on 10 random strings of 200.000 alphanumeric characters, finding substrings of 9 characters, it costs about 11.815 ms. This is significantly worse than "twice as long".
Your code can be easily improved by using hashmaps. Browser vendors use hashmaps to store properties of objects1 . Using an object to store found substrings will cost more memory, but will improve performance significantly. For 10 random strings of 100.000 alphanumeric characters, finding substrings of 9 characters, it costs about 126 ms per string to find all repetitions. For 10 random strings of 200.000 alphanumeric characters, finding substrings of 9 characters, it costs about 313 ms per string to find all repetitions. For 10 random strings of 1.000.000 alphanumeric characters, finding substrings of 9 characters, it costs about 1.659 ms per string to find all repetitions.
Testing on your fixed strings will make the improved versions perform worse, because two lookups have to be done instead of one. The difference for random strings is between seconds and several hours.
function getRepetitions(source, substringLength) {
var copy;
var piece;
var output;
var i;
if (isNaN(length)) {
throw new TypeError('length must be a number');
}
if (substringLength < 1 || substringLength > 2147483647) {
throw new RangeError('length must be between 1 and 2147483647');
}
if (source.toString === undefined) {
throw new TypeError('must be able to convert source to string via toString method');
}
copy = source.toString();
output = {};
for (i = copy.length - substringLength; i > 0; i--) {
piece = copy.substr(i, substringLength);
if (!output[piece] && copy.indexOf(piece) !== i) {
output[piece] = copy.split(piece).length - 1;
}
}
return output;
}
function getRepetitions2(source, substringLength) {
var copy;
var piece;
var output;
var i;
var foundSubstrings;
if (isNaN(length)) {
throw new TypeError('length must be a number');
}
if (substringLength < 1 || substringLength > 2147483647) {
throw new RangeError('length must be between 1 and 2147483647');
}
if (source.toString === undefined) {
throw new TypeError('must be able to convert source to string via toString method');
}
copy = source.toString();
output = {};
foundSubstrings = {};
for (i = copy.length - substringLength; i >= 0; i--) {
piece = copy.substr(i, substringLength);
if (!foundSubstrings[piece]) {
foundSubstrings[piece] = true;
} else if (!output[piece]) {
output[piece] = copy.split(piece).length - 1;
}
}
return output;
}
function getRepetitions2WithoutKeywords(source, substringLength) {
var copy;
var piece;
var output;
var i;
var foundSubstrings;
var key;
if (isNaN(length)) {
throw new TypeError('length must be a number');
}
if (substringLength < 1 || substringLength > 2147483647) {
throw new RangeError('length must be between 1 and 2147483647');
}
if (source.toString === undefined) {
throw new TypeError('must be able to convert source to string via toString method');
}
copy = source.toString();
output = {};
foundSubstrings = {};
for (i = copy.length - substringLength; i >= 0; i--) {
piece = copy.substr(i, substringLength);
key = "#" + piece;
if (!foundSubstrings[key]) {
foundSubstrings[key] = true;
} else if (!output[key]) {
output[key] = copy.split(piece).length - 1;
}
}
return output;
}
function test() {
//https://stackoverflow.com/a/1349426/2209007
var possibleChars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ';
var str = '';
var nrOfTests = 10;
var strLength = 100000;
var i;
var testNr;
var substrLength = 9;
console.log('== Testing random strings ==');
for (testNr = 1; testNr <= nrOfTests; testNr++) {
str = '';
console.time('Test ' + testNr + ': Generating string');
for (i = 0; i < strLength; i++) {
str += possibleChars.charAt(Math.floor(Math.random() * possibleChars.length));
}
console.timeEnd('Test ' + testNr + ': Generating string');
console.log('Test ' + testNr + ': Working with string "' + str.substring(0, 100) + '...\"');
console.time('Test ' + testNr + ': Default repetitions');
getRepetitions(str, substrLength);
console.timeEnd('Test ' + testNr + ': Default repetitions');
console.time('Test ' + testNr + ': Improved repetitions');
getRepetitions2(str, substrLength);
console.timeEnd('Test ' + testNr + ': Improved repetitions');
console.time('Test ' + testNr + ': Improved repetitions without keywords');
getRepetitions2WithoutKeywords(str, substrLength);
console.timeEnd('Test ' + testNr + ': Improved repetitions without keywords');
}
console.log('== Testing fixed string ==');
str = Array(1000001).join("123456789");
console.log('Test ' + testNr + ': Working with string \"' + str.substring(0, 100) + '...\"');
console.time('Test ' + testNr + ': Default repetitions');
getRepetitions(str, 9);
console.timeEnd('Test ' + testNr + ': Default repetitions');
console.time('Test ' + testNr + ': Improved repetitions');
getRepetitions2(str, 9);
console.timeEnd('Test ' + testNr + ': Improved repetitions');
console.time('Test ' + testNr + ': Improved repetitions without keywords');
getRepetitions2WithoutKeywords(str, 9);
console.timeEnd('Test ' + testNr + ': Improved repetitions without keywords');
}
test();
-
\$\begingroup\$ I won't extend too much on my comment, for now, since time is a bit short. But, well done! You captured a bunch of mistakes I would never catch. (That is the effect of not sleeping for 2 days and writting code.) \$\endgroup\$Ismael Miguel– Ismael Miguel2015年12月21日 14:38:55 +00:00Commented Dec 21, 2015 at 14:38
I called your function in the following way:
var result = getRepetitions('abcabcabc', 3);
Previously I added this console.log:
piece = copy.substr(i, length);
console.log(piece);
The result is:
cab
bca
abc
cab
Object { cab=2, bca=2, abc=3}
Then I modified this line:
for(var i = str.length - length - 1; i > 1; i--)
I don't understand the - 1. so I use this:
for(var i = str.length - length; i > 1; i--)
The result:
abc
cab
bca
abc
cab
Object { abc=3, cab=2, bca=2}
Why do you don't start where you can have your first possible combination?
-
\$\begingroup\$ That point has already been addressed on Sumurai8's answer. He talks about 2 off-by-one errors. That is one of the off-by-one mistakes. \$\endgroup\$Ismael Miguel– Ismael Miguel2015年12月23日 14:05:19 +00:00Commented Dec 23, 2015 at 14:05
Explore related questions
See similar questions with these tags.
length
tosubstringLength
. \$\endgroup\$length = Math.ceil(copy.length / 2);
– for string like"aaaaaa"
of length 6 there exist a repeated substring of length 5 (although the two substrings overlap). \$\endgroup\$str
, that is an arrayf[]
of pointers tostr[0]
,str[1]
etc.; then you sort thef
array alphabetically. Once sorted, you scan it for all strings that have sameN
first characters. Since all such strings groups occupy consecutive places, the scan is linear. However I have no idea whether/how it can be implemented in javascript (without makingstr.length
copies of suffixes ofstr
). :( \$\endgroup\$