I have an interesting question about string and delimiters.
There is a random string str
.
There are some symbols (OR SEQUENCES OF SYMBOLS) that are predefined and should be temporarily replaced within that string by replace
function.
The operation should be like the javascript replace
- Step1:
var str1 = str.replace(/abc/g, 'newchar1').replace(/,/g, 'newchar2');
// which replaces abc
occurencies and commas with newchar1
and newchar2
respectively.
- Step2:
After some computations I do the "reverse" replacements by doing
str1.replace(/newchar1/g, 'abc').replace(/newchar2/g, ',');
and expect to have the same number of abc and commas and in the same places as before any changes.
As you understand there are problems with this method:
newchar1 may exist in the string before. and that's a problem.
That way I need to create a random hash for each delimiter (let's say 4-character random string) and check first it is not a part of an original string.
Moreover, the hashes for each delimiter should differ from each other.
Moreover, the hashes should not be the part of each other. Otherwise see what happens, if ("aa", "ac", "cd') - 3 generated hashes for ("abc", ",", "W") respectively. Good enough? Well.. Let's see... See what bad happens with this
str
string:aabcW,kkk ====step1====> aaacdackkk ====step2====> abc,d,kkk.
Oh, unexpected result - different number of commas, lost info.
Do you see the problem?
So, what's the algorithm to properly generate, say, five hashes to replace five substrings within a string and making the reverse operation as described above without losing any info?
Any algorithm in javascript but fast would be fine. And I will provide my solution in a while. I guess there may be better ideas.
3 Answers 3
So, here is my answer for What's the algorithm to properly generate let's call five hashes to replace five substrings within a string and making the reverse operation as described above without losing any info?
Let's give a try...
Create the shortest "aaaa...a" string that doesn't occur within an original str
string. Let it be "aaaaaa" (six "a").
Then add any symbol. Let it be "1". The first "hash" for the string is ready: "aaaaaa1aaaaaa".
The second would be "bb1bb" (or "bb2bb" or whatever in the middle) if no bb mets in the string.
The third would be "ccc1ccc" (or "ccc9ccc" or whatever in the middle) if no ccc mets in the string (otherwise, if ccc mets in the string, then check for occurence of four cccc and use "cccc9cccc").
The fours would be "ddddddddd1ddddddddd"
And the fifth would be "e1e" (if no "e" characters met in the str
at all).
Possible questions/comments for my algorithm above:
Why do I need "aaaaaa1aaaaaa" but not "aaaaaa"? that's because "a" may be met right before the first "delimiter". I don't want to grab upon the reverse step the first a but not last from "aaaaaa". That's why I suggest "aaaaaa1aaaaaa".
To make these hashes shorter, two letters (not one) may be used to create hashes: let's say any combination of (a, b) letters for the first "hash", (c, d) for the second hash and so on. The hashes expected to take less bytes at average.
Do you like my algorithm or I'm missing something? And it should be pretty fast in JavaScript to generate.
Any comments are highly appreciated. Thank you.
I would suggest a different approach. There are two Unicode characters in the Basic Multilingual Plane (BMP), U+FFFE and U+FFFF, which are reserved, do not represent actual characters, and which Unicode says are intended for process-internal uses.
Since no valid Unicode string can contain them, you can remove any that are present in your input, use them (or combinations of them) as replacements, and then substitute the original strings for them at the end without concern that there are valid occurrences of them in the string.
There are other Unicode noncharacters that could also be used. See the Unicode Private-Use Characters, Noncharacters & Sentinels FAQ for more information. Note that if you do use noncharacters you must only use them internally; they must be stripped out of the text before you save it, send it on, or otherwise pass it to someone else.
With utf-16 you have 1,112,064 characters. So any string shorter than a million characters will give you over a hundred thousand possible replacement characters.
In general I don't think it is possible to make unique placeholders simply by adding character to them. There is always be a string which when converted gives an ambiguous result.
-
Good point. Thank you. However (yes, just the theory), what if the string is even rather longer? Will this algorithm work faster than mine for long strings?Haradzieniec– Haradzieniec2017年02月24日 23:13:29 +00:00Commented Feb 24, 2017 at 23:13
-
i dont believe your algorithm works ',1,' -> 'a1a1a1a'Ewan– Ewan2017年02月25日 09:37:40 +00:00Commented Feb 25, 2017 at 9:37
-
I understand what you mean but nope, it works: 'a1a1a1a'.replace(/a1a/g,",") (replace all occurrences of "a1a" with "," in JavaScript). The result will be EXACTLY the original str ",1,".Haradzieniec– Haradzieniec2017年02月25日 11:50:29 +00:00Commented Feb 25, 2017 at 11:50
-
Any examples when my algorithm does NOT work are appreciated with no jokes.Haradzieniec– Haradzieniec2017年02月25日 11:52:38 +00:00Commented Feb 25, 2017 at 11:52
== ==
, then you can just use sequential numbers to denote escaped value, stored likevar map = { "1": ''abc", "2": "," ...}
. Then you can find all placeholders with simple regex==\d+==
to build the "new original" string using the storedmap
. Obviously, it assumes that "original" string should have no==\d+==
sequences of its own - if it has, then you would have to escape them regardless of the chosen approach.aabb,data1,abc
that is turned according to your algorithm intoaabbaaa1aaadata1aaa1aaaccc1ccc
and your replacement code replacesbb
witha1
? You will haveaaa1aaa1aaadata1aaa1aaaccc1ccc
that will be reversed to,1aaadata1,abc
- obviously incorrect result. And what about the==
delimiters you mentioned? Also, what are you trying to accomplish with that "replacement" code? Perhaps there is a much simpler solution to your actual problem.