algorithm for perfect substring temporarily replacement

Question 1

I have an interesting question about string and delimiters.

There is a random string str.

There are some symbols (OR SEQUENCES OF SYMBOLS) that are predefined and should be temporarily replaced within that string by replace function.

The operation should be like the javascript replace

Step1:

var str1 = str.replace(/abc/g, 'newchar1').replace(/,/g, 'newchar2'); // which replaces abcoccurencies and commas with newchar1 and newchar2 respectively.

Step2:

After some computations I do the "reverse" replacements by doing

str1.replace(/newchar1/g, 'abc').replace(/newchar2/g, ',');

and expect to have the same number of abc and commas and in the same places as before any changes.

As you understand there are problems with this method:

newchar1 may exist in the string before. and that's a problem.

That way I need to create a random hash for each delimiter (let's say 4-character random string) and check first it is not a part of an original string.
Moreover, the hashes for each delimiter should differ from each other.
Moreover, the hashes should not be the part of each other. Otherwise see what happens, if ("aa", "ac", "cd') - 3 generated hashes for ("abc", ",", "W") respectively. Good enough? Well.. Let's see... See what bad happens with this str string: aabcW,kkk ====step1====> aaacdackkk ====step2====> abc,d,kkk. Oh, unexpected result - different number of commas, lost info.

Do you see the problem?

So, what's the algorithm to properly generate, say, five hashes to replace five substrings within a string and making the reverse operation as described above without losing any info?

Any algorithm in javascript but fast would be fine. And I will provide my solution in a while. I guess there may be better ideas.

Question 2

Why not just persist the original string in order to reverse the operation?

Question 3

Some changes, replacements, updates are actually expected in the original string between Step1 and Step2. I decided it's beside the question description. That's why I didn't say about that. That replecement should not / won't touch the hashes (== delimiters == that special substrings).

Question 4

How can you be sure that "replacement" won't accidentally create the same sequence like you have generated? And if the replacement code won't touch anything inside == ==, then you can just use sequential numbers to denote escaped value, stored like var map = { "1": ''abc", "2": "," ...}. Then you can find all placeholders with simple regex ==\d+== to build the "new original" string using the stored map. Obviously, it assumes that "original" string should have no ==\d+== sequences of its own - if it has, then you would have to escape them regardless of the chosen approach.

Question 5

How can you be sure that "replacement" won't accidentally create the same sequence like you have generated? - Do you mean why I'm sure I don't get collisions by choosing replacements by the described algorythm? Thr anseer is: if aaa is not met in the string, then aaa1aaa is the replacement for the first symbol to be replaced multiple times. No more than two "a" from any side from "aaa1aaa" won't mess it because we will replace "aaa1aaa" back. It's not met in the string unless it was a replacement.

Question 6

@Haradzieniec I mean, what if you have a string aabb,data1,abc that is turned according to your algorithm into aabbaaa1aaadata1aaa1aaaccc1ccc and your replacement code replaces bb with a1? You will have aaa1aaa1aaadata1aaa1aaaccc1ccc that will be reversed to ,1aaadata1,abc - obviously incorrect result. And what about the == delimiters you mentioned? Also, what are you trying to accomplish with that "replacement" code? Perhaps there is a much simpler solution to your actual problem.

Question 7

So, here is my answer for What's the algorithm to properly generate let's call five hashes to replace five substrings within a string and making the reverse operation as described above without losing any info?

Let's give a try...

Create the shortest "aaaa...a" string that doesn't occur within an original str string. Let it be "aaaaaa" (six "a").

Then add any symbol. Let it be "1". The first "hash" for the string is ready: "aaaaaa1aaaaaa".

The second would be "bb1bb" (or "bb2bb" or whatever in the middle) if no bb mets in the string.

The third would be "ccc1ccc" (or "ccc9ccc" or whatever in the middle) if no ccc mets in the string (otherwise, if ccc mets in the string, then check for occurence of four cccc and use "cccc9cccc").

The fours would be "ddddddddd1ddddddddd"

And the fifth would be "e1e" (if no "e" characters met in the str at all).

Possible questions/comments for my algorithm above:

Why do I need "aaaaaa1aaaaaa" but not "aaaaaa"? that's because "a" may be met right before the first "delimiter". I don't want to grab upon the reverse step the first a but not last from "aaaaaa". That's why I suggest "aaaaaa1aaaaaa".
To make these hashes shorter, two letters (not one) may be used to create hashes: let's say any combination of (a, b) letters for the first "hash", (c, d) for the second hash and so on. The hashes expected to take less bytes at average.

Do you like my algorithm or I'm missing something? And it should be pretty fast in JavaScript to generate.

Any comments are highly appreciated. Thank you.

Question 8

I would suggest a different approach. There are two Unicode characters in the Basic Multilingual Plane (BMP), U+FFFE and U+FFFF, which are reserved, do not represent actual characters, and which Unicode says are intended for process-internal uses.

Since no valid Unicode string can contain them, you can remove any that are present in your input, use them (or combinations of them) as replacements, and then substitute the original strings for them at the end without concern that there are valid occurrences of them in the string.

There are other Unicode noncharacters that could also be used. See the Unicode Private-Use Characters, Noncharacters & Sentinels FAQ for more information. Note that if you do use noncharacters you must only use them internally; they must be stripped out of the text before you save it, send it on, or otherwise pass it to someone else.

Question 9

With utf-16 you have 1,112,064 characters. So any string shorter than a million characters will give you over a hundred thousand possible replacement characters.

In general I don't think it is possible to make unique placeholders simply by adding character to them. There is always be a string which when converted gives an ambiguous result.

Question 10

Good point. Thank you. However (yes, just the theory), what if the string is even rather longer? Will this algorithm work faster than mine for long strings?

Question 11

i dont believe your algorithm works ',1,' -> 'a1a1a1a'

Question 12

I understand what you mean but nope, it works: 'a1a1a1a'.replace(/a1a/g,",") (replace all occurrences of "a1a" with "," in JavaScript). The result will be EXACTLY the original str ",1,".

Question 13

Any examples when my algorithm does NOT work are appreciated with no jokes.

Haradzieniec Haradzieniec 1195 bronze badges · Answer 1 · 2017-02-24 15:07:56Z

So, here is my answer for What's the algorithm to properly generate let's call five hashes to replace five substrings within a string and making the reverse operation as described above without losing any info?

Let's give a try...

Create the shortest "aaaa...a" string that doesn't occur within an original str string. Let it be "aaaaaa" (six "a").

Then add any symbol. Let it be "1". The first "hash" for the string is ready: "aaaaaa1aaaaaa".

The second would be "bb1bb" (or "bb2bb" or whatever in the middle) if no bb mets in the string.

The third would be "ccc1ccc" (or "ccc9ccc" or whatever in the middle) if no ccc mets in the string (otherwise, if ccc mets in the string, then check for occurence of four cccc and use "cccc9cccc").

The fours would be "ddddddddd1ddddddddd"

And the fifth would be "e1e" (if no "e" characters met in the str at all).

Possible questions/comments for my algorithm above:

Why do I need "aaaaaa1aaaaaa" but not "aaaaaa"? that's because "a" may be met right before the first "delimiter". I don't want to grab upon the reverse step the first a but not last from "aaaaaa". That's why I suggest "aaaaaa1aaaaaa".
To make these hashes shorter, two letters (not one) may be used to create hashes: let's say any combination of (a, b) letters for the first "hash", (c, d) for the second hash and so on. The hashes expected to take less bytes at average.

Do you like my algorithm or I'm missing something? And it should be pretty fast in JavaScript to generate.

Any comments are highly appreciated. Thank you.

David Conrad David Conrad 8649 silver badges10 bronze badges · Answer 2 · 2017-02-25 18:12:38Z

I would suggest a different approach. There are two Unicode characters in the Basic Multilingual Plane (BMP), U+FFFE and U+FFFF, which are reserved, do not represent actual characters, and which Unicode says are intended for process-internal uses.

Since no valid Unicode string can contain them, you can remove any that are present in your input, use them (or combinations of them) as replacements, and then substitute the original strings for them at the end without concern that there are valid occurrences of them in the string.

There are other Unicode noncharacters that could also be used. See the Unicode Private-Use Characters, Noncharacters & Sentinels FAQ for more information. Note that if you do use noncharacters you must only use them internally; they must be stripped out of the text before you save it, send it on, or otherwise pass it to someone else.

Ewan Ewan 83.9k5 gold badges90 silver badges187 bronze badges · Answer 3 · 2017-02-24 16:51:36Z

0

With utf-16 you have 1,112,064 characters. So any string shorter than a million characters will give you over a hundred thousand possible replacement characters.

In general I don't think it is possible to make unique placeholders simply by adding character to them. There is always be a string which when converted gives an ambiguous result.

Share

Improve this answer

edited Feb 25, 2017 at 9:47

answered Feb 24, 2017 at 16:51

Ewan's user avatar

Ewan Ewan

83.9k5 gold badges90 silver badges187 bronze badges

4

Good point. Thank you. However (yes, just the theory), what if the string is even rather longer? Will this algorithm work faster than mine for long strings?

Haradzieniec
– Haradzieniec

2017年02月24日 23:13:29 +00:00
Commented Feb 24, 2017 at 23:13
i dont believe your algorithm works ',1,' -> 'a1a1a1a'

Ewan
– Ewan

2017年02月25日 09:37:40 +00:00
Commented Feb 25, 2017 at 9:37
I understand what you mean but nope, it works: 'a1a1a1a'.replace(/a1a/g,",") (replace all occurrences of "a1a" with "," in JavaScript). The result will be EXACTLY the original str ",1,".

Haradzieniec
– Haradzieniec

2017年02月25日 11:50:29 +00:00
Commented Feb 25, 2017 at 11:50
Any examples when my algorithm does NOT work are appreciated with no jokes.

Haradzieniec
– Haradzieniec

2017年02月25日 11:52:38 +00:00
Commented Feb 25, 2017 at 11:52

Add a comment |

Stack Exchange Network

algorithm for perfect substring temporarily replacement

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

algorithm for perfect substring temporarily replacement

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions