I've been working on a task: I have two giant strings, both consist of the same characters just scrambled (both consist only capital english letters). The task is to find the lowest possible number of changes you can make to turn the first string in the other one, while 1 change = switching neighbour chars in the string. I found a solution that works just fine but there is a problem.
It works under 5 seconds only for input of about 100 000 char String
s. I need to make it work for up to 1000 000 char. I tried ArrayList
, LinkedList
, regular arrays, substrings and different variations of the algorithm. This one is the best so far I came up with but I'm out of ideas. Any help? Is there a faster collection I can use? Maybe the algorithm is wrong here?
"steps" int is the output:
String nazwiskoJas = fileInput.nextLine();
String nazwiskoMal = fileInput.nextLine();
ArrayList<Character> jas = new ArrayList();
ArrayList<Character> mal = new ArrayList();
for(int i=0;i<charNumber;i++) {
jas.add(nazwiskoJas.charAt(i));
mal.add(nazwiskoMal.charAt(i));
}
fileInput.close();
int steps=0;
int index=0;
while(jas.size()>1) {
if(jas.get(0)!=mal.get(index)) {
int distance = jas.indexOf(mal.get(index));
jas.remove(distance);
steps+=distance;
}else {
jas.remove(0);
}
index++;
}
System.out.println(steps);
1 Answer 1
I would suggest making jas
, but not mal
, a LinkedList
. You only perform get
operations on mal
, and get
is O(1) with an ArrayList
. Removal operations, however, can be very costly with an ArrayList
, because all the subsequent elements have to be rearranged. The worst-case-scenario is, of course, if you remove the first element of the list, which you are doing here. So jas
would benefit a lot from becoming a LinkedList
. The trick is that removing an element from a LinkedList
is only O(1) if you do it via Iterator.remove()
instead of one of the two remove
methods defined in List
. So instead of calling jas.indexOf(Object)
and then calling jas.remove(int)
, I would manually create an Iterator
by calling jas.iterator()
and use this iterator to find the first occurrence of mal.get(index)
in jas
. If you have found it, you can simply call remove()
on the iterator and remove the element from jas
in O(1).
Here is a summary of the advantages and disadvantages of ArrayList
and LinkedList
.
In fact, mal
does not need to be a List
at all. All you are doing with mal
is retrieving a character at a specific index, and you can do that with a String
too. The time complexity of String.charAt(int)
is O(1), just like that of ArrayList.get(int)
, because String
uses a byte[]
internally to store the characters (or apparently a char[]
before Java 9), so it should not be noticeably slower than ArrayList.get(int)
, and since you would not need to create the ArrayList
mal
from the String
nazwiskoMal
in the first place, you can even save a bit of performance.
Apart from that, here are some other suggestions for your code:
You don't actually need the
if
-else
construct. If you deleted theelse
clause and just unconditionally executed the code inside theif
clause, the effect would be the same, but the code would be a bit more compact.Declare the two lists as
List
s (interface), not as their implementation (ArrayList
orLinkedList
). Their usage doesn't depend on their implementation, only on their functionality. So instead, I would write:List<Character> jas = new LinkedList<>(); List<Character> mal = new ArrayList<>();
By the way, you instantiated two raw types. This is not a big deal here because you used the no-argument-constructor, which means that there is no way for the
List
to contain any elements of another type thanCharacter
, but just for the sake of clarity, I would use the diamond operator in the instantiation of the two lists.
Edit
Here is a code sample to illustrate what I mean by using an iterator. By the way, I replaced your while(jas.size()>1)
loop with a for
loop, which I think is easier to read because it limits the scope of index
to the loop itself, whereas in your code, index
unnecessarily lives on even after the loop terminates.
int steps = 0;
for (int index = 0; index < nazwiskoMal.length() - 1; index++) {
int distance = 0;
Iterator<Character> jasIterator = jas.iterator();
while (jasIterator.hasNext()) {
if (!jasIterator.next().equals(nazwiskoMal.charAt(index))) {
distance++;
} else {
steps += distance;
jasIterator.remove();
break;
}
}
}
This a bit more complicated to read than your code, because you have to manually find the element and keep track of the index (which you need to calculate the number of steps), but on the other hand, you only have to iterate over jas
once per character from nazwiskoMal
because the removal is accomplished via Iterator.remove()
, whereas your code would require two (implicit) iterations if you make jas
a LinkedList
instead of an ArrayList
: one for finding the first occurrence of the character, and a second iteration for removing it via List.remove(int)
(if jas
is an ArrayList
, then List.remove(int)
would be able to find the element in constant time without needing to iterate over the list, but the actual deletion would, as already mentioned, require all subsequent elements to be moved, which is not the case with a LinkedList
).
Update
Inspired by your idea not to reset the iterator over jas
when there are consecutive identical characters in mal
, I've tried to apply this principle whenever the next character in mal
has not yet occurred in jas
during the last iteration over jas
, and not only when the next character in mal
is identical to the last character from mal
.
The trick was making the check whether a character has already occurred in jas
cheap enough so that the savings in loop iterations are not outweighed by the overhead of the check itself. I originally tried putting the characters encountered during an iteration over jas
in a HashSet
, which would be the easiest solution, but this turned out to be far too slow to be worth it. Then I tried using a boolean
array that contains a value for every possible character that signifies whether this character has already occurred in jas
, and this did indeed speed up the program, not in a groundbreaking manner, but definitely noticeable:
public static long calculateSteps(String nazwiskoJas, String nazwiskoMal) {
List<Character> jas = new LinkedList<>();
for (int i = 0; i < nazwiskoJas.length(); i++) {
jas.add(nazwiskoJas.charAt(i));
}
char lowestChar = nazwiskoMal.charAt(0);
char highestChar = nazwiskoMal.charAt(0);
for (int i = 1; i < nazwiskoMal.length(); i++) {
lowestChar = (char) Math.min(nazwiskoMal.charAt(i), lowestChar);
highestChar = (char) Math.max(nazwiskoMal.charAt(i), highestChar);
}
ListIterator<Character> jasIterator = jas.listIterator();
boolean[] characterHasBeenEncountered = new boolean[highestChar - lowestChar + 1];
Arrays.fill(characterHasBeenEncountered, false);
long steps = 0;
for (int index = 0; index < nazwiskoMal.length() - 1; index++) {
if (characterHasBeenEncountered[nazwiskoMal.charAt(index) - lowestChar]) {
jasIterator = jas.listIterator();
Arrays.fill(characterHasBeenEncountered, false);
}
Character nextCharacterInJas;
while (!(nextCharacterInJas = jasIterator.next()).equals(nazwiskoMal.charAt(index))) {
characterHasBeenEncountered[nextCharacterInJas - lowestChar] = true;
}
jasIterator.remove();
steps += jasIterator.nextIndex();
}
return steps;
}
I tried an even hackier approach using a single int
variable as a bitmask instead of a boolean[]
in hopes that setting an int
to 0
would be faster than filling a boolean[]
of 26 elements with false
. This was indeed a bit faster than a boolean[]
, but not much, and of course, it has the drawback that it only works for up two 32 different characters (or 64 if you use a long
).
I also replaced the Iterator
with a ListIterator
, because by taking advantage of the method ListIterator.nextIndex()
, the variable distance
becomes obsolete.
Finally, you said that you need the program to work for strings of up to 1,000,000 characters, and while experimenting, I noticed that the number of steps for randomly generated and scrambled strings of length 1,000,000 containing only the characters A
to Z
were getting dangerously close to Integer.MAX_VALUE
. While the number of steps for randomly generated strings so far have not exceeded Integer.MAX_VALUE
, I constructed an extreme example where the result would indeed not fit in an int
. Suppose you have a string that starts with 38461 A's, followed by 38461 B's, then 38461 C's etc., and the scrambled version would simply be the reverse of this string, i.e. 38461 Z's, followed by 38461 Y's etc. The string would have a length of 999986, and the number of changes needed to turn one into the other would be 480,755,769,325, which is greater than Integer.MAX_VALUE
(2,147,483,647) by far. Ironically, the algorithm runs in under one second for this special case with the optimizations for repeated/already encountered characters, while without these optimizations, it seems to take forever (I've stopped the program after 40 minutes or so).
But seriously, I really doubt that it's possible to squeeze any more performance out of this algorithm, at least in Java. Maybe using a language that's not interpreted by a virtual machine but directly compiled to machine code would make the program faster, but I have no idea whether this is really true in this case. I don't know anything about this, but I've read that compilers today are so optimized that even a program in an interpreted language does not necessarily run slower than if it were written a compiled language. But this probably depends on the program, the language, the compiler and other things and cannot be generalized. Nevertheless, if the programm is still too slow, it might be worth a try to use a different language altogether, although I have no idea what language could be more suitable for this, and if it would actually make a significant difference.
-
\$\begingroup\$ first THANK YOU so much fo this in depth anserw and tips, you are amazing and it works much faster now just having changed jas to LinkedList. Im still kinda confused on the use of Iterator though. Do I have to use a loop to go through all the list objects? I used a while loop iterating steps that removes unwanted object after bumping into it. Still doesnt seem to work any faster that way? Am I missing something? \$\endgroup\$Mgols– Mgols2018年04月19日 10:22:22 +00:00Commented Apr 19, 2018 at 10:22
-
\$\begingroup\$ @MarekGołębiewski I'm glad it helped. Regarding the usage of an iterator: I've updated the answer to include a code sample to demonstrate what I mean. \$\endgroup\$Stingy– Stingy2018年04月19日 11:20:34 +00:00Commented Apr 19, 2018 at 11:20
-
\$\begingroup\$ @MarekGołębiewski By the way, I'm not sure what you meant in your last comment with your description of what you did with the iterator. Maybe you understood my suggestion correctly and it simply didn't make much of a difference in performance. I haven't tested it, so I have no idea how much effect using an iterator and removing elements via
iterator.remove()
will have on the performance, but I guess the impact will be much smaller than making ajas
aLinkedList
instead of anArrayList
. If you try it, I'd be curious to know whether it makes any significant difference. \$\endgroup\$Stingy– Stingy2018年04月19日 13:27:46 +00:00Commented Apr 19, 2018 at 13:27 -
\$\begingroup\$ thank you again very much for your effort. Turned out I implemented the iterator in simillar way you showed me in your edit. Still doesnt seem to speed things up unfortunetley. I was thinking maybe i'll make the program search for equal neighbouring characters and remove them all at one loop pass instead of repeating the process for every single one. for example if the 'mal' String is something like:"OUBGAWDDDDDAPWE" than instead of removing one 'D' from 'jas' , try to remove next 5 'D' apperances. Not sure if you can undestand what I'm trying to say and if this would speed it up. \$\endgroup\$Mgols– Mgols2018年04月19日 13:50:52 +00:00Commented Apr 19, 2018 at 13:50
-
\$\begingroup\$ @MarekGołębiewski Good idea, although I doubt that this will make much difference, because if halving the number of iterations over
jas
doesn't speed up the program, I don't think saving some additional extra iterations in the case of consecutive identical characters will make much of a difference. \$\endgroup\$Stingy– Stingy2018年04月19日 14:48:19 +00:00Commented Apr 19, 2018 at 14:48
Explore related questions
See similar questions with these tags.
ArrayList
is slow, all the elements after it are moved usually. \$\endgroup\$