Below is my method to find the longest substring having k
distinct characters.
I have used Set
to keep track of distinct characters.
public static int maxLengthSubstring(String str,int k) {
int i=0;
int j=i;
int max=0;
//use set to keep track of distinct characters
Set<Integer> set=new TreeSet<>();
char arr[] = str.toCharArray();
int len = arr.length;
while(i<len && j<len){
int c=arr[j];
set.add(c);
//if the number of distinct characters exceed the given limit
//then compute the length of substring containing them
if(set.size()>k){
int substrLen=Math.abs(i-j);
if(substrLen>max){
//update the length of maximum substring found
max=substrLen;
}
//move to the next character in the array
i=i+1;
j=i;
//clear the set containing previously found distinct characters
set.clear();
continue;
}
//if set contains the exact number for distinct characters
//then store the length of the substring and move to next iteration
if(set.size()==k){
int substrLen = Math.abs(i-(j+1));
if(substrLen>max){
//update the length of maximum substring found
max=substrLen;
}
}
j++;
}
return max;
}
The maximum length of input string can be up to million characters. The above method works fine for smaller strings but slows down if input string is large. How can I tweak my logic to improve running time for large strings?
A few test cases:
Input:
str="zxzxzxzxzx cvcvcvcvcvcvcvcvcv"
k=2
Output=18
Input:
str="aaaaaaaa abcabc aaaabbbbaaaabbbbbbabaa"
k=3
Output=23 (longest substring in this case is " aaaabbbbaaaabbbbbbabaa" including space character)
-
2\$\begingroup\$ Would it be clearer to say "longest substring containing no more than k distinct characters"? \$\endgroup\$200_success– 200_success2015年07月12日 05:10:42 +00:00Commented Jul 12, 2015 at 5:10
-
\$\begingroup\$ @200_success yes, you can say. Sorry if my question was not clear. \$\endgroup\$Nishant– Nishant2015年07月12日 07:03:38 +00:00Commented Jul 12, 2015 at 7:03
-
\$\begingroup\$ This post has a detailed analysis of this question http://blog.gainlo.co/index.php/2016/04/12/find-the-longest-substring-with-k-unique-characters/ \$\endgroup\$user102691– user1026912016年04月12日 15:28:28 +00:00Commented Apr 12, 2016 at 15:28
2 Answers 2
//use set to keep track of distinct characters
Set<Integer> set=new TreeSet<>();
Use spaces, they're free. According to conventions, there should be spaces around each operator. If you want to use them more sparingly, it's your choice, but think about how they separate things. You code looks like composed of three things
Set<Integer>
set=new
TreeSet<>();
which is rather confusing.
You write "keep track of distinct characters" and declare Set<Integer>
. Why not Set<Character>
?
Don't use TreeSet
unless you need it. HashSet
is faster.
Don't call it set
, when you can call it e.g. foundCharacters
.
char arr[] = str.toCharArray();
Naming.
while(i<len && j<len){
Spacing. Always prefer "for" loops as they're much clearer. They also allow you to declare variables in a smaller scope. I'd go for
for (int i=0, j=0; i<len && j<len; ++j) ...
(add spaces to adhere to conventions), but you're modifying j
somewhere in the middle (which is unexpected with a "for" loop). So feel free to stick with "while".
int substrLen=Math.abs(i-j);
This looks like you lost track of what your variables do. I guess, j>=i
always holds, but you should know. You may want to call them start
and end
.
int substrLen = Math.abs(i-(j+1));
This makes me suspect. Why do you need a different expression for computing the same thing?
Optimizations
How can I tweak my logic to improve running time for large strings?
What the problem? You run through the string until you find out that they're too many characters. Then you throw away all information collected so far and start over at the next index. This implies a complexity of \$O(n^2)\,ドル which is way too slow for \$n=10^6\$.
When you find out that you collected too many distinct characters, you could move the start index until their number gets smaller again. For this, a Set
is insufficient, as you have to know, when you get rid of the last occurrence of a given character.
So instead of remembering what characters you ran across, you need to count their occurrences. A Map<Character, Integer>
would do. Guava Multiset
<Character>
would be even better as a multiset is just like a set but allowing duplicates.
But the best solution is an array. As there are just \2ドル^{16}\$ characters (let's ignore surrogates), you can keep track of them in new int[Character.MAX_VALUE+1]
. This gives you a nice speed bonus of maybe factor 10 (it's just a bonus, it would not help you with the quadratic complexity and it's most probably not needed with the new algorithm).
Just a snippet to get you started:
private int distinctChars;
private int[] counts = new int[Character.MAX_VALUE+1];
private void addCharacter(char c) {
if (counts[c] == 0) {
distinctChars++;
}
counts[c]++;
}
private void removeCharacter(char c) {...}
public int maximumLength(String str, int limit) {
...
for (int i=0, j=0; j<len; ) {
if (distinctChars <= limit) {
max = Math.max(max, j-i);
addCharacter(str[j++]);
} else {
removeCharacter(str[i++]);
}
}
...
}
- One improvement would be to move the index
i
to the next unique character instead of the next characteri=i+1
. There's no point in moving forward by one if the characterarr[i+1] == arr[i]
. That is, seti=m
for the firstm>i
such thata[m] != a[i]
. - Also, don't clear out
set
every time you find a match. You are recomputing information you've already computed once. Instead, just remove the oldest characterarr[i]
formset
and keep going. You'll have to stop resettingj
as well.
Bottom line, since you are looking for longest sub-string, when you find a match, just pop oldest character and start adding the newest characters without starting from scratch every time.
UPDATE
As pointed out, your set won't work for this sort of running approach. Instead of just a set you will have to keep a counter for each set element. So, you can then pop all the oldest characters until one of the element counters drops to 0.
-
\$\begingroup\$ The
set
is not enough for keeping the information the way you propose. You can't simply remove the oldest character as the string may be something likeabbbbbbabbbcbbb
. When you encounterc
, you get over the limit (assuminglimit=2
), but you can't removea
from the set as there's another one later. \$\endgroup\$maaartinus– maaartinus2015年07月12日 02:42:53 +00:00Commented Jul 12, 2015 at 2:42