Finding longest substring containing k distinct characters

Question 1

Below is my method to find the longest substring having k distinct characters. I have used Set to keep track of distinct characters.

public static int maxLengthSubstring(String str,int k) {
 int i=0;
 int j=i;
 int max=0;
 //use set to keep track of distinct characters
 Set<Integer> set=new TreeSet<>();
 char arr[] = str.toCharArray();
 int len = arr.length;
 while(i<len && j<len){
 int c=arr[j];
 set.add(c);
 //if the number of distinct characters exceed the given limit 
 //then compute the length of substring containing them
 if(set.size()>k){ 
 int substrLen=Math.abs(i-j);
 if(substrLen>max){
 //update the length of maximum substring found 
 max=substrLen;
 }
 //move to the next character in the array
 i=i+1;
 j=i;
 //clear the set containing previously found distinct characters 
 set.clear();
 continue;
 }
 //if set contains the exact number for distinct characters 
 //then store the length of the substring and move to next iteration
 if(set.size()==k){
 int substrLen = Math.abs(i-(j+1));
 if(substrLen>max){
 //update the length of maximum substring found 
 max=substrLen;
 }
 }
 j++;
 }
 return max;
}

The maximum length of input string can be up to million characters. The above method works fine for smaller strings but slows down if input string is large. How can I tweak my logic to improve running time for large strings?

A few test cases:

Input:

str="zxzxzxzxzx cvcvcvcvcvcvcvcvcv"
k=2
Output=18

Input:

str="aaaaaaaa abcabc aaaabbbbaaaabbbbbbabaa"
k=3
Output=23 (longest substring in this case is " aaaabbbbaaaabbbbbbabaa" including space character)

Question 2

Would it be clearer to say "longest substring containing no more than k distinct characters"?

Question 3

@200_success yes, you can say. Sorry if my question was not clear.

Question 4

This post has a detailed analysis of this question http://blog.gainlo.co/index.php/2016/04/12/find-the-longest-substring-with-k-unique-characters/

Question 5

//use set to keep track of distinct characters
Set<Integer> set=new TreeSet<>();

Use spaces, they're free. According to conventions, there should be spaces around each operator. If you want to use them more sparingly, it's your choice, but think about how they separate things. You code looks like composed of three things

Set<Integer> set=new TreeSet<>();

which is rather confusing.

You write "keep track of distinct characters" and declare Set<Integer>. Why not Set<Character>?

Don't use TreeSet unless you need it. HashSet is faster.

Don't call it set, when you can call it e.g. foundCharacters.

char arr[] = str.toCharArray();

Naming.

while(i<len && j<len){

Spacing. Always prefer "for" loops as they're much clearer. They also allow you to declare variables in a smaller scope. I'd go for

for (int i=0, j=0; i<len && j<len; ++j) ...

(add spaces to adhere to conventions), but you're modifying j somewhere in the middle (which is unexpected with a "for" loop). So feel free to stick with "while".

 int substrLen=Math.abs(i-j);

This looks like you lost track of what your variables do. I guess, j>=i always holds, but you should know. You may want to call them start and end.

 int substrLen = Math.abs(i-(j+1));

This makes me suspect. Why do you need a different expression for computing the same thing?

Optimizations

How can I tweak my logic to improve running time for large strings?

What the problem? You run through the string until you find out that they're too many characters. Then you throw away all information collected so far and start over at the next index. This implies a complexity of \$O(n^2)\,ドル which is way too slow for \$n=10^6\$.

When you find out that you collected too many distinct characters, you could move the start index until their number gets smaller again. For this, a Set is insufficient, as you have to know, when you get rid of the last occurrence of a given character.

So instead of remembering what characters you ran across, you need to count their occurrences. A Map<Character, Integer> would do. Guava Multiset <Character> would be even better as a multiset is just like a set but allowing duplicates.

But the best solution is an array. As there are just \2ドル^{16}\$ characters (let's ignore surrogates), you can keep track of them in new int[Character.MAX_VALUE+1]. This gives you a nice speed bonus of maybe factor 10 (it's just a bonus, it would not help you with the quadratic complexity and it's most probably not needed with the new algorithm).

Just a snippet to get you started:

private int distinctChars;
private int[] counts = new int[Character.MAX_VALUE+1];
private void addCharacter(char c) {
 if (counts[c] == 0) {
 distinctChars++;
 }
 counts[c]++;
}
private void removeCharacter(char c) {...}
public int maximumLength(String str, int limit) {
 ...
 for (int i=0, j=0; j<len; ) {
 if (distinctChars <= limit) {
 max = Math.max(max, j-i);
 addCharacter(str[j++]);
 } else {
 removeCharacter(str[i++]);
 }
 }
 ...
}

Question 6

One improvement would be to move the index i to the next unique character instead of the next character i=i+1. There's no point in moving forward by one if the character arr[i+1] == arr[i]. That is, set i=m for the first m>i such that a[m] != a[i].
Also, don't clear out set every time you find a match. You are recomputing information you've already computed once. Instead, just remove the oldest character arr[i] form set and keep going. You'll have to stop resetting j as well.

Bottom line, since you are looking for longest sub-string, when you find a match, just pop oldest character and start adding the newest characters without starting from scratch every time.

UPDATE

As pointed out, your set won't work for this sort of running approach. Instead of just a set you will have to keep a counter for each set element. So, you can then pop all the oldest characters until one of the element counters drops to 0.

Question 7

The set is not enough for keeping the information the way you propose. You can't simply remove the oldest character as the string may be something like abbbbbbabbbcbbb. When you encounter c, you get over the limit (assuming limit=2), but you can't remove a from the set as there's another one later.

maaartinus maaartinus 13.6k1 gold badge35 silver badges73 bronze badges · Accepted Answer · 2015-07-12 01:49:47Z

//use set to keep track of distinct characters
Set<Integer> set=new TreeSet<>();

Use spaces, they're free. According to conventions, there should be spaces around each operator. If you want to use them more sparingly, it's your choice, but think about how they separate things. You code looks like composed of three things

Set<Integer> set=new TreeSet<>();

which is rather confusing.

You write "keep track of distinct characters" and declare Set<Integer>. Why not Set<Character>?

Don't use TreeSet unless you need it. HashSet is faster.

Don't call it set, when you can call it e.g. foundCharacters.

char arr[] = str.toCharArray();

Naming.

while(i<len && j<len){

Spacing. Always prefer "for" loops as they're much clearer. They also allow you to declare variables in a smaller scope. I'd go for

for (int i=0, j=0; i<len && j<len; ++j) ...

(add spaces to adhere to conventions), but you're modifying j somewhere in the middle (which is unexpected with a "for" loop). So feel free to stick with "while".

 int substrLen=Math.abs(i-j);

This looks like you lost track of what your variables do. I guess, j>=i always holds, but you should know. You may want to call them start and end.

 int substrLen = Math.abs(i-(j+1));

This makes me suspect. Why do you need a different expression for computing the same thing?

Optimizations

How can I tweak my logic to improve running time for large strings?

What the problem? You run through the string until you find out that they're too many characters. Then you throw away all information collected so far and start over at the next index. This implies a complexity of \$O(n^2)\,ドル which is way too slow for \$n=10^6\$.

When you find out that you collected too many distinct characters, you could move the start index until their number gets smaller again. For this, a Set is insufficient, as you have to know, when you get rid of the last occurrence of a given character.

So instead of remembering what characters you ran across, you need to count their occurrences. A Map<Character, Integer> would do. Guava Multiset <Character> would be even better as a multiset is just like a set but allowing duplicates.

But the best solution is an array. As there are just \2ドル^{16}\$ characters (let's ignore surrogates), you can keep track of them in new int[Character.MAX_VALUE+1]. This gives you a nice speed bonus of maybe factor 10 (it's just a bonus, it would not help you with the quadratic complexity and it's most probably not needed with the new algorithm).

Just a snippet to get you started:

private int distinctChars;
private int[] counts = new int[Character.MAX_VALUE+1];
private void addCharacter(char c) {
 if (counts[c] == 0) {
 distinctChars++;
 }
 counts[c]++;
}
private void removeCharacter(char c) {...}
public int maximumLength(String str, int limit) {
 ...
 for (int i=0, j=0; j<len; ) {
 if (distinctChars <= limit) {
 max = Math.max(max, j-i);
 addCharacter(str[j++]);
 } else {
 removeCharacter(str[i++]);
 }
 }
 ...
}

Stack Exchange Network

Finding longest substring containing k distinct characters

2 Answers 2

Optimizations

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Finding longest substring containing k distinct characters

2 Answers 2

Optimizations

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions