Expanding compressed string

Question 1

I wrote this for a problem in one of the online contest at hackerearth. Problem statement is:

Input:

The first line contains the compressed string in form a2b4h5a2

Expanding it means aabbbbhhhhhaa.

You have to sort the expansion in increasing order(a-z). ie. aaaabbbbhhhhh

This is followed by Q that indicates the number of queries.

Follows Q lines each line contain an integer K.

Output:

For every question output the Kth character if it exists else print -1 .

It works correctly as intended, but help me make it faster as it exceeds time limit for that portal. Any constructive criticism and advice is appreciated. :)

The idea is that, I made a integer array of size 26 to store the respective frequency of characters, thus sorting step is done automatically. Then just started subtracting the frequency value from the query until its zero or negative, that index is the required alphabet.

#include<stdio.h>
int main() 
{
 // made a array to store frequency of alphabets 
 int ar[26] = {0}; 
 int n=0, i=0, c=0, Q=0, K=0, m=0, chint=0;
 char str[10000]; // use to store the compressed format string
 scanf("%10000[^\n]", str);
 for(i=0;str[i];i=i++)
 {
 if((str[i]>='a' && str[i]<='z')) 
 {
 ar[m] += n;
 m = str[i]-97;
 n=0;
 }
 else
 {
 chint=(int)str[i]-48;
 n=(n*10)+chint;
 }
 }
 ar[m] += n;
 scanf("%d",&Q);
 while(Q--)
 {
 scanf("%d",&K);
 for(i = 0; i<26; i++)
 {
 K = K - ar[i];
 if(K<=0)
 {
 printf("%c\n", (i+97));
 break;
 }
 }
 if(K>0)
 printf("-1\n");
 }
 return 0;
}

Question 2

Can you link the actual challenge in your post? You have left elements of the compressed format for us to guess. From your code it appears the compressed format is value-count where count is expressed as an ASCII digit and you can have several count digits in a row. This contrasts with the typical run-length encoding.

Question 3

@twohundredping Well it's a part of hiring challenge still going on so i cant give u direct link to problem. [the challenge link]( hackerearth.com/nutanix-backend-hiring-challenge). Its one of the 2 coding problems in the end.

Question 4

Is there numbers with more than one digit?

Question 5

@WashingtonGuedes Yes there is, input can be of form a345b654c908. the frequency can be anywhere between 0-10^7. But total buffer size will be less than 10^4 characters combined that is size or str[].

Question 6

Warning

Enabling your compiler's warnings would tell you that c is unused.

Variable declarations

Following previous comments, I consider it good practice to declare your variable in the smallest possible scope and as late as possible. This makes things clearer (and prevent you from having unused variables). Using a "recent" version of C allows you to declare your variable in the for loop.

While/for

Your loop over Q is written as a while loop but could be written as a for loop.

i=i++

This is pretty weird code and according to my compiler warnings, might be undefined behavior.

expand.c:10:25: warning: operation on ‘i’ may be undefined [-Wsequence-point]

More about this here.

Ascii values

You are mixing literal characters (like 'a') that are easy to understand and ascii values (like '97') for no obvious reason. I'd rather have the literal characters used everywhere. It makes clearer the fact that we try to convert chars to indices in once case and to numbers in the other case.

Using temp variable

It would probably make sense to use a temporary variable for str[i] in order to make things clearer.

Variable names

The variable names you are using could probably be improved. Among other things m seems pretty cryptic to me.

Draft

At this stage, I have the following code:

#include<stdio.h>
int main()
{
 int ar[26] = {0};
 char str[10000];
 scanf("%10000[^\n]", str);
 int last_char = 0;
 int nb_occ = 0;
 for(int i=0;str[i];i=i++)
 {
 int c = str[i];
 if((c>='a' && c<='z'))
 {
 ar[last_char] += nb_occ;
 last_char = c-'a';
 nb_occ=0;
 }
 else
 {
 int chint=(int)c-'0';
 nb_occ=(nb_occ*10)+chint;
 }
 }
 ar[last_char] += nb_occ;
 int Q = 0;
 scanf("%d",&Q);
 while(Q--)
 {
 int K = 0;
 scanf("%d",&K);
 for(int i = 0; i<26; i++)
 {
 K = K - ar[i];
 if(K<=0)
 {
 printf("%c\n", (i+97));
 break;
 }
 }
 if(K>0)
 printf("-1\n");
 }
 return 0;
}

Code organisation

On this kind of task, it is usually worth trying to split the input/output logic from the actual computing logic. Here, you could define 2 functions : one to initialise some object like structure from a string, one to use the structure to get the Kth character (K being the argument to the method).

This would allow you to write unit tests which would help you improve your algorithm.

Algorithm

You have used the correct ideas : use the proper data structure to store the different values used.

An improvement you could implement is to define a structure with the cumulated numbers that you compute once so that then you can get the kth element with a simple array lookup (actually as suggested by ChrisWue, maybe you'd need a binary search and not a single lookup). When multiple queries are performed (which is I think what is happening in your case), this should make things much faster.

The idea of doing some ("long") preprocessing once to make (multiple) queries much faster is a pretty classic solution in real life and in programming challenges.

Question 7

Thanks, I tried changing my array to hold running sum as suggested by @ChrisWue and did binary search on it, but to success. Do you think using any other structure (tree) might speed it up or how can preprocessor work here.

Question 8

I see some things that may help you improve your code.

Use more whitespace

Lines like this one:

for(i=0;str[i];i=i++)

are harder to read than they should be. Use whitespace to make the code easier to read and understand:

for(i = 0; str[i]; i = i++)

Use C idioms

The for loop above isn't technically incorect, but it's odd-looking for those who are fluent in C. The issue is that this i = i++ should just be written as i++.

Eliminate unused variables

This code declares a variable c but then does nothing with it. Your compiler is smart enough to help you find this kind of problem if you know how to ask it to do so.

Eliminate "magic numbers"

This code has a number of "magic numbers," that is, unnamed constants such as 26, 48, 10000, etc. Generally it's better to avoid that and give such constants meaningful names. That way, if anything ever needs to be changed, you won't have to go hunting through the code for all instances of "3" and then trying to determine if this particular 3 is relevant to the desired change or if it is some other constant that happens to have the same value.

Since you're using 10000 as both the size of the buffer and as part of the format string for scanf, it's a little bit tricky, but there's a nice way to do it in C:

#define BUFFLEN 10000
#define XSTR(x) STR(x)
#define STR(x) #x

Now when you want to use BUFFLEN as a number, it can be done like this:

char str[BUFFLEN];

Using it in the format string is a little different:

scanf("%" XSTR(BUFFLEN) "[^\n]", str);

See the gcc explanation of stringification for details on how and why this works.

Use constants consistently

In one case the code has str[i] >= 'a' and in another it uses m = str[i] - 97. It's much more clear if the latter were written instead as m = str[i] - 'a'.

Use better variable names

The variable name str is OK, but the names n, Q and chint are not. The first name explains something about what the variable means within the context of the code, but the others provide no hint as to what the variable is storing or why it might be useful.

Consider using a tree structure

Ultimately, what your program needs to do is to take an input number and produce the corresponding number. The current code has the right idea in that it uses a data structure rather than an explicitly uncompressed string, but there are potentially better ways to store the data. Consider using a tree structure instead (or in addition to) the existing structure. Each node could store the cumulative count and the associated letter, so finding the answer would simply be a matter of searching the tree.

For example, the given example string translates to aaaabbbbhhhhh. That could be represented by the following binary search tree:

binary search tree example

If the query is "what character is at position \$n\$`, we simply search the tree. For example, if \$n=9\,ドル we start at the root node and look for a node with the smallest number \$\ge n\$. In the case that \$n=9\,ドル we see that \9ドル > 8\,ドル so move the right child node. There we see that \9ドル <= 13\,ドル so the correct answer is the letter stored at that node = 'h'.

Only two comparisons are required instead of 10 that would be required for a linear search of the existing structure.

Check return values for errors

The calls scanf can fail. You must check the return values to make sure they haven't or your program may crash (or worse) when given malformed input or due to low system resources. Rigorous error handling is the difference between mostly working versus bug-free software. You should strive for the latter.

Eliminate `return 0` at the end of `main`

Since C99, the compiler automatically generates the code corresponding to return 0 at the end of main so there is no need to explicitly write it.

Question 9

I think your cumulative count suggestion needs more explanation. At first glance, storing the cumulative count of each letter requires at least one \$O(n)\$ traversal of the list which is not any better than just searching the list outright.

Question 10

@twohundredping: I've added more explanation of how the tree would work to address this problem. It's more efficient (in terms of number of comparisons) than both the existing algorithm and the other proposed answers.

Question 11

@Edward wow, thanks was looking for an alternative structure implementation. Will try this out and see how much improvement I get.

Question 12

If you already have implemented a binary search on a linear array, then instead of having an actual tree structure, you can create a packed array (that is, all non-zero count letters in order with associated cumulative count) and do a binary search of that. In terms of search efficiency they are identical. If you implement a tree, read this to learn how to efficiently create a BST from a sorted array.

Question 13

Largely some readability things:

You should use getline to read an entire line. If you pass in a NULL pointer it will manage the memory allocation for you (you just have to remember to free it later).
You check for 'a' and 'z' but then later subtract the magic number 97. You can subtract 'a' instead which will make it clearer what you are doing. Same goes for subtracting '0' instead of 48.
You could sprinkle a round a few more spaces to make the code look less squeezed:
```
for (i = 0; str[i]; i = i++)
```
Name your variables more sensibly like
- current_letter instead of m
- letter_histogram instead of ar
- current_count instead of n
It makes it a lot easier to figure out what each variable is meant to be used for (especially for others).

Apart from that I can't really fault your algorithm - you count how often each letter is there and then go look for the bucket. One possibility would be to calculate the running sum of the final ar

int running_sum = 0;
for (i = 0; i < 26; ++i)
{
 ar[i] += running_sum;
 running_sum = ar[i];
}

which turns something like [22, 4, 0, 16, ...] into [22, 26, 26, 42, ...] Then you can perform a binary search for the first bucket > K instead of having to do a linear scan (so at most 5 steps instead of 26).

Question 14

Note: getline() is not part of standard C.

Question 15

Thanks for feedback, well I changed my array to store the running sum and did binary search on it but it still exceeded time limit. Maybe using preprocessing to do bulk of work may solve that as suggested by @Josay

Question 16

The following has problems. If input had 10,000 character before '\n', str[] would overfill when the '0円' was appended. At the other end, scanf("%10000[^\n] reads nothing into str if the first character is '\n'. In both cases, following code is undefined behavior as str is not set or is corrupted. UB. Better to use fgets()
```
 // problems 
 char str[10000];
 scanf("%10000[^\n]", str); // should have been 9999
 // better
 if (fgets(str, sizeof str, stdin) == NULL) return 1; // fail
 str[strcspn(str, "\n")] = '0円'; // lop off potential trailing \n
```

Check the results of input. Check range - user input is evil - do not trust it.

 // scanf("%d",&Q);
 if (scanf("%d",&Q) != 1) {
 fprintf(stderr, "Non-numeric input for Q\n");
 return 1;
 }
 // while(Q--) // What happens if -1 was entered?
 while(Q-- > 0)

int overflow leads to undefined behavior. Recommend using unsigned variables rather than int. Alternatively check for overflow
```
 // n=(n*10)+chint;
 if (n > (INT_MAX - chinit)/10) {
 fprintf(stderr, "Overflow\n");
 return 1;
 }
 n=(n*10)+chint;
```
The below are minor.
Avoid magic numbers
```
 // m = str[i]-97;
 m = str[i] - 'a';
```
Unnecessary cast. A char will promote to int (or rarely unsigned) anyway.
```
 // chint=(int)str[i]-48;
 chint = str[i] - '0';
```
Better to use standard forms of main()
```
 // int main() 
 int main(void) 
```

SylvainD SylvainD 29.8k1 gold badge49 silver badges93 bronze badges · Accepted Answer · 2016-02-12 18:20:52Z

Warning

Enabling your compiler's warnings would tell you that c is unused.

Variable declarations

Following previous comments, I consider it good practice to declare your variable in the smallest possible scope and as late as possible. This makes things clearer (and prevent you from having unused variables). Using a "recent" version of C allows you to declare your variable in the for loop.

While/for

Your loop over Q is written as a while loop but could be written as a for loop.

i=i++

This is pretty weird code and according to my compiler warnings, might be undefined behavior.

expand.c:10:25: warning: operation on ‘i’ may be undefined [-Wsequence-point]

More about this here.

Ascii values

You are mixing literal characters (like 'a') that are easy to understand and ascii values (like '97') for no obvious reason. I'd rather have the literal characters used everywhere. It makes clearer the fact that we try to convert chars to indices in once case and to numbers in the other case.

Using temp variable

It would probably make sense to use a temporary variable for str[i] in order to make things clearer.

Variable names

The variable names you are using could probably be improved. Among other things m seems pretty cryptic to me.

Draft

At this stage, I have the following code:

#include<stdio.h>
int main()
{
 int ar[26] = {0};
 char str[10000];
 scanf("%10000[^\n]", str);
 int last_char = 0;
 int nb_occ = 0;
 for(int i=0;str[i];i=i++)
 {
 int c = str[i];
 if((c>='a' && c<='z'))
 {
 ar[last_char] += nb_occ;
 last_char = c-'a';
 nb_occ=0;
 }
 else
 {
 int chint=(int)c-'0';
 nb_occ=(nb_occ*10)+chint;
 }
 }
 ar[last_char] += nb_occ;
 int Q = 0;
 scanf("%d",&Q);
 while(Q--)
 {
 int K = 0;
 scanf("%d",&K);
 for(int i = 0; i<26; i++)
 {
 K = K - ar[i];
 if(K<=0)
 {
 printf("%c\n", (i+97));
 break;
 }
 }
 if(K>0)
 printf("-1\n");
 }
 return 0;
}

Code organisation

On this kind of task, it is usually worth trying to split the input/output logic from the actual computing logic. Here, you could define 2 functions : one to initialise some object like structure from a string, one to use the structure to get the Kth character (K being the argument to the method).

This would allow you to write unit tests which would help you improve your algorithm.

Algorithm

You have used the correct ideas : use the proper data structure to store the different values used.

An improvement you could implement is to define a structure with the cumulated numbers that you compute once so that then you can get the kth element with a simple array lookup (actually as suggested by ChrisWue, maybe you'd need a binary search and not a single lookup). When multiple queries are performed (which is I think what is happening in your case), this should make things much faster.

The idea of doing some ("long") preprocessing once to make (multiple) queries much faster is a pretty classic solution in real life and in programming challenges.

Thanks, I tried changing my array to hold running sum as suggested by @ChrisWue and did binary search on it, but to success. Do you think using any other structure (tree) might speed it up or how can preprocessor work here.

Stack Exchange Network

Expanding compressed string

4 Answers 4

Use more whitespace

Use C idioms

Eliminate unused variables

Eliminate "magic numbers"

Use constants consistently

Use better variable names

Consider using a tree structure

Check return values for errors

Eliminate `return 0` at the end of `main`

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Expanding compressed string

4 Answers 4

Use more whitespace

Use C idioms

Eliminate unused variables

Eliminate "magic numbers"

Use constants consistently

Use better variable names

Consider using a tree structure

Check return values for errors

Eliminate return 0 at the end of main

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Eliminate `return 0` at the end of `main`