I wrote this for a problem in one of the online contest at hackerearth. Problem statement is:
Input:
The first line contains the compressed string in form a2b4h5a2
Expanding it means aabbbbhhhhhaa.
You have to sort the expansion in increasing order(a-z). ie. aaaabbbbhhhhh
This is followed by Q that indicates the number of queries.
Follows Q lines each line contain an integer K.
Output:
For every question output the Kth character if it exists else print -1 .
It works correctly as intended, but help me make it faster as it exceeds time limit for that portal. Any constructive criticism and advice is appreciated. :)
The idea is that, I made a integer array of size 26 to store the respective frequency of characters, thus sorting step is done automatically. Then just started subtracting the frequency value from the query until its zero or negative, that index is the required alphabet.
#include<stdio.h>
int main()
{
// made a array to store frequency of alphabets
int ar[26] = {0};
int n=0, i=0, c=0, Q=0, K=0, m=0, chint=0;
char str[10000]; // use to store the compressed format string
scanf("%10000[^\n]", str);
for(i=0;str[i];i=i++)
{
if((str[i]>='a' && str[i]<='z'))
{
ar[m] += n;
m = str[i]-97;
n=0;
}
else
{
chint=(int)str[i]-48;
n=(n*10)+chint;
}
}
ar[m] += n;
scanf("%d",&Q);
while(Q--)
{
scanf("%d",&K);
for(i = 0; i<26; i++)
{
K = K - ar[i];
if(K<=0)
{
printf("%c\n", (i+97));
break;
}
}
if(K>0)
printf("-1\n");
}
return 0;
}
-
\$\begingroup\$ Can you link the actual challenge in your post? You have left elements of the compressed format for us to guess. From your code it appears the compressed format is value-count where count is expressed as an ASCII digit and you can have several count digits in a row. This contrasts with the typical run-length encoding. \$\endgroup\$twohundredping– twohundredping2016年02月12日 18:47:02 +00:00Commented Feb 12, 2016 at 18:47
-
\$\begingroup\$ @twohundredping Well it's a part of hiring challenge still going on so i cant give u direct link to problem. [the challenge link]( hackerearth.com/nutanix-backend-hiring-challenge). Its one of the 2 coding problems in the end. \$\endgroup\$windlessStorm– windlessStorm2016年02月12日 22:22:04 +00:00Commented Feb 12, 2016 at 22:22
-
\$\begingroup\$ Is there numbers with more than one digit? \$\endgroup\$removed– removed2016年02月13日 01:23:22 +00:00Commented Feb 13, 2016 at 1:23
-
\$\begingroup\$ @WashingtonGuedes Yes there is, input can be of form a345b654c908. the frequency can be anywhere between 0-10^7. But total buffer size will be less than 10^4 characters combined that is size or str[]. \$\endgroup\$windlessStorm– windlessStorm2016年02月13日 05:40:50 +00:00Commented Feb 13, 2016 at 5:40
4 Answers 4
Warning
Enabling your compiler's warnings would tell you that c
is unused.
Variable declarations
Following previous comments, I consider it good practice to declare your variable in the smallest possible scope and as late as possible. This makes things clearer (and prevent you from having unused variables). Using a "recent" version of C allows you to declare your variable in the for
loop.
While/for
Your loop over Q
is written as a while
loop but could be written as a for
loop.
i=i++
This is pretty weird code and according to my compiler warnings, might be undefined behavior.
expand.c:10:25: warning: operation on ‘i’ may be undefined [-Wsequence-point]
Ascii values
You are mixing literal characters (like 'a') that are easy to understand and ascii values (like '97') for no obvious reason. I'd rather have the literal characters used everywhere. It makes clearer the fact that we try to convert chars to indices in once case and to numbers in the other case.
Using temp variable
It would probably make sense to use a temporary variable for str[i]
in order to make things clearer.
Variable names
The variable names you are using could probably be improved. Among other things m
seems pretty cryptic to me.
Draft
At this stage, I have the following code:
#include<stdio.h>
int main()
{
int ar[26] = {0};
char str[10000];
scanf("%10000[^\n]", str);
int last_char = 0;
int nb_occ = 0;
for(int i=0;str[i];i=i++)
{
int c = str[i];
if((c>='a' && c<='z'))
{
ar[last_char] += nb_occ;
last_char = c-'a';
nb_occ=0;
}
else
{
int chint=(int)c-'0';
nb_occ=(nb_occ*10)+chint;
}
}
ar[last_char] += nb_occ;
int Q = 0;
scanf("%d",&Q);
while(Q--)
{
int K = 0;
scanf("%d",&K);
for(int i = 0; i<26; i++)
{
K = K - ar[i];
if(K<=0)
{
printf("%c\n", (i+97));
break;
}
}
if(K>0)
printf("-1\n");
}
return 0;
}
Code organisation
On this kind of task, it is usually worth trying to split the input/output logic from the actual computing logic. Here, you could define 2 functions : one to initialise some object like structure from a string, one to use the structure to get the Kth character (K being the argument to the method).
This would allow you to write unit tests which would help you improve your algorithm.
Algorithm
You have used the correct ideas : use the proper data structure to store the different values used.
An improvement you could implement is to define a structure with the cumulated numbers that you compute once so that then you can get the kth element with a simple array lookup (actually as suggested by ChrisWue, maybe you'd need a binary search and not a single lookup). When multiple queries are performed (which is I think what is happening in your case), this should make things much faster.
The idea of doing some ("long") preprocessing once to make (multiple) queries much faster is a pretty classic solution in real life and in programming challenges.
-
\$\begingroup\$ Thanks, I tried changing my array to hold running sum as suggested by @ChrisWue and did binary search on it, but to success. Do you think using any other structure (tree) might speed it up or how can preprocessor work here. \$\endgroup\$windlessStorm– windlessStorm2016年02月12日 22:29:55 +00:00Commented Feb 12, 2016 at 22:29
I see some things that may help you improve your code.
Use more whitespace
Lines like this one:
for(i=0;str[i];i=i++)
are harder to read than they should be. Use whitespace to make the code easier to read and understand:
for(i = 0; str[i]; i = i++)
Use C idioms
The for
loop above isn't technically incorect, but it's odd-looking for those who are fluent in C. The issue is that this i = i++
should just be written as i++
.
Eliminate unused variables
This code declares a variable c
but then does nothing with it. Your compiler is smart enough to help you find this kind of problem if you know how to ask it to do so.
Eliminate "magic numbers"
This code has a number of "magic numbers," that is, unnamed constants such as 26, 48, 10000, etc. Generally it's better to avoid that and give such constants meaningful names. That way, if anything ever needs to be changed, you won't have to go hunting through the code for all instances of "3" and then trying to determine if this particular 3 is relevant to the desired change or if it is some other constant that happens to have the same value.
Since you're using 10000 as both the size of the buffer and as part of the format string for scanf
, it's a little bit tricky, but there's a nice way to do it in C:
#define BUFFLEN 10000
#define XSTR(x) STR(x)
#define STR(x) #x
Now when you want to use BUFFLEN
as a number, it can be done like this:
char str[BUFFLEN];
Using it in the format string is a little different:
scanf("%" XSTR(BUFFLEN) "[^\n]", str);
See the gcc
explanation of stringification for details on how and why this works.
Use constants consistently
In one case the code has str[i] >= 'a'
and in another it uses m = str[i] - 97
. It's much more clear if the latter were written instead as m = str[i] - 'a'
.
Use better variable names
The variable name str
is OK, but the names n
, Q
and chint
are not. The first name explains something about what the variable means within the context of the code, but the others provide no hint as to what the variable is storing or why it might be useful.
Consider using a tree structure
Ultimately, what your program needs to do is to take an input number and produce the corresponding number. The current code has the right idea in that it uses a data structure rather than an explicitly uncompressed string, but there are potentially better ways to store the data. Consider using a tree structure instead (or in addition to) the existing structure. Each node could store the cumulative count and the associated letter, so finding the answer would simply be a matter of searching the tree.
For example, the given example string translates to aaaabbbbhhhhh
. That could be represented by the following binary search tree:
If the query is "what character is at position \$n\$`, we simply search the tree. For example, if \$n=9\,ドル we start at the root node and look for a node with the smallest number \$\ge n\$. In the case that \$n=9\,ドル we see that \9ドル > 8\,ドル so move the right child node. There we see that \9ドル <= 13\,ドル so the correct answer is the letter stored at that node = 'h'.
Only two comparisons are required instead of 10 that would be required for a linear search of the existing structure.
Check return values for errors
The calls scanf
can fail. You must check the return values to make sure they haven't or your program may crash (or worse) when given malformed input or due to low system resources. Rigorous error handling is the difference between mostly working versus bug-free software. You should strive for the latter.
Eliminate return 0
at the end of main
Since C99, the compiler automatically generates the code corresponding to return 0
at the end of main
so there is no need to explicitly write it.
-
\$\begingroup\$ I think your cumulative count suggestion needs more explanation. At first glance, storing the cumulative count of each letter requires at least one \$O(n)\$ traversal of the list which is not any better than just searching the list outright. \$\endgroup\$twohundredping– twohundredping2016年02月12日 20:53:58 +00:00Commented Feb 12, 2016 at 20:53
-
\$\begingroup\$ @twohundredping: I've added more explanation of how the tree would work to address this problem. It's more efficient (in terms of number of comparisons) than both the existing algorithm and the other proposed answers. \$\endgroup\$Edward– Edward2016年02月12日 21:50:43 +00:00Commented Feb 12, 2016 at 21:50
-
\$\begingroup\$ @Edward wow, thanks was looking for an alternative structure implementation. Will try this out and see how much improvement I get. \$\endgroup\$windlessStorm– windlessStorm2016年02月12日 22:31:41 +00:00Commented Feb 12, 2016 at 22:31
-
\$\begingroup\$ If you already have implemented a binary search on a linear array, then instead of having an actual tree structure, you can create a packed array (that is, all non-zero count letters in order with associated cumulative count) and do a binary search of that. In terms of search efficiency they are identical. If you implement a tree, read this to learn how to efficiently create a BST from a sorted array. \$\endgroup\$Edward– Edward2016年02月12日 22:43:52 +00:00Commented Feb 12, 2016 at 22:43
Largely some readability things:
You should use
getline
to read an entire line. If you pass in aNULL
pointer it will manage the memory allocation for you (you just have to remember tofree
it later).You check for
'a'
and'z'
but then later subtract the magic number97
. You can subtract'a'
instead which will make it clearer what you are doing. Same goes for subtracting'0'
instead of48
.You could sprinkle a round a few more spaces to make the code look less squeezed:
for (i = 0; str[i]; i = i++)
Name your variables more sensibly like
current_letter
instead ofm
letter_histogram
instead ofar
current_count
instead ofn
It makes it a lot easier to figure out what each variable is meant to be used for (especially for others).
Apart from that I can't really fault your algorithm - you count how often each letter is there and then go look for the bucket. One possibility would be to calculate the running sum of the final ar
int running_sum = 0;
for (i = 0; i < 26; ++i)
{
ar[i] += running_sum;
running_sum = ar[i];
}
which turns something like [22, 4, 0, 16, ...]
into [22, 26, 26, 42, ...]
Then you can perform a binary search for the first bucket > K
instead of having to do a linear scan (so at most 5 steps instead of 26).
-
\$\begingroup\$ Note:
getline()
is not part of standard C. \$\endgroup\$chux– chux2016年02月12日 21:05:28 +00:00Commented Feb 12, 2016 at 21:05 -
\$\begingroup\$ Thanks for feedback, well I changed my array to store the running sum and did binary search on it but it still exceeded time limit. Maybe using preprocessing to do bulk of work may solve that as suggested by @Josay \$\endgroup\$windlessStorm– windlessStorm2016年02月12日 22:19:05 +00:00Commented Feb 12, 2016 at 22:19
The following has problems. If input had 10,000 character before
'\n'
,str[]
would overfill when the'0円'
was appended. At the other end,scanf("%10000[^\n]
reads nothing intostr
if the first character is'\n'
. In both cases, following code is undefined behavior asstr
is not set or is corrupted. UB. Better to usefgets()
// problems char str[10000]; scanf("%10000[^\n]", str); // should have been 9999 // better if (fgets(str, sizeof str, stdin) == NULL) return 1; // fail str[strcspn(str, "\n")] = '0円'; // lop off potential trailing \n
Check the results of input. Check range - user input is evil - do not trust it.
// scanf("%d",&Q); if (scanf("%d",&Q) != 1) { fprintf(stderr, "Non-numeric input for Q\n"); return 1; } // while(Q--) // What happens if -1 was entered? while(Q-- > 0)
int
overflow leads to undefined behavior. Recommend usingunsigned
variables rather thanint
. Alternatively check for overflow// n=(n*10)+chint; if (n > (INT_MAX - chinit)/10) { fprintf(stderr, "Overflow\n"); return 1; } n=(n*10)+chint;
The below are minor.
Avoid magic numbers
// m = str[i]-97; m = str[i] - 'a';
Unnecessary cast. A
char
will promote toint
(or rarelyunsigned
) anyway.// chint=(int)str[i]-48; chint = str[i] - '0';
Better to use standard forms of
main()
// int main() int main(void)
Explore related questions
See similar questions with these tags.