I started learning C
a week ago, and here is my implementation for the split
function:
char** my_split(const char* str, char delim, int* size)
{
int index = 0, start;
char** results = NULL;
*size = 0;
do
{
if ((*size) % 10 == 0) //top floor reached
{
results = realloc(results, ((*size) + 10) * sizeof(char*)); //extend the space for another 10 pointers
if (results == NULL) //if failed, return NULL
return NULL;
}
start = index;
while (str[index] != delim && str[index]) //find the delimeter or end of string
{
index++;
}
int length = index - start; //length of matching string
results[*size] = malloc((length + 1) * sizeof(char)); //allocate memory in the size of the string, and another char for '0円'
if (results[*size] == NULL) //if failed, return NULL
return NULL;
strncpy(results[*size], &str[start], length); //copy the matching string to the array
results[*size][length] = 0; //end the string
(*size)++; //add 1 to the total size
} while (str[index++]);
results = realloc(results, (*size) * sizeof(char*)); //trim unused bytes
return results;
}
Is there anything I could do better (memory, performance and error handling)? Maybe in the logic of the function?
Also, feel free to criticize my writing style, as I want it to be more readable and "correct".
3 Answers 3
Note: I would take everything I say with a grain of salt. It's been a few years since I used C on a regular basis, and even at my peak with C, I was never very familiar with the standard or certain best practices.
Your function is responsible for cleaning up memory, yet when you bail in your NULL returns, you don't do that.
In particular, if a realloc fails, the memory already allocated stays, and if allocating the second dimension-ed array fails, you still need to free what you have already allocated before returning.
A review of a few code points:
- I would highly consider abstracting away the memory stuff, but if you don't, I would consider doubling the array size instead of adding ten.
- Consider a fairly long string being split on space. Assume it ends up being 1000 split strings. This means 100 allocations, and allocating is fairly slow. If you were to double, it would be 10 allocations.
- Doubling wastes, at most,
sizeof(type) * n/2
bytes, but has at most,lg n
allocations (note: very rough math) - Doubling using more memory but is typically faster
- Doubling wastes, at most,
- Consider a fairly long string being split on space. Assume it ends up being 1000 split strings. This means 100 allocations, and allocating is fairly slow. If you were to double, it would be 10 allocations.
size
should be an unsigned type (I would use size_t)- As previously mentioned, you have potential memory leaks
- As your code exists, there's really no advantage over strtok
- strtok doesn't very well allow storing into an array, but your function requires fairly manual memory management anyway
- The malloc/strncpy combination could be changed to use strndup and be a lot simpler (I think strndup is standard?)
- You could actually use strtok inside of your function to avoid doing the index calculations
- Part of your code is recreating strtok
- The other part of your code is taking that strtok functionality and using it to push things into an array
- What you have now will (probably) perform better than strtok
I'm not familiar enough with standard practices of C to know what the expected way would be, but something about the the memory handling feels odd to me:
const char[] str = "Hello World";
int size = 0;
char** str_parts = my_split(str, ' ', &size);
if (str_parts != NULL) {
//Let's do something with the parts
for (int i = 0; i < size; ++i) {
printf("Part %d: '%s'\n", size, str_parts[i]);
}
//Now we have to clean up (could technically be in the printing loop)
for (int i = 0; i < size; ++i) {
free(str_parts[i]);
}
free(str_parts);
}
I might consider abstracting the memory away to a generalized vector or a specialized string vector.
That could bring your API to something like:
const char[] str = "Hello World";
cvector parts;
//A vector of char*'s that will call free() on each element before destructing
cvector_init(&parts, sizeof(char*), free);
if (my_split(&parts, ' ')) {
for (size_t i = 0; i < parts.size; ++i) {
//I'm a bit rusty on C, so I'm not sure if the char* cast is needed or not
printf("Part %d: '%s'\n", size, (char*) cvector_get(&parts, i));
}
}
cvector_destory(&parts);
The downside to this would be that you'd be tying your other code to a specific API (though bastardizing the vector into a char**
would actually be fairly easy if you wrote the cvector API in a particular way -- though then you'd be depending on implementation details, which is usually a bad idea), and this vector would perform worse than a plain 2d array. (Though if you wrote a string specialized vector, the performance could be very similar and, depending on design, you could even use it as a char**
without worrying much about the implications.)
-
\$\begingroup\$ That's something I haven't noticed. Thanks for the tip! \$\endgroup\$Novak– Novak2012年08月25日 07:08:45 +00:00Commented Aug 25, 2012 at 7:08
-
\$\begingroup\$ @GuyDavid No problem :). Also, I've edited the post to add a few more suggestions. \$\endgroup\$Corbin– Corbin2012年08月25日 17:10:33 +00:00Commented Aug 25, 2012 at 17:10
-
\$\begingroup\$ Wow, words can not express the gratitude which I owe you, thanks for taking your time for writing this down. I'm now reviewing and applying each one of your points. \$\endgroup\$Novak– Novak2012年08月25日 19:01:00 +00:00Commented Aug 25, 2012 at 19:01
-
\$\begingroup\$ What is the condition that should tell that I should increase the size? (using
realloc
) \$\endgroup\$Novak– Novak2012年08月25日 19:27:44 +00:00Commented Aug 25, 2012 at 19:27 -
\$\begingroup\$ @GuyDavid You will need to keep track of how many elements have been used. (The only other option would be to calculate if the number is a power of two, but that presents performance problems.) Basically if you go the doubling route, you'll need to keep track of the size of the array and how many elements you have. Every time numElems == size, double the array. \$\endgroup\$Corbin– Corbin2012年08月25日 19:44:06 +00:00Commented Aug 25, 2012 at 19:44
For only 1-week experience I'd say you are top of the class. However, I do have some notes on your function (to add to those of Corbin):
if the delimiter character is repeated you output empty strings. Is that desirable?
calls to
realloc
usually assign the result to a temporary variable so that if NULL is returned the original pointer is not lost (if it is lost, it cannot be freed).nested loops usually indicate a lazy or inexperienced programmer. Your nested while() is just a strchr() - check what functions are available in the standard library. If you are on linux/Unix/MacOS, do 'man strchr' and look down the bottom of the manual page for 'See Also'. Browse around.
sizeof(char) is 1 by definition. Use 1.
in your use of strncpy, where the length is guaranteed to be enough, strncpy guarantees to terminate the new string for you; you need not do it yourself. However...
as you already know the length of the string to be copied, strncpy is inefficient. strncpy has to check each character against 0円 to know when to stop copying. Where you know the length, memcpy is preferable (followed by manual termination, as you have it).
some of your comments are just 'noise'
be nice to readers, stick to 80 character width. This applies to readers of your code in general, not just for readers of Stack Exchange.
consider whether it would be better to copy the entire input string once at the top and then create a list of pointers into that copy, rather than allocating each sub-string.
consider whether the interface is good. This was just an exercise, so maybe the interface was specified for you. If you were designing it yourself, would you use this interface? Maybe so - I'm not saying it is wrong. But there are alternatives that would avoid the memory allocation. For example, consider passing in a writable string, an array of char* and its length and returning the number of strings. etc.
If you don't mind modifying the caller's input buffer, there's another style of doing this in C that results in fewer allocations.
Let's look at strsep
, present in some C libraries as a less ugly version of the older strtok
:
char *
strsep(char **stringp, const char *delim);
*stringp
starts out as the beginning of the string, the function searches for chars in delim
, replaces the delimiter with a NUL
character and updates *stringp
. So the caller would look like:
char buf[] = "The quick brown fox ...";
char *string = buf;
char *r;
while ((r = strsep(&string, " ")))
puts(r);
Here's what the function might look like:
char *
strsep(char **stringp, const char *delim)
{
char *ret = *stringp;
if (ret)
{
char *next = strpbrk(ret, delim);
if (next)
*next++ = 0;
*stringp = next;
}
return ret;
}
As you can see, shorter, simpler code and no allocations. It's also not so hard to implement your more allocation-heavy approach in terms of this, either: start out by duping the caller's buffer, then iterate through this function.