I'd like to suggest the following implementation:
// Find an instance of substr in an array of characters
// the array of characters does not have to be null terminated
// the search is limited to the first n characters in the array.
char *strnstr(char *str, const char *substr, size_t n)
{
char *p = str, *pEnd = str+n;
size_t substr_len = strlen(substr);
if(0 == substr_len)
return str; // the empty string is contained everywhere.
pEnd -= (substr_len - 1);
for(;p < pEnd; ++p)
{
if(0 == strncmp(p, substr, substr_len))
return p;
}
return NULL;
}
The rationale for the first parameter is not to be const is that you may want to use the return value pointer to modify the array in that location.
for completeness, in C++, it's possible to add an overloaded variant that's const:
const char *strnstr(const char *str, const char *substr, size_t n)
{
return strnstr((char *)str, substr, n);
}
Any comments?
As suggested, here's a test program:
#include <iostream>
#include <cstring>
#include <string>
int main()
{
char s[] = "1234567890abcdefgh";
size_t n = sizeof(s) - 1;
const char *patterns[] = { "efgh", "0ab", "0b", NULL };
const char *result = NULL;
const char *pPattern = patterns[0];
std::cout << "array of length " << n << " is: " << s << std::endl;
for (int i = 0; pPattern; pPattern = patterns[++i])
{
result = strnstr(s, pPattern, n);
std::cout << "finding " << pPattern << " n=" << n << ": "
<< (result ? result : "(null)") << std::endl;
}
pPattern = patterns[0];
result = strnstr(s, pPattern, n-1);
std::cout << "finding " << pPattern << " n=" << n-1 << ": "
<< (result ? result : "(null)") << std::endl;
return 0;
}
Output:
array of length 18 is: 1234567890abcdefgh
finding efgh n=18: efgh
finding 0ab n=18: 0abcdefgh
finding 0b n=18: (null)
finding efgh n=17: (null)
-
\$\begingroup\$ Your question is already good. Adding an example of usage would make it even better (even a one-liner, along with example output) \$\endgroup\$Caridorc– Caridorc2016年06月20日 12:44:35 +00:00Commented Jun 20, 2016 at 12:44
-
\$\begingroup\$ I have rolled back the last edit. Please see what you may and may not do after receiving answers . \$\endgroup\$Heslacher– Heslacher2016年06月22日 04:36:52 +00:00Commented Jun 22, 2016 at 4:36
4 Answers 4
Design: The
str
as an array and searched up ton
is inconsistent with string like functions andstrstr()
. Rather than "the search is limited to the first n characters in the array", I'd also expect characters instr
that follow a null character are not searched. IMO, a design flaw.Following review assumes
str[i] == 0
has no special meaning.Weak argument name
str
.str
is the address of an array and maybe not a string. Calling a potential non-stringstr
conveys the wrong idea. Suggestsrc
, etc. When looking for sub-strings, I likeneedle
andhaystack
.As the C version does not change
str
contents, recommend making thatconst
. "The rationale for the first parameter is not to be const is that you may want to use the return value pointer to modify the array in that location." does not apply. Just castchar *
the return value. Followstrstr()
s tyle.// From C library. char *strstr( const char *s1, const char *s2); // Expected signatures: (spaced for clarity) // C char *strnstr(const char *src, const char *substr, size_t n); // C++ char *strnstr( char *src, const char *substr, size_t n); const char *strnstr(const char *src, const char *substr, size_t n);
Using a name that is close to standard names is tricky. C reserves name with certain prefixes, etc and so does *nix, etc. Maybe use
CP_strnstr()
and an optional#define strnstr CP_strnstr
.Corner case: Returning
str
withif(0 == substr_len) return str;
does not make sense whensize == 0
. I'd expectNULL
.Underflow possible. The length of the
needle
may be longer or shorter than thehaystack
// add check if (n + 1 < substr_len) { return NULL; } pEnd -= (substr_len - 1);
Minor
In debug mode, consider testing against
NULL
char *strnstr(char *str, const char *substr, size_t n) { assert(str || n == 0); assert(substr);
-
\$\begingroup\$ Point by point: 1. This was a conscious choice. The use case I had in mind was data that may arrive over the network and may include a mix of binary and ASCII data. I agree that the alternative behavior has merit as well. 2. I’ll edit the code to reflect this suggestion. 3. The way the C library is handling this silently casts away the const which is dangerous for the client code. 4. Interesting point. What does CP stand for? 5. This is a matter of personal preference. 6. This is redundant. It’s already handled by the for-loop condition. 8. Nice option. \$\endgroup\$CplusPuzzle– CplusPuzzle2016年06月22日 04:03:08 +00:00Commented Jun 22, 2016 at 4:03
-
\$\begingroup\$ @CplusPuzzle #3 Agree with you. As the function does not know if
src
was originallyconst char *
orchar *
. Withstrstr()
,, returningchar *
is the compromise that occurred whenconst
was created.strstr()
pre-datesconst
. So mimicking that style has the advantage of consistent user expectation, but is dangerous for the client code as you say. 4. CP is CplusPuzzle. 5. Returning an address outside the range imposed bysize
is problematic. IAC, such corner case benefit by being documented. 6.When the needle is longerpEnd
takes on an illegal value. \$\endgroup\$chux– chux2016年06月22日 04:40:42 +00:00Commented Jun 22, 2016 at 4:40 -
\$\begingroup\$ in 6. Even though
pEnd
takes an illegel value, it's never dereferenced, so it's fine, it's just like an int holding an "illegal number". in 5. the address is the same address that the client code was holding anyway so it's not outside the range the way I see it. \$\endgroup\$CplusPuzzle– CplusPuzzle2016年06月22日 04:52:51 +00:00Commented Jun 22, 2016 at 4:52 -
\$\begingroup\$ @CplusPuzzle If the case of a large needle, the calculation of
pEnd
attempts to calculate a pointer outside the range of&str[0]...&str[size]
. The result does not necessary have where willp < pEnd
hold. Thus the invalidity of if if(p < pEnd)` test. \$\endgroup\$chux– chux2016年06月22日 13:01:08 +00:00Commented Jun 22, 2016 at 13:01 -
1\$\begingroup\$ @CplusPuzzle C11 §6.5.6 8&9 specify this restriction, else the result is undefined behavior (UB). Should you not agree, suggest posting on SO to garner the opinion of others. \$\endgroup\$chux– chux2016年06月22日 14:43:51 +00:00Commented Jun 22, 2016 at 14:43
A few points here.
The name
strnstr
is reserved for the implementation1, so your code has undefined behavior.Unless you're fairly sure that the strings you're dealing with are fairly short (or have a lot of repetition) I'd at least consider a different algorithm such as Knuth-Morris-Pratt or one of the Boyer-Moore variants.
Use of this should almost always be restricted to C, not C++ (in C++ you'd normally want to use
std::string
, not nul-terminated byte sequences).The code can have undefined behavior. If
n
is greater than the size of buffer thatstr
is pointing at, then*pEnd = str+n;
can attempt to produce an invalid pointer. In particular, you're allowed to create a pointer to any position in the array, or you can create a pointer to a position one past the end of the array--but any more than one past the end is not allowed--even attempting to create that pointer (without ever attempting to dereference it) gives undefined behavior--and this isn't purely a technicality either--there have been real systems relatively recently that could throw a hardware exception for such a case (e.g., OS/2 1.x).
1 c draft N1570 (but the same requirements have existed since C89): "Reserved Identifiers":
All identifiers with external linkage in any of the following subclauses (including the future library directions) are always reserved for use as identifiers with external linkage.
and "Future library directions":
Function names that begin with str, mem, or wcs and a lowercase letter may be added to the declarations in the header.
-
\$\begingroup\$ Thanks for the answer. 1. I agree that the name should be changed if this is integrated into a project. 2. I think that such algorithms that have pre-processing and possibly memory allocations hidded inside should have their own dedicated API (so that the pre-processing can be re-used) 3. I just skimmed through the
std::string
API and I couldn't find equivalent behavior. how would you do it? \$\endgroup\$CplusPuzzle– CplusPuzzle2016年06月23日 04:43:15 +00:00Commented Jun 23, 2016 at 4:43 -
1\$\begingroup\$ @CplusPuzzle: It's not built into
string
itself. It's essentially similar tostd::search(foo.begin(), foo.begin() + n, bar.begin(), bar.end());
. The most obvious difference is thatsearch
returns an iterator instead of an index (so if you want an index, subtractfoo.begin()
from the return). \$\endgroup\$Jerry Coffin– Jerry Coffin2016年06月23日 05:21:32 +00:00Commented Jun 23, 2016 at 5:21
I'll suggest the following based on your code. Other than style is does have two optimizations: if needle is empty no point calculating end and as I've seen elsewhere checking first char before calling strncmp seems like a good optimization.
char *strnstr(const char *haystack, const char *needle, size_t haystackLength) {
size_t needleLength = strlen(needle);
if (needleLength == 0) {
// empty string is in any string
return haystack;
}
char *end = haystack + haystackLength - (needleLength - 1);
for (char *start = haystack; start < end; ++start) {
// checking first char before the whole needle is an optimization
if ((*start == *needle) && strncmp(start, needle, needleLength) == 0) {
return start;
}
}
return NULL;
}
As for whether haystack should be as non-const, that's how it's defined in FreeBSD which I assume is what you're shooting for. It is odd that it returns non-const and the result is a pointer to or inside of haystack. Thing is, if you are searching a variable that is const, then passing it in as non-const is weird (wrong-ish). Maybe better to make both haystack and the result const. And if the caller knows it's not actually const to typecast to non-const. ... FWIW I'm thinking in C right now. Might be different considerations in the C++ world.
The test program doesn't seem to have been reviewed yet.
A few minor problems are easily fixed:
- There's no declaration of
size_t
- it's probably better to usestd::size_t
instead. - Prefer
nullptr
toNULL
, as the former is strongly-typed. - Don't use
std::endl
when there's no need to flush the stream. i
seems redundant in the loop - we could simplyfor (const char *pPattern = patterns[0]; pPattern; ++pPattern)
. Or even clearer, omit the final null pointer, and just use range-based loop:for (const char *pPattern: patterns)
.- There's no confirmation that the returned pointer (when not null) points within the argument string (as opposed, say, to returning the search pattern argument in some cases).
- It's lacking tests of some important cases (zero-length array and/or empty search string; inputs containing null characters).
The main problem with it, though, is that it doesn't actually test the function. It exercises it with a few values, but requires the user to determine whether the behaviour was correct or not. That's the kind of tedious activity that's best shifted to the machine, so make it return EXIT_SUCCESS
if the tests all return the expected results, and EXIT_FAILURE
otherwise.
-
\$\begingroup\$ fun and curious that this 7 years old post got this renewed interest. since then, I do make sure to use nullptr, however,
std::size_t
instead ofsize_t
still feels weird and unnecessary \$\endgroup\$CplusPuzzle– CplusPuzzle2023年10月10日 04:07:52 +00:00Commented Oct 10, 2023 at 4:07 -
\$\begingroup\$ Also, calling the usage example a "test program" was clearly a misnomer \$\endgroup\$CplusPuzzle– CplusPuzzle2023年10月10日 04:10:34 +00:00Commented Oct 10, 2023 at 4:10