sed in makefile is not working as expected when using regex

Question 1

I have a Makefile with this command, which converts folder names in ./cmd/ from snake_case to PascalCase:

test:
 @for f in $(shell ls ./cmd/); do \
 echo $${f}; \
 echo $${f} | sed -r 's/(^|_)([a-z])/\U\2/g'; \
 done

What I get when I run it has a literal uppercase U prefixed to each word:

api_get_manual
UapiUgetUmanual

And what I expect to get:

ApiGetManual

Question 2

You expect to be using GNU sed, but the default sed on your system is not GNU. What Unix are you running this on?

Question 3

@Kusalananda, still it doesn't complain about that other -r GNU extension. Maybe that's toybox (Android) or busybox which both copied GNU's -r but not \U.

Question 4

Please edit your question to show the output of running echo 'api_get_manual' | sed -r 's/(^|_)([a-z])/\U\2/g' outside of the Makefile. Also add the output of sed --version.

Question 5

@EdMorton I'm running this on macOS with zsh. The output of that is exactly the same as the one shown in the example. On macOS it is not possible to check sed --version: stackoverflow.com/a/37639736/4886775
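
Since sed --version is not available there, one way to find out what a given sed does is to probe the feature itself. A minimal sketch, assuming that a sed without the \U extension either leaves a literal U in the output (as in the question) or rejects the script; sed_supports_upper is a made-up name, not an existing tool:

```shell
# sed_supports_upper is a hypothetical helper name used for illustration only.
# It succeeds only when sed turns "a" into "A", i.e. when \U is understood
# as "convert to uppercase" rather than as a literal U.
sed_supports_upper() {
  [ "$(printf 'a\n' | sed 's/a/\U&/' 2>/dev/null)" = "A" ]
}

if sed_supports_upper; then
  printf '%s\n' 'this sed understands \U'
else
  printf '%s\n' 'this sed does not understand \U'
fi
```

Testing the behaviour rather than the version string also covers implementations that copied some GNU options but not others.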

Question 6

If your input can contain any other characters than underscores and lower case letters then please edit your question to include those in your example so we can see how they should be handled, e.g. should input like this_7, _That, and foo:bar become this7, That, and FooBar or something else?

Question 11

Your comment tells us that:

You aren't using GNU sed, which is required for \U.
Your problem has nothing to do with calling sed from a Makefile since you get the same behavior just calling sed directly on the command line.

Instead of relying on GNU sed you could do this using any awk in any shell on every Unix box:

$ echo 'api_get_manual' |
awk '{
 r = "_" $0
 while ( match(r, /_[a-z]/) ) {
 r = substr(r,1,RSTART-1) toupper(substr(r,RSTART+1,1)) substr(r,RSTART+RLENGTH)
 }
 sub(/^_/, "", r)
 print r
}'
ApiGetManual

Here's the above running on some input that's not covered by the example in the question so you can decide if the output is desirable or not:

$ cat file
api_get_manual
this_7
_That
foo:bar
foo.pdf
bar.c

awk '{
 r = "_" $0
 while ( match(r, /_[a-z]/) ) {
 r = substr(r,1,RSTART-1) toupper(substr(r,RSTART+1,1)) substr(r,RSTART+RLENGTH)
 }
 sub(/^_/, "", r)
 print $0 "\t-> " r
}' file
api_get_manual -> ApiGetManual
this_7 -> This_7
_That -> _That
foo:bar -> Foo:bar
foo.pdf -> Foo.pdf
bar.c -> Bar.c

To use either of the above in a Makefile, $0 needs to become $$0 and the awk script has to logically all be on one line, so you need to add a couple of ;s and escape the newlines within the script, e.g. (untested):

awk '{ \
 r = "_" $$0; \
 while ( match(r, /_[a-z]/) ) { \
 r = substr(r,1,RSTART-1) toupper(substr(r,RSTART+1,1)) substr(r,RSTART+RLENGTH) \
 } \
 sub(/^_/, "", r); \
 print r \
}'

Question 12

Beware that on most systems, [a-z] matches hundreds of characters besides the 26 letters without diacritics as used in ASCII, some of which don't have an uppercase form, so you could run into an infinite loop here. Recent versions of GNU awk have switched back to [a-z] being the same as [abcdefghijklmnopqrstuvwxyz] regardless of the locale (while still honouring the locale for conversion from lower to upper case). See info gawk 'Ranges and Locales' for details.

Question 13

@StéphaneChazelas I understand the [a-z] matching issue, but I'm just trying to re-use the OP's code where I don't NEED to change it to address their specific problem, thereby hopefully making what NEEDS to change more obvious, which is why I didn't use [[:lower:]] instead. Meanwhile, the match() is searching for an underscore while the loop body removes each underscore found by the match(), so I'm not seeing how it could be an infinite loop.

Question 14

You're right, I had missed that it removed the underscores. At worst, it would remove some consecutive underscores.
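
To sidestep the [a-z] question raised in these comments, one option is to pin the locale to C for the awk call, where the range covers only the 26 ASCII lowercase letters. A sketch under that assumption (ASCII-only folder names); to_pascal is a hypothetical name and the awk body is the one from the answer above:

```shell
# to_pascal is a hypothetical helper name, not part of the answers above.
# LC_ALL=C makes [a-z] mean exactly abcdefghijklmnopqrstuvwxyz for this call.
to_pascal() {
  printf '%s\n' "$1" | LC_ALL=C awk '{
    # Prepend "_" so the first letter is handled like any other word start.
    r = "_" $0
    while (match(r, /_[a-z]/)) {
      # Drop the underscore and uppercase the letter that followed it.
      r = substr(r, 1, RSTART-1) toupper(substr(r, RSTART+1, 1)) substr(r, RSTART+RLENGTH)
    }
    # Remove the helper underscore if it was never consumed.
    sub(/^_/, "", r)
    print r
  }'
}

to_pascal api_get_manual
```

With the locale pinned, non-ASCII letters are left untouched rather than matched, which may or may not be what is wanted for a given set of folder names.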

score 6 · Accepted Answer · 2025-04-21 18:19:55Z

\U, like -r (for which -E is now the standard equivalent), is a non-standard extension of the GNU implementation of sed. It was inspired by ex/vi and is also found in perl, but not in many other sed implementations.

Here, instead, you could do:

SHELL = zsh
test:
 @for f (cmd/*(N:t)) print -rl -- $$f $${$${(C)f}//_}

Using:

cmd/*(N:t) to expand the glob in a Nullglob fashion, getting the tail of every expansion.
${(C)var} to capitalise words in the variable
${var//_} à la ksh to remove _ characters afterwards
print -rl -- to print raw on separate lines.

Note that file names are decoded into text and converted to uppercase as per the user's locale (LC_CTYPE category).

The above, for every sequence of one or more alphanumerical characters, converts the first character to uppercase, and all the rest to lower case and removes all underscores.

A closer match to your approach, which only removes the underscores that are followed by a lowercase letter (and converts only that letter to uppercase, leaving the rest alone):

SHELL = zsh
test:
 @set -o extendedglob; for f (cmd/*(N:t)) \
 print -rl -- $$f $${f//(#b)((#s)|_)([[:lower:]])/$$match[2]:u}

Where

(#b) is to activate back references, so capture groups can be referenced in the $match array in the replacement
(#s) for start, the equivalent of regex ^
[[:lower:]] matches any character classified as lowercase, like in regexps. Use [a-z] instead to restrict to those between a and z, which in zsh is done based on codepoint value, so limited to abcdefghijklmnopqrstuvwxyz
$var:u to convert to uppercase like in csh, honouring the locale.

Without zsh:

test:
 @CDPATH= cd cmd && \
 perl -le 'for (<*>) {print; s/[[:alnum:]]+/\u\L$$&/g; s/_//g; print}'

That assumes ASCII-only letters (stéphane would be changed to StéPhane, for instance, as é is not recognised as a letter).

Or like in your approach:

test:
 @CDPATH= cd cmd && \
 perl -le 'for (<*>) {print; s/(^|_)([a-z])/\U$$2/g; print}'

If limited to POSIX utilities, you could use awk to do the capitalising:

test:
 @CDPATH= cd cmd && awk -- ' \
 BEGIN {for (i = 1; i < ARGC; i++) { \
 arg = ARGV[i]; out = ""; \
 print arg; \
 while (match(arg, /[[:alnum:]]+/)) { \
 out = out \
 substr(arg, 1, RSTART - 1) \
 toupper(substr(arg, RSTART, 1)) \
 tolower(substr(arg, RSTART+1, RLENGTH - 1)); \
 arg = substr(arg, RSTART+RLENGTH)}; \
 out = out arg; \
 gsub("_", "", out); \
 print out \
 } \
 }' *

Like with zsh, it will honour the locale for decoding filenames as text, classifying characters as alnum and converting to uppercase.

To match your approach:

test:
 @CDPATH= cd cmd && awk -- ' \
 BEGIN {for (i = 1; i < ARGC; i++) { \
 arg = ARGV[i]; out = ""; x = 0; \
 print arg; \
 while (match(arg, (x++ ? "_" : "(^|_)") "[[:lower:]]")) { \
 out = out \
 substr(arg, 1, RSTART-1) \
 toupper(substr(arg, RSTART+RLENGTH-1, 1)); \
 arg = substr(arg, RSTART+RLENGTH)}; \
 out = out arg; \
 gsub("_", "", out); \
 print out \
 } \
 }' *

A few other notes:

Your $(shell ...) is expanded by make into code that is passed to the shell without any form of sanitisation, so it won't work for file names containing characters that are special in the shell's syntax, such as space, ;, *, ', etc. In fact that's a typical case of an arbitrary code execution vulnerability. But then again, when using make you have to give up any hope of doing anything safely or reliably. It should really only be used with strictly controlled data (here it may be fine if you can guarantee that the cmd directory will only contain the files that you expect it to).
echo can't be used for arbitrary data
In shells other than zsh, including sh, the default shell for make, parameter expansions must be quoted to prevent split+glob, so $${f} should be "$$f" (or "$${f}" if you prefer).
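
Those three notes can be combined into one sh loop that lets the shell expand a glob instead of parsing ls output, quotes every expansion, and prints with printf rather than echo. A sketch only; list_entries is a hypothetical name and ./cmd is the directory from the question:

```shell
# list_entries is a hypothetical helper name used for illustration.
# It prints the name of every non-hidden entry in the directory given as $1.
list_entries() {
  for path in "$1"/*; do
    # In sh an unmatched glob stays as the literal pattern, so skip
    # anything that does not actually exist (empty or missing directory).
    [ -e "$path" ] || [ -L "$path" ] || continue
    # Quoted expansion and printf keep spaces, * and backslashes intact.
    printf '%s\n' "${path##*/}"
  done
}

list_entries ./cmd
```

Inside a Makefile recipe, every $ here would have to be written as $$ and the lines joined with ; and backslashes, as in the recipes above.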

The awk script at least would remove underscores that don't precede lower case letters and would upper case letters that follow other non-alphanumeric chars than underscore. I don't know if the OP can have those cases nor, if so, how they'd want them handled so I asked in a comment.
@EdMorton, yes, all three would do that to mimic the Capitalisation parameter expansion flag of zsh and delete _ afterwards. Usual way to do snake to camel case, though for file names capitalising the extension may be undesirable.
Good point about a possible extension. I added a comment asking about that now too.
