Using sed regular expression to extract domain name from file

Question 1

I'm learning regex with sed and am trying to extract the last field from a file named "test". The method I'm trying gives the desired output. Please suggest whether this is an effective way of doing it. Also, when should we use the "-e" option with sed? (Please give an example; I couldn't find any.)

~# ] cat test
example.com. 4 IN NS b.iana-servers.net.
50times.com. 21556 IN NS ns1.50times.com.
example.com. 4 IN NS a.iana-servers.net.
~# ] cat test | sed -r 's/^[[:alnum:]]*.[[:alnum:]]*.?[a-z]*.[[:blank:]]+[0-9]+[[:blank:]]+IN[[:blank:]]+[A-Z]+[[:blank:]]+//g' | sed -r 's/\.*.$//'
b.iana-servers.net
ns1.50times.com
a.iana-servers.net

Answer

From the GNU sed documentation:

If no -e, -f, --expression, or --file options are given on the command-line, then the first non-option argument on the command line is taken to be the script to be executed.

Each of your two sed commands has one non-option argument, which gets treated as the script. It would be better practice to always explicitly put a -e in front of the script. Then you can write the command this way, as just one command instead of a pipeline:

sed -r -e 's/^[[:alnum:]]*.[[:alnum:]]*.?[a-z]*.[[:blank:]]+[0-9]+[[:blank:]]+IN[[:blank:]]+[A-Z]+[[:blank:]]+//g' \
 -e 's/\.*.$//' test
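As a smaller illustration of the same idea, here is a minimal sketch on a made-up input line (the text "hello world" and the variable names are assumptions, not part of the question). It shows that several -e scripts are applied in order to each line, and that this gives the same result as one script with the commands separated by a semicolon.

```shell
# Two scripts given with -e are applied in order to every input line.
two_e=$(printf 'hello world\n' | sed -e 's/hello/goodbye/' -e 's/world/moon/')

# The same two substitutions written as a single script, separated by a semicolon.
one_script=$(printf 'hello world\n' | sed 's/hello/goodbye/; s/world/moon/')

printf '%s\n%s\n' "$two_e" "$one_script"
```

Both forms should print the same line, so the choice between them is mostly a matter of readability.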

It looks like you are attempting to craft the first regex to validate each column, checking that the first column looks like a domain ending with a dot ([[:alnum:]]*.[[:alnum:]]*.?[a-z]*.), the second column looks like an integer ([0-9]+), the third column is IN, and the fourth column is a record type ([A-Z]+).

The regex for the first column probably doesn't work the way you expect. Each . means "match any character"; it does not mean "match a dot character". To match a dot character, you would write \. instead.
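The difference can be checked with a small sketch. The two input lines "a.b" and "axb" and the variable names are made up for illustration.

```shell
# An unescaped dot matches any single character, so both lines are selected.
any_char=$(printf 'a.b\naxb\n' | sed -n '/^a.b$/p')

# An escaped dot matches only a literal dot, so only "a.b" is selected.
literal_dot=$(printf 'a.b\naxb\n' | sed -n '/^a\.b$/p')

printf 'unescaped:\n%s\nescaped:\n%s\n' "$any_char" "$literal_dot"
```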

If you just want to extract the last column without validation, and suppressing the trailing dot, you could just write instead:

sed -e 's/.*[ \t]\([^ \t]*\)\.$/\1/' test

[^ \t]*\.$ should match the last column ("all non-space characters followed by a dot at the end of the line"). The parentheses capture everything except the trailing dot. \1 is a backreference referring to the first and only captured group.
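A quick way to try this is the sketch below, which recreates the sample data from the question in a temporary file. It assumes GNU sed, since \t inside a bracket expression is understood as a tab only by some implementations; the variable names are illustrative.

```shell
# Recreate the sample file from the question in a temporary location.
sample=$(mktemp)
printf '%s\n' \
  'example.com. 4 IN NS b.iana-servers.net.' \
  '50times.com. 21556 IN NS ns1.50times.com.' \
  'example.com. 4 IN NS a.iana-servers.net.' > "$sample"

# Replace each line with the captured last column, minus its trailing dot.
# Assumption: GNU sed, which treats \t inside [...] as a tab.
last_column=$(sed -e 's/.*[ \t]\([^ \t]*\)\.$/\1/' "$sample")

printf '%s\n' "$last_column"
rm -f "$sample"
```

With GNU sed this should print the three name server names from the question without the trailing dots.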

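I've opted to stick to basic regular expressions so that the command does not need the -r option, which is a GNU extension and makes your command less portable. Note that [[:blank:]] is a POSIX character class and works in basic regular expressions as well, whereas \t inside a bracket expression is itself a GNU sed extension, so [[:blank:]] is actually the more portable way to match a space or tab.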

score 3 · Accepted Answer · 2015-07-16 09:04:43Z

When processing tabular data in columns, awk is often a more appropriate tool to use. The equivalent command would be

awk '{ sub(/\.$/, "", $NF); print $NF }' test

... which I think is more readable.

Explanation:

NF is the number of fields: for this text, 5.
$NF is the content of the last (5th) field.
sub(/\.$/, "", $NF) strips the trailing dot from the last field. The pattern is written as a regex constant so that \. reliably means a literal dot.
{ commands } executes the commands for every line in the file.
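To see NF and $NF in action, here is a small sketch that feeds the sample lines from the question to awk through a pipe instead of reading the file. It writes the pattern as a regex constant, /\.$/; the variable names are illustrative.

```shell
# Sample records from the question, one per line.
records=$(printf '%s\n' \
  'example.com. 4 IN NS b.iana-servers.net.' \
  '50times.com. 21556 IN NS ns1.50times.com.' \
  'example.com. 4 IN NS a.iana-servers.net.')

# NF is the number of whitespace-separated fields on each line.
field_counts=$(printf '%s\n' "$records" | awk '{ print NF }')

# Strip the trailing dot from the last field, then print that field.
last_fields=$(printf '%s\n' "$records" | awk '{ sub(/\.$/, "", $NF); print $NF }')

printf 'field counts:\n%s\nlast fields:\n%s\n' "$field_counts" "$last_fields"
```

Because awk splits on runs of blanks by default, this keeps working if the columns are separated by tabs or by several spaces.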
