Converting IDN domains to Punycode in Perl

Question 1

Description

This script reads domain names from STDIN, one per line, and converts Unicode (IDN) domains to Punycode.

Features

Domains that throw an error are ignored.
ASCII domains pass through unchanged.

convert.pl

#!/usr/bin/perl -Wn
use strict;
use Try::Tiny;
use Net::IDN::Encode ':all';
use open ':std', ':encoding(UTF-8)';
try {
 chomp $_;
 printf "%s\n",domain_to_ascii $_;
}

Sample Input:

дольщикиспб.рф
шляхтен.рф
สารสกัดจากสมุนไพร.com
google.com

Sample Output:

xn--90afmajeumr0f6a.xn--p1ai
xn--e1alhsoq4c.xn--p1ai
xn--12cau1c1a4atlh5dbe1gkg3hzj.com
google.com

I'm open to any feedback!

Question 2

"I'm wondering if it would be more efficient to check if a domain is unicode or not..." The function domain_to_ascii already does that check, see the source line 46.

Question 3

"Any domains that convert to punycode >255 characters and throw an error get ignored" Where did you get the number 255 from? I could not find that limit in the source.

Question 4

@HåkonHægland Yeah I noticed that too. I read somewhere the standard is that punycode domains <255 characters long were valid; can't remember where though lol
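
For reference, the limits come from the DNS specification (RFC 1035) rather than from Net::IDN::Encode: a label may be at most 63 octets, and a full name at most 255 octets in wire format, which works out to 253 characters in the usual dotted text form without a trailing dot. Below is a minimal sketch of that check on a name that has already been converted to ASCII. The function name dns_length_ok is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if an ASCII (already Punycode-encoded)
# domain name fits the DNS length limits from RFC 1035, otherwise 0.
sub dns_length_ok {
    my ($domain) = @_;
    $domain =~ s/\.\z//;                    # ignore one trailing root dot
    return 0 if length($domain) == 0;
    return 0 if length($domain) > 253;      # 255 octets in wire format
    for my $label ( split /\./, $domain, -1 ) {
        my $len = length $label;
        return 0 if $len < 1 || $len > 63;  # empty or oversized label
    }
    return 1;
}

for my $name ( 'xn--e1alhsoq4c.xn--p1ai', ( 'a' x 64 ) . '.com' ) {
    printf "%s\n", dns_length_ok($name) ? 'ok' : 'rejected';
}
```

The 64-character label is rejected, which matches the "label too long (max 63 characters)" case in the answer's test file.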

Answer

Some small comments. The program uses this shebang line:

#!/usr/bin/perl -Wn

The shebang is used when the script is run as a command from the shell. In this case /usr/bin/perl runs the script; this is the so-called system perl that ships with a Unix-like operating system. However, users often install other perl executables alongside the system perl, for example with perlbrew, and would then like to run your script with their current choice of Perl interpreter, whether that is the system perl or a perlbrew-installed one. Typically, the user sets the PATH environment variable so that the shell finds the correct perl. The shebang line can follow the same lookup if you change it to

#!/usr/bin/env perl

Now the script is more portable, since it adapts to the current user's settings. However, there is one complication: arguments cannot be passed to perl portably in the shebang line when using /usr/bin/env. You are trying to pass the options -Wn to perl, but that will not work on every system; see Why am I able to pass arguments to /usr/bin/env in this case?.

Luckily, it is seldom necessary to pass arguments to perl in the shebang line. Both -W and -n are better enabled from within the Perl script itself. Instead of passing -W to perl, use the warnings pragma. Similarly, the -n option wraps your script in a loop that reads input line by line, which is easy to write explicitly in the script.
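
As a rough sketch of what that replacement could look like: use warnings takes the place of -W (with the difference that the pragma is lexically scoped, whereas -W forces warnings on everywhere, including in loaded modules), and an explicit while loop takes the place of -n. The helper name for_each_line and the callback are placeholders for illustration; in the real script the callback would call domain_to_ascii and the handle would be STDIN.

```perl
#!/usr/bin/env perl
use strict;
use warnings;    # replaces -W on the shebang line

# Hypothetical helper: the explicit form of the loop that -n wraps
# around a script. Calls $callback once per chomped input line.
sub for_each_line {
    my ( $fh, $callback ) = @_;
    while ( defined( my $line = <$fh> ) ) {
        chomp $line;
        $callback->($line);
    }
    return;
}

# Demo on an in-memory handle; the real script would pass \*STDIN and
# convert each line instead of collecting it.
open my $in, '<', \"google.com\nexample.org\n" or die "open: $!";
my @seen;
for_each_line( $in, sub { push @seen, $_[0] } );
print "$_\n" for @seen;
```

Keeping the loop in a named sub also makes it easy to call from a test with any filehandle.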

Another thing that could help document your program (and thus make it easier to maintain) is to include some unit tests that describe its expected behavior. For example:

p.pl:

#! /usr/bin/env perl
use feature qw(say);
use open ':std', ':encoding(UTF-8)';
use warnings;
use strict;
use Try::Tiny;
use Net::IDN::Encode 'domain_to_ascii';

# Written as a modulino: See Chapter 17 in "Mastering Perl". Executes main() if
# run as script, otherwise, if the file is imported from the test scripts,
# main() is not run.
main() unless caller;

sub main {
    while (<>) {
        my $line = parse_line($_);
        # Skip (rather than stop at) domains that fail to convert.
        next if !defined $line;
        say $line;
    }
}

# Returns the Punycode form of a domain, or undef if the conversion fails.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my $result = try {
        domain_to_ascii( $line );
    };
    return $result;
}

t/main.t:

use strict;
use warnings;
use utf8;
use open ':std', ':encoding(utf-8)';
use Test2::V0;
use lib '.';
require "p.pl";

{
    subtest "basic" => \&basic;
    subtest "fails" => \&fails;
    # TODO: Complete the test suite..
    done_testing;
}

sub basic {
    my @data = (
        ['дольщикиспб.рф', 'xn--90afmajeumr0f6a.xn--p1ai'],
        ['สารสกัดจากสมุนไพร.com', 'xn--12cau1c1a4atlh5dbe1gkg3hzj.com'],
        ['шляхтен.рф', 'xn--e1alhsoq4c.xn--p1ai'],
        ['google.com', 'google.com'],
    );
    my $i = 1;
    for my $item (@data) {
        my ($input, $output) = @$item;
        is(parse_line($input), $output, "basic $i");
        $i++;
    }
}

sub fails {
    is(parse_line("...."), U(), "empty label");
    is(parse_line("1234567890123456789012345678901234567890123456789012345678901234"), U(), "label too long (max 63 characters)");
}

You can run the tests like this:

$ prove t
t/main.t .. ok 
All tests successful.
Files=1, Tests=2, 0 wallclock secs ( 0.01 usr 0.00 sys + 0.07 cusr 0.01 csys = 0.09 CPU)
Result: PASS

or like this:

$ perl t/main.t 
# Seeded srand with seed '20210711' from local date.
ok 1 - basic {
 ok 1 - basic 1
 ok 2 - basic 2
 ok 3 - basic 3
 ok 4 - basic 4
 1..4
}
ok 2 - fails {
 ok 1 - empty label
 ok 2 - label too long (max 63 characters)
 1..2
}
1..2

Håkon Hægland · Accepted Answer · 2021-07-11 20:15:26Z

