I am building a Perl module and want it to use as few non-core dependencies as possible. If the following code fails, I will add a dependency on HTML::LinkExtor and be done with it, but I want to try first. All I want is to extract the href= attributes from <a> tags. I do it using Text::Balanced, which is core in modern Perls and installable on older ones. So yes, I know I should use an HTML library. That said, is this passably OK?
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use Text::Balanced qw/extract_bracketed extract_delimited extract_multiple/;
my $html = q#Some <a href=link>link text</a> stuff. And a little <A HREF="link2">different link text</a>.#;
my @tags = find_anchor_targets($html);
print Dumper \@tags;
sub find_anchor_targets {
    my $html = shift;
    my @tags = extract_multiple(
        $html,
        [ sub { extract_bracketed($_[0], '<>') } ],
        undef, 1
    );
    @tags =
        map { extract_href($_) } # find related href=
        grep { /^<a/i } # only anchor begin tags
        @tags;
    return @tags;
}
sub extract_href {
    my $tag = shift;
    if ($tag =~ /href=(?='|")/gci) {
        # quoted value: /gc leaves pos($tag) just before the opening quote
        my $text = scalar extract_delimited( $tag, q{'"} );
        my $delim = substr $text, 0, 1;
        $text =~ s/^$delim//;
        $text =~ s/$delim$//;
        return $text;
    } elsif ($tag =~ /href=(.*?)(?:\s|>)/i) {
        # unquoted value: runs up to whitespace or the closing '>'
        return $1;
    } else {
        return ();
    }
}
This dumps
$VAR1 = [
'link',
'link2'
];
which is what one would expect.
1 Answer
<script language="javascript">
var a='<a href="1" title="Passably ok, yes, why not. Perfect, no.">'
document.write('<a href="2" title="Real-world HTML is just really complicated.">')
</script>
<style type="text/css">
p { font-family: "<a href='3' title='...in so many ways'>" }
</style>
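To see why markup like this trips up tag-level extraction, here is a minimal sketch. `naive_hrefs` is a hypothetical regex-based stand-in, not the code from the question; the assumption is only that an extractor which looks at `<a ...>` text without tracking `<script>` and `<style>` context faces the same problem.

```perl
use strict;
use warnings;

# The markup from this answer, verbatim.
my $page = <<'HTML';
<script language="javascript">
var a='<a href="1" title="Passably ok, yes, why not. Perfect, no.">'
document.write('<a href="2" title="Real-world HTML is just really complicated.">')
</script>
<style type="text/css">
p { font-family: "<a href='3' title='...in so many ways'>" }
</style>
HTML

# Hypothetical stand-in extractor: finds quoted href values inside
# anything that looks like an <a ...> tag, ignoring document context.
sub naive_hrefs {
    my ($html) = @_;
    my @hrefs;
    while ($html =~ /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/gis) {
        push @hrefs, $2;
    }
    return @hrefs;
}

my @found = naive_hrefs($page);
print "false positive: $_\n" for @found;
```

All three values come from inside a JavaScript string or a CSS declaration, so none of them is an anchor in the parsed document.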
Thanks, point taken. In the days since I posted this I reworked my code to use HTML::LinkExtor if available and my own code otherwise. Further, once the links are extracted, they are filtered for a certain file name pattern (source files of C projects), so there isn't too much chance of false positives. Thanks for the good examples though!
– Joel Berger, Mar 19, 2012 at 14:33
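The "use HTML::LinkExtor if available" arrangement described in this comment could look roughly like the sketch below. It is not the commenter's actual code: `anchor_targets`, `_targets_via_linkextor`, and `_targets_via_fallback` are hypothetical names, and the fallback is a simple regex stand-in rather than the Text::Balanced code from the question.

```perl
use strict;
use warnings;

# Probe for the optional dependency once; a failed require is not fatal.
my $HAVE_LINKEXTOR = eval { require HTML::LinkExtor; 1 } ? 1 : 0;

# Hypothetical entry point: prefer the real parser, else fall back.
sub anchor_targets {
    my ($html) = @_;
    return $HAVE_LINKEXTOR
        ? _targets_via_linkextor($html)
        : _targets_via_fallback($html);
}

sub _targets_via_linkextor {
    my ($html) = @_;
    my @hrefs;
    # The callback receives the lowercased tag name and its link attributes.
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
    });
    $parser->parse($html);
    $parser->eof;
    return @hrefs;
}

# Stand-in fallback (an assumption for this sketch): handles double-quoted,
# single-quoted, and unquoted href values, with no context awareness.
sub _targets_via_fallback {
    my ($html) = @_;
    my @hrefs;
    while ($html =~ /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi) {
        push @hrefs, defined $1 ? $1 : defined $2 ? $2 : $3;
    }
    return @hrefs;
}
```

Calling `anchor_targets($html)` on the sample string from the question should give the same two targets on either path; the file name filter mentioned above would then be applied to the returned list.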