\$\begingroup\$

I am building a Perl module and trying to use as few non-core dependencies as possible. If the following code fails, I will add a dependency on HTML::LinkExtor and be done with it, but I want to try first. All I want is to extract the href attributes from <a> tags. I do it using Text::Balanced, which is core in modern Perls and installable on older ones. So yes, I know I should use an HTML library. That said, is this passably OK?

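#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use Text::Balanced qw/extract_bracketed extract_delimited extract_multiple/;

my $html = q#Some <a href=link>link text</a> stuff. And a little <A HREF="link2">different link text</a>.#;
my @tags = find_anchor_targets($html);
print Dumper \@tags;

# Return the href value of every <a> start tag found in $html.
sub find_anchor_targets {
    my $html = shift;
    # Collect every <...> chunk; the final 1 discards the text between them.
    my @tags = extract_multiple(
        $html,
        [ sub { extract_bracketed($_[0], '<>') } ],
        undef, 1
    );
    @tags =
        map  { extract_href($_) }  # find related href=
        grep { /^<a\b/i }          # only anchor start tags (not <abbr>, <address>, ...)
        @tags;
    return @tags;
}

# Return the href value of a single tag, or an empty list if it has none.
sub extract_href {
    my $tag = shift;
    if ($tag =~ /href=(?='|")/gci) {
        # Quoted value: /g leaves pos($tag) at the quote, where extract_delimited starts.
        my $text = scalar extract_delimited( $tag, q{'"} );
        return () unless defined $text and length $text;
        my $delim = substr $text, 0, 1;
        $text =~ s/^$delim//;
        $text =~ s/$delim$//;
        return $text;
    } elsif ($tag =~ /href=(.*?)(?:\s|>)/i) {
        # Unquoted value: runs up to the next whitespace or the closing >.
        return $1;
    } else {
        return ();
    }
}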

This dumps

$VAR1 = [
 'link',
 'link2'
 ];

which is what one would expect.

asked Feb 21, 2012 at 20:30
\$\endgroup\$

1 Answer

\$\begingroup\$
<script language="javascript">
var a='<a href="1" title="Passably ok, yes, why not. Perfect, no.">'
document.write('<a href="2" title="Real-world HTML is just really complicated.">')
</script>
<style type="text/css">
p { font-family: "<a href='3' title='...in so many ways'>" }
</style>
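
None of the three anchors above is a real link, yet anything that only scans for `<a ...>` will report all of them. Here is a minimal sketch of that. It assumes a simplified regex scan as a stand-in for the question's Text::Balanced extractor. The helper names `naive_hrefs` and `strip_script_and_style`, and the extra `real.c` anchor, are made up for illustration. Dropping script and style blocks first is one possible mitigation, not a complete one: a `</script>` inside a JavaScript string, or markup inside an HTML comment, would still fool it.

```perl
use strict;
use warnings;

# The snippets above, plus one genuine anchor (an assumption added for contrast).
my $html = <<'HTML';
<script language="javascript">
var a='<a href="1" title="Passably ok, yes, why not. Perfect, no.">'
document.write('<a href="2" title="Real-world HTML is just really complicated.">')
</script>
<style type="text/css">
p { font-family: "<a href='3' title='...in so many ways'>" }
</style>
<a href="real.c">real link</a>
HTML

# Hypothetical stand-in for the question's extractor:
# report the href of anything that looks like an <a ...> start tag.
sub naive_hrefs {
    my $text = shift;
    my @found;
    while ($text =~ /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi) {
        push @found, $1 // $2 // $3;   # exactly one alternative matched
    }
    return @found;
}

# Remove <script>...</script> and <style>...</style> blocks before scanning.
sub strip_script_and_style {
    my $text = shift;
    $text =~ s{<(script|style)\b[^>]*>.*?</\1\s*>}{}gis;
    return $text;
}

my @naive    = naive_hrefs($html);
my @filtered = naive_hrefs(strip_script_and_style($html));

print "naive:    @naive\n";      # naive:    1 2 3 real.c
print "filtered: @filtered\n";   # filtered: real.c
```
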
answered Mar 19, 2012 at 5:53
\$\endgroup\$
  • Thanks, point taken. In the days since I posted this, I reworked my code to use HTML::LinkExtor if it is available and my own code otherwise. Further, once the links are extracted, they are filtered for a certain file name pattern (source files of C projects), so there isn't much chance of false positives. Thanks for the good examples though! Commented Mar 19, 2012 at 14:33
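
The arrangement described in this comment could look roughly like the sketch below. It is not the asker's actual code. `fallback_links` is a hypothetical regex stand-in for the hand-rolled extractor, and the `\.[ch]$` pattern is only a guess at what "source files of C projects" means. The HTML::LinkExtor part uses its documented callback interface, which passes the lowercase tag name followed by attribute name/value pairs.

```perl
use strict;
use warnings;

# Hypothetical fallback, standing in for the hand-rolled extractor.
sub fallback_links {
    my $html = shift;
    my @links;
    while ($html =~ /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi) {
        push @links, $1 // $2 // $3;
    }
    return @links;
}

# Use HTML::LinkExtor when it can be loaded, otherwise fall back.
sub extract_links {
    my $html = shift;
    if (eval { require HTML::LinkExtor; 1 }) {
        my @links;
        my $parser = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            push @links, "$attr{href}" if $tag eq 'a' && defined $attr{href};
        });
        $parser->parse($html);
        $parser->eof;
        return @links;
    }
    return fallback_links($html);
}

my $html = q{<a href="main.c">x</a> <a href=util.h>y</a> <a href='README'>z</a>};

# Assumed file name pattern: keep only .c and .h files.
my @c_sources = grep { /\.[ch]$/i } extract_links($html);

print "@c_sources\n";   # main.c util.h
```

Loading the module inside `eval { require ...; 1 }` keeps it an optional dependency: if it is missing, the `require` failure is caught and the fallback runs instead.
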
