perlmeditation
Dominus
I have a really useful trivial utility, called <tt>linkx</tt>, that is basically just a command-line wrapper around <tt>HTML::LinkExtor</tt>: you give it the name of an HTML file and it extracts and prints all the URLs in the file. I use this all the time:<p>
<code>
#!/usr/bin/perl
use HTML::LinkExtor;
use Getopt::Std ;
getopts('b:t:');
@ARGV = '-' unless @ARGV;
for my $file (@ARGV) {
  extract($file);
}
sub extract {
  my $file = shift;
  unless (open F, "< $file") {
    warn "Couldn't open file $file: $!; skipping\n";
    return;
  }
  my $p = HTML::LinkExtor->new(undef, $opt_b);
  while (read F, my $buf, 8192) {
    $p->parse($buf);
  }
  for my $ln ($p->links) {
    my @ln = @$ln;
    my $tag = shift @ln;
    next if $opt_t && lc($opt_t) ne lc($tag);
    while (@ln) {
      shift @ln;
      my $url = shift @ln;
      print $url, "\n" unless $seen{$url}++;
    }
  }
}
</code>
You can tell this is really old because it uses two-argument <tt>open</tt>. <p>
The <tt>-b <i>base</i></tt> flag interprets all URLs relative to the base URL <i>base</i> and prints the absolute versions. The <tt>-t <i>tag</i></tt> flag restricts the output to URLs that appear in that kind of element, instead of all links. I had totally forgotten that the <tt>-t</tt> feature was in there. I wonder if it's useful? <p>
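To make the two flags concrete, here is a minimal sketch of the same logic run on an in-memory string instead of a file. It is not part of <tt>linkx</tt>: the function name <tt>links_in</tt>, the sample HTML, and the <tt>example.com</tt> base are all made up for illustration.<p>
<code>
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# links_in is a hypothetical helper, not part of linkx.
# It returns the URLs found in $html, made absolute against $base
# if one is given (like -b) and limited to $only_tag if one is
# given (like -t).
sub links_in {
  my ($html, $base, $only_tag) = @_;
  my $p = HTML::LinkExtor->new(undef, $base);
  $p->parse($html);
  $p->eof;
  my (@urls, %seen);
  for my $ln ($p->links) {
    # Each entry looks like [ tag, attr => url, attr => url, ... ]
    my ($tag, %attr) = @$ln;
    next if $only_tag && lc($only_tag) ne lc($tag);
    for my $name (sort keys %attr) {
      my $url = "$attr{$name}";   # stringify, in case it is a URI object
      push @urls, $url unless $seen{$url}++;
    }
  }
  return @urls;
}

# Illustrative input only
my $html = '<a href="/a.html">A</a> <img src="pic/b.png"> <a href="/a.html">again</a>';
print "$_\n" for links_in($html, 'http://example.com/dir/', 'a');
</code>
With the base and the tag filter both supplied, this should print only <tt>http://example.com/a.html</tt>, once: the <tt>img</tt> link is filtered out and the repeated <tt>href</tt> is suppressed. Dropping the third argument should bring the image URL back.<p>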
Anyway, that's not what I wanted to write about. I also have a program that extracts referrer URLs from my web logs, and today I noticed a bunch of incoming links from Reddit. Most of these I thought I had probably seen before, but I wasn't sure. I thought if I could see the titles of the pages I would know. So I tried to get the titles:<p>
<code>
for i in `cat reddit`; do
GET $i | grep -i title
done
</code>
hoping that the <tt>title</tt> element would be alone on a line. (<tt>GET</tt> is a utility that comes with Perl's <tt>LWP</tt> suite; you give it a URL and it fetches the document and prints it.)<p>
This was a complete failure. Not only were the <tt>title</tt> elements not alone on their own lines, it seems that Reddit pages don't have any line breaks at all. The output was a big mess, and I didn't look into it in detail once I saw that this approach was a flop.<p>
So I wrote the following item, <tt>htmlx</tt>, which solved the problem:<p>
<code>
#!/usr/bin/perl
use HTML::TreeBuilder;
my @tags = @ARGV;
my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file(\*STDIN);
my @elements = $tree->find(@tags);
for (@elements) {
  my $s = $_->as_text;
  $s =~ tr/\n/ /;
  print "$s\n";
}
</code>
You give this one or more tag names, and it reads HTML from standard input and prints the text contents of every element with one of those tags. My Reddit searcher that didn't work became:<p>
<code>
for i in `cat reddit`; do
GET $i | htmlx title
done
</code>
which did work. Hooray!<p>
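The reason <tt>htmlx</tt> succeeds where <tt>grep</tt> failed is that it works on the parsed tree rather than on lines, so it doesn't care whether the page has any line breaks. Here is a small sketch of that, assuming a made-up one-line page and a hypothetical helper named <tt>text_of</tt> that does the same thing as <tt>htmlx</tt> but on a string:<p>
<code>
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# text_of is a hypothetical helper, not part of htmlx.
# It returns the text of every element in $html whose tag name
# is one of @tags, with newlines turned into spaces.
sub text_of {
  my ($html, @tags) = @_;
  my $tree = HTML::TreeBuilder->new;
  $tree->parse($html);
  $tree->eof;
  my @texts;
  for my $elem ($tree->find(@tags)) {
    my $s = $elem->as_text;
    $s =~ tr/\n/ /;
    push @texts, $s;
  }
  $tree->delete;   # free the tree explicitly
  return @texts;
}

# Illustrative input only: the whole page is a single line
my $page = '<html><head><title>Some story</title></head><body><h1>Big</h1><p>text</p></body></html>';
print "$_\n" for text_of($page, 'title', 'h1');
</code>
A line-oriented search for "title" on <tt>$page</tt> would return the entire page, since it is all one line. The sketch should instead print <tt>Some story</tt> and <tt>Big</tt> on separate lines.<p>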
I expect that this will be useful for other stuff, but I'm not sure yet what. If I never find another use for it except as part of a <tt>GET <i>url</i> | htmlx title</tt> pipeline, I'll probably demote it from <tt>htmlx</tt> to just <tt>TITLE</tt>, but it's too soon to tell if that's a good idea.<p>
I hope this is useful for someone. I hereby place all code in this post in the public domain, yadda yadda yadda. Share and enjoy!<p>
<div class="pmsig"><div class="pmsig-3737">
<p>
--<br><font size="-2">
<a href="mailto:mjd-www-perlmonks+@plover.com">Mark Dominus</a><br>
<a href="http://perl.plover.com">Perl Paraphernalia</a><br></font>
</div></div>