Trivial HTML extractor utility

by Dominus (Parson)
on Nov 22, 2007 at 06:17 UTC

I have a really useful trivial utility, called linkx, that is basically just a command-line wrapper around HTML::LinkExtor: you give it the name of an HTML file and it extracts and prints all the URLs in the file. I use this all the time:

#!/usr/bin/perl
use HTML::LinkExtor;
use Getopt::Std;
getopts('b:t:');

@ARGV = '-' unless @ARGV;
for my $file (@ARGV) {
  extract($file);
}

sub extract {
  my $file = shift;
  unless (open F, "< $file") {
    warn "Couldn't open file $file: $!; skipping\n";
    return;
  }
  my $p = HTML::LinkExtor->new(undef, $opt_b);
  while (read F, my $buf, 8192) {
    $p->parse($buf);
  }
  for my $ln ($p->links) {
    my @ln  = @$ln;
    my $tag = shift @ln;
    next if $opt_t && lc($opt_t) ne lc($tag);
    while (@ln) {
      shift @ln;              # attribute name
      my $url = shift @ln;    # attribute value: the URL
      print $url, "\n" unless $seen{$url}++;
    }
  }
}
You can tell this is really old because it uses two-argument open.

The -b base flag interprets all URLs relative to the base URL base and prints out the absolute versions. The -t tag flag restricts the program to only printing out URLs that appear in that kind of entity, instead of all links. I had totally forgotten that the -t feature was in there. I wonder if it's useful?
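
For example (the file name and base URL here are made up, just to show the two flags):

linkx page.html                            # print every URL in page.html
linkx -b http://example.com/ page.html     # resolve relative URLs against that base
linkx -t img page.html                     # print only URLs that appear in <img> tags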

Anyway, that's not what I wanted to write about. I also have a program that extracts referrer URLs from my web logs, and today I noticed a bunch of incoming links from Reddit. Most of these I thought I had probably seen before, but I wasn't sure. I thought if I could see the titles of the pages I would know. So I tried to get the titles:

for i in `cat reddit`; do
  GET $i | grep -i title
done
hoping that the title element would be alone on a line. (GET is a utility that comes with Perl's LWP suite; you give it a URL and it fetches the document and prints it.)

This was a complete failure. Not only were the title elements not alone on their own lines, it seems that Reddit pages don't have any line breaks at all. The output was a big mess, and I didn't look into it in detail once I saw that this approach was a flop.

So I wrote the following item, htmlx, which solved the problem:

#!/usr/bin/perl
use HTML::TreeBuilder;

my @tags = @ARGV;
my $tree = HTML::TreeBuilder->new;   # empty tree
$tree->parse_file(\*STDIN);

my @elements = $tree->find(@tags);
for (@elements) {
  my $s = $_->as_text;
  $s =~ tr/\n/ /;
  print "$s\n";
}
You give this a tag name, and then it reads HTML from standard input and prints the contents of all the entities with this tag. My Reddit searcher that didn't work became:

for i in `cat reddit`; do
  GET $i | htmlx title
done
which did work. Hooray!

I expect that this will be useful for other stuff, but I'm not sure yet what. If I never find another use for it except as part of a GET url | htmlx title pipeline, I'll probably demote it from htmlx to just TITLE, but it's too soon to tell if that's a good idea.

I hope this is useful for someone. I hereby place all code in this post in the public domain, yadda yadda yadda. Share and enjoy!

Re: Trivial HTML extractor utility
by hossman (Prior) on Nov 22, 2007 at 15:02 UTC

    If you used HTML::TreeBuilder::XPath it would be even more powerful. Instead of passing in a tag on the command line you could pass in an xpath expression. XPath would let you fetch attribute values, or only instances of tags that are children of certain other tags, or have particular attributes, etc....
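
    Something like this, say -- a rough, untested sketch of an XPath-taking htmlx, with the output handling guessed to match the original:

    #!/usr/bin/perl
    use HTML::TreeBuilder::XPath;    # HTML::TreeBuilder plus XPath methods

    my $xpath = shift @ARGV;
    my $tree  = HTML::TreeBuilder::XPath->new;
    $tree->parse_file(\*STDIN);

    for my $node ($tree->findnodes($xpath)) {
        # string_value is the text of an element, or the value of an attribute node
        my $s = $node->string_value;
        $s =~ tr/\n/ /;
        print "$s\n";
    }

    With that, GET url | htmlx '//title' does what the original does, and GET url | htmlx '//a/@href' pulls out link targets instead.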

    If it wasn't for the "-b" option, your linkx script could just be an alias to htmlx with some args.

      If you used HTML::TreeBuilder::XPath it would be even more powerful.
      Not for me; I don't know how to write an xpath expression.

      Seriously, I think it's really interesting how we seem to have completely different outlooks on this. I was worried that the program was already excessively general and overfeaturized. I've only used it once, to extract titles, and I was considering getting rid of the command-line argument, downgrading it to a program that does nothing but extract titles. Meanwhile you, who have used it even less than I have, want to enhance it to do all sorts of other stuff.

      Maybe you have some application in mind for some of that fancy stuff, but you didn't say you did, and you didn't give an example, so I wonder what value you see in enhancing the features of a program that already has way more features than have ever been used.

      Please take this as a serious question, not as rhetoric.

        If you used HTML::TreeBuilder::XPath it would be even more powerful.
        Not for me; I don't know how to write an xpath expression.
        You should really give it a try, it's one of the few fine things coming from the XML world. I once wrote a utility called xmlgrep, which uses XPath expressions for extracting things from HTML or XML files. For extracting links one would write:
        GET http://www.perlmonks.org | xmlgrep -parse-html '//a/@href'
        but you can also add additional conditions, for example extract only absolute links:
        GET http://www.perlmonks.org | xmlgrep -parse-html '//a/@href[contains(.,"http://")]'
        Not for me; I don't know how to write an xpath expression.
        ...
        I wonder what value you see in enhancing the features of a program that already has way more features than have ever been used.

        Fair enough ... but I suspect if you knew XPath my comment would make more sense.

        You strike me as the kind of guy who whips up little scripts to solve problems a lot -- heck, anyone who uses Perl on a regular basis becomes that kind of person if they weren't already. As you say: right now it's got a feature you've never used (the ability to pick an arbitrary tag name at run time) and if you never use the script again, oh well ... it's not like it took you a lot of work to code it, right? But if at some point in your life you think "I need to get the <h1> tags out of all these HTML pages", you might remember your handy script, use it, and then realize what you really want is the *first* <h1> out of all the files, and you'd probably add a quick option to let you pick the first instance. Then maybe 6 months later you're crunching some more HTML files and you want the "content" attribute of any <meta http-equiv="refresh" ... > tags ... so you crank out another little script.

        Or, if you know XPath, the first time you need something a little more complicated than just all values of all the tags with a certain name, you add about 12 characters to your current script, and start passing some simple XPath expressions on the command line.

        Or you don't.

        Like you say: it's a trivial utility ... if it does everything you want then call it a day and go fishing. To answer your specific question: The value I see in enhancing it comes from the ability to gain a large amount of additional functionality from a small amount of additional work.
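
        For what it's worth, the two hypothetical cases above would come out to something like this with an XPath-capable htmlx (expressions off the top of my head, not tested):

        # the first <h1> in the page
        GET http://example.com/page.html | htmlx '(//h1)[1]'
        # the "content" attribute of any <meta http-equiv="refresh"> tags
        GET http://example.com/page.html | htmlx '//meta[@http-equiv="refresh"]/@content'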

Re: Trivial HTML extractor utility
by brian_d_foy (Abbot) on Nov 22, 2007 at 19:57 UTC

    HTML::SimpleLinkExtor comes with linktractor which does the same thing as linkx. :)

    If you just want TITLE, here's the one that I use:

    #!/usr/bin/perl
    require HTML::HeadParser;

    local( $/ );

    foreach ( @ARGV ) {
        open my( $fh ), "<", $_ or do { warn "$!"; next };

        my $p = HTML::HeadParser->new;
        $p->parse( <$fh> );

        print "$_: ", $p->header( 'title' ), "\n";
    }
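
    Saved under some name and made executable (the script and file names below are just for illustration), it's used like:

    gettitle foo.html bar.html     # prints a "foo.html: Some Page Title" line per file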
    --
    brian d foy <brian@stonehenge.com>
    Subscribe to The Perl Review
      HTML::SimpleLinkExtor comes with linktractor which does the same thing as linkx.
      I had an idea a while back that every Perl module should come with at least one useful demo program. For example, Net::FTP should come with a command-line FTP client program. Text::Template would come with a program that fills a template with values specified on the command line, and prints the results. But the idea never got out of the wishful thinking stage.

      It's nice to know that someone wrote a replacement for HTML::LinkExtor that has an interface that (I presume) doesn't suck quite so hard. The amount of code I had to write for linkx was appalling.

      Thanks for the pointers.
