Trivial HTML extractor utility

by Dominus (Parson)
on Nov 22, 2007 at 06:17 UTC

I have a really useful trivial utility, called linkx, that is basically just a command-line wrapper around HTML::LinkExtor: you give it the name of an HTML file and it extracts and prints all the URLs in the file. I use this all the time:

#!/usr/bin/perl
use HTML::LinkExtor;
use Getopt::Std;
getopts('b:t:');

@ARGV = '-' unless @ARGV;
for my $file (@ARGV) {
  extract($file);
}

sub extract {
  my $file = shift;
  unless (open F, "< $file") {
    warn "Couldn't open file $file: $!; skipping\n";
    return;
  }
  my $p = HTML::LinkExtor->new(undef, $opt_b);
  while (read F, my $buf, 8192) {
    $p->parse($buf);
  }
  for my $ln ($p->links) {
    my @ln  = @$ln;
    my $tag = shift @ln;
    next if $opt_t && lc($opt_t) ne lc($tag);
    while (@ln) {
      shift @ln;              # attribute name
      my $url = shift @ln;    # attribute value: the URL
      print $url, "\n" unless $seen{$url}++;
    }
  }
}
You can tell this is really old because it uses two-argument open.

The -b base flag interprets all URLs relative to the base URL base and prints out the absolute versions. The -t tag flag restricts the program to only printing out URLs that appear in that kind of entity, instead of all links. I had totally forgotten that the -t feature was in there. I wonder if it's useful?
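
For example (the file name and base URL here are made up, just to show the two flags):

linkx page.html                            # print every URL in page.html
linkx -b http://example.com/ page.html     # resolve relative URLs against that base
linkx -t img page.html                     # print only URLs that appear in <img> tags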

Anyway, that's not what I wanted to write about. I also have a program that extracts referrer URLs from my web logs, and today I noticed a bunch of incoming links from Reddit. Most of these I thought I had probably seen before, but I wasn't sure. I thought if I could see the titles of the pages I would know. So I tried to get the titles:

for i in `cat reddit`; do
  GET $i | grep -i title
done
hoping that the title element would be alone on a line. (GET is a utility that comes with Perl's LWP suite; you give it a URL and it fetches the document and prints it.)

This was a complete failure. Not only were the title elements not alone on their own lines, it seems that Reddit pages don't have any line breaks at all. The output was a big mess, and I didn't look into it in detail once I saw that this approach was a flop.

So I wrote the following item, htmlx, which solved the problem:

#!/usr/bin/perl
use HTML::TreeBuilder;

my @tags = @ARGV;
my $tree = HTML::TreeBuilder->new;   # empty tree
$tree->parse_file(\*STDIN);

my @elements = $tree->find(@tags);
for (@elements) {
  my $s = $_->as_text;
  $s =~ tr/\n/ /;
  print "$s\n";
}
You give this a tag name, and then it reads HTML from standard input and prints the contents of all the entities with this tag. My Reddit searcher that didn't work became:

for i in `cat reddit`; do
  GET $i | htmlx title
done
which did work. Hooray!

I expect that this will be useful for other stuff, but I'm not sure yet what. If I never find another use for it except as part of a GET url | htmlx title pipeline, I'll probably demote it from htmlx to just TITLE, but it's too soon to tell if that's a good idea.

I hope this is useful for someone. I hereby place all code in this post in the public domain, yadda yadda yadda. Share and enjoy!

Re: Trivial HTML extractor utility
by hossman (Prior) on Nov 22, 2007 at 15:02 UTC

    If you used HTML::TreeBuilder::XPath it would be even more powerful. Instead of passing in a tag on the command line you could pass in an xpath expression. XPath would let you fetch attribute values, or only instances of tags that are children of certain other tags, or have particular attributes, etc....
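
    Something like this, say -- a rough, untested sketch of an XPath-taking htmlx, with the output handling guessed to match the original:

    #!/usr/bin/perl
    use HTML::TreeBuilder::XPath;    # HTML::TreeBuilder plus XPath methods

    my $xpath = shift @ARGV;
    my $tree  = HTML::TreeBuilder::XPath->new;
    $tree->parse_file(\*STDIN);

    for my $node ($tree->findnodes($xpath)) {
        # string_value is the text of an element, or the value of an attribute node
        my $s = $node->string_value;
        $s =~ tr/\n/ /;
        print "$s\n";
    }

    With that, GET url | htmlx '//title' does what the original does, and GET url | htmlx '//a/@href' pulls out link targets instead.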

    If it wasn't for the "-b" option, your linkx script could just be an alias to htmlx with some args.

      If you used HTML::TreeBuilder::XPath it would be even more powerful.
      Not for me; I don't know how to write an xpath expression.

      Seriously, I think it's really interesting how we seem to have completely different outlooks on this. I was worried that the program was already excessively general and overfeaturized. I've only used it once, to extract titles, and I was considering getting rid of the command-line argument, downgrading it to a program that does nothing but extract titles. Meanwhile you, who have used it even less than I have, want to enhance it to do all sorts of other stuff.

      Maybe you have some application in mind for some of that fancy stuff, but you didn't say you did, and you didn't give an example, so I wonder what value you see in enhancing the features of a program that already has way more features than have ever been used.

      Please take this as a serious question, not as rhetoric.

        If you used HTML::TreeBuilder::XPath it would be even more powerful.
        Not for me; I don't know how to write an xpath expression.
        You should really give it a try, it's one of the few fine things coming from the XML world. I once wrote a utility called xmlgrep, which uses XPath expressions for extracting things from HTML or XML files. For extracting links one would write:
        GET http://www.perlmonks.org | xmlgrep -parse-html '//a/@href'
        but you can also add additional conditions, for example extract only absolute links:
        GET http://www.perlmonks.org | xmlgrep -parse-html '//a/@href[contains(.,"http://")]'
        Not for me; I don't know how to write an xpath expression.
        ...
        I wonder what value you see in enhancing the features of a program that already has way more features than have ever been used.

        Fair enough ... but I suspect if you knew XPath my comment would make more sense.

        You strike me as the kind of guy who whips up little scripts to solve problems a lot -- heck, anyone who uses Perl on a regular basis becomes that kind of person if they weren't already. As you say: right now it's got a feature you've never used (the ability to pick an arbitrary tag name at run time) and if you never use the script again, oh well ... it's not like it took you a lot of work to code it, right? But if at some point in your life you think "I need to get the <h1> tags out of all these HTML pages", you might remember your handy script, use it, and then realize what you really want is the *first* <h1> out of all the files, and you'd probably add a quick option to let you pick the first instance. Then maybe 6 months later you're crunching some more HTML files and you want the "content" attribute of any <meta http-equiv="refresh" ... > tags ... so you crank out another little script.

        Or, if you know XPath, the first time you need something a little more complicated than just all values of all the tags with a certain name, you add about 12 characters to your current script, and start passing some simple XPath expressions on the command line.

        Or you don't.

        Like you say: it's a trivial utility ... if it does everything you want then call it a day and go fishing. To answer your specific question: The value I see in enhancing it comes from the ability to gain a large amount of additional functionality from a small amount of additional work.
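
        For what it's worth, the two hypothetical cases above would come out to something like this with an XPath-capable htmlx (expressions off the top of my head, not tested):

        # the first <h1> in the page
        GET http://example.com/page.html | htmlx '(//h1)[1]'
        # the "content" attribute of any <meta http-equiv="refresh"> tags
        GET http://example.com/page.html | htmlx '//meta[@http-equiv="refresh"]/@content'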

Re: Trivial HTML extractor utility
by brian_d_foy (Abbot) on Nov 22, 2007 at 19:57 UTC

    HTML::SimpleLinkExtor comes with linktractor which does the same thing as linkx. :)

    If you just want TITLE, here's the one that I use:

    #!/usr/bin/perl
    require HTML::HeadParser;

    local( $/ );

    foreach ( @ARGV ) {
        open my( $fh ), "<", $_ or do { warn "$!"; next };

        my $p = HTML::HeadParser->new;
        $p->parse( <$fh> );

        print "$_: ", $p->header( 'title' ), "\n";
    }
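
    Saved under some name and made executable (the script and file names below are just for illustration), it's used like:

    gettitle foo.html bar.html     # prints a "foo.html: Some Page Title" line per file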
    --
    brian d foy <brian@stonehenge.com>
    Subscribe to The Perl Review
      HTML::SimpleLinkExtor comes with linktractor which does the same thing as linkx.
      I had an idea a while back that every Perl module should come with at least one useful demo program. For example, Net::FTP should come with a command-line FTP client program. Text::Template would come with a program that fills a template with values specified on the command line, and prints the results. But the idea never got out of the wishful thinking stage.

      It's nice to know that someone wrote a replacement for HTML::LinkExtor that has an interface that (I presume) doesn't suck quite so hard. The amount of code I had to write for linkx was appalling.

      Thanks for the pointers.
