Stamp_Guy has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I'm looking into making a program that would open a batch of HTML files, find all the image tags, calculate the size of the images (in pixels) and insert the correct width="" and height="" values. I am using Image::Size for getting the actual sizes, but here's where I'm stuck: How do I get the filename from the HTML? I could probably "roll-my-own", but I've been told that I should probably ask first before I do something like that. Has anyone else here done something of this sort? Can you guys give me any suggestions or ideas? They would be greatly appreciated. Thanks!

-Stamp_Guy
If winners never quit and quitters never win, who's the fool who came up with "quit while you're ahead"?

(jeffa) Re: Image Size in an HTML file
by jeffa (Bishop) on Jun 02, 2001 at 19:32 UTC
    I think the best way to parse the HTML files is with HTML::Parser.

    This snippet will extract the src attributes of any img tags that are found:

        use strict;
        use IO::File;
        use HTML::Parser;   # version 3.15, by the way

        # get the contents of the HTML file
        my $fh = new IO::File('google.html');
        my $html = do {local $/; <$fh>};

        my $parser = HTML::Parser->new(api_version => 3);
        $parser->handler(start => \&start, 'self,tagname,attr');
        $parser->parse($html);

        sub start {
            my ($parser,$tag,$attr) = @_;
            return unless $tag eq 'img';
            # insert code to process the image file
            print $attr->{src}, "\n";
        }
    From here you can add code to open the image file, and for the fun part, insert the new value in . . .
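
    One way the "fun part" might look, as a rough sketch rather than a finished tool: let HTML::Parser copy everything through unchanged and only touch the img start tags. The name add_image_sizes is made up for this example, and it assumes each src is a plain file path readable from the current directory (no URLs, no base href). The sizes come from imgsize in Image::Size, as in the original question.

```perl
use strict;
use warnings;
use HTML::Parser 3;              # the handler-style API needs version 3 or later
use Image::Size qw(imgsize);

# Hypothetical helper: returns a copy of $html with width/height added to <img> tags.
sub add_image_sizes {
    my ($html) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Anything without its own handler is copied through verbatim.
        default_h => [ sub { $out .= shift }, 'text' ],
        start_h   => [
            sub {
                my ($tag, $attr, $text) = @_;
                if (   $tag eq 'img'
                    && defined $attr->{src}
                    && !exists $attr->{width}
                    && !exists $attr->{height})
                {
                    # Assumption: src is a local path relative to the current directory.
                    my ($w, $h) = imgsize($attr->{src});
                    if (defined $w && defined $h) {
                        # Add the attributes just before the closing ">" or "/>".
                        $text =~ s{\s*(/?)>\z}{ width="$w" height="$h"$1>};
                    }
                }
                $out .= $text;
            },
            'tagname,attr,text',
        ],
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}
```

    Tags that already carry a width or height, and images that imgsize cannot read, are left exactly as they were, so running it twice over the same file should do no harm.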

    Jeff

    R-R-R--R-R-R--R-R-R--R-R-R--R-R-R--
    L-L--L-L--L-L--L-L--L-L--L-L--L-L--
    
      Hey Jeffa,
      Thanks for the code snippet. I tried running it though and I got this error: "Can't locate object method "handler" via package "HTML::Parser" at parser.pl line 14 <GEN0> chunk 1". Any idea what's wrong?
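
      A possible cause, offered as a guess rather than a diagnosis: the handler() method belongs to the version 3 API of HTML::Parser (the snippet above notes it was written against 3.15), so an older installed copy would fail in just this way. A quick check of what is installed might look like this:

```perl
use strict;
use warnings;
use HTML::Parser;

# Report the installed version and whether the handler() method exists.
printf "HTML::Parser %s, handler() %s\n",
    HTML::Parser->VERSION,
    HTML::Parser->can('handler') ? 'available' : 'missing';
```

      If it reports handler() as missing, upgrading HTML::Parser from CPAN would be the first thing to try.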

      -Stamp_Guy

Re: Image Size in an HTML file
by merlyn (Sage) on Jun 02, 2001 at 20:40 UTC

      You might also want to look at HTML::LinkExtor, which I don't think existed when merlyn wrote that column, but which is designed specifically around this kind of problem.

      Update: merlyn is, of course, right--LinkExtor won't give you the context for reinserting the tags (bad ChemBoy! No coffee!). However, the reason I pointed it out is that HTML::Filter is deprecated--if you're going to write your own, similar program, HTML::TokeParser or HTML::PullParser is a more appropriate solution.
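
      For comparison, a minimal sketch of the pull-style approach with HTML::TokeParser. It only collects the src values, so it answers the "how do I get the filename" part and not the rewriting part; image_sources is a name invented for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns the src of every <img> tag in an HTML string.
sub image_sources {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser: $!";

    my @srcs;
    # get_tag('img') skips ahead to the next <img> start tag and returns
    # an array ref of [tagname, \%attr, \@attrseq, $text].
    while (my $tag = $p->get_tag('img')) {
        my $attr = $tag->[1];
        push @srcs, $attr->{src} if defined $attr->{src};
    }
    return @srcs;
}

print "$_\n" for image_sources('<p><img src="a.gif"> text <IMG SRC="b.png" alt="b"></p>');
```

      Tag and attribute names come back lowercased, so the uppercase IMG in the sample is picked up as well.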



      If God had meant us to fly, he would *never* have given us the railroads.
          --Michael Flanders

        Well, HTML::LinkExtor is fine if you just want the links, but in a transformation like this, you also need all the non-link text as well. Unless you were just replacing the entire file with only a bunch of images. {grin}

        -- Randal L. Schwartz, Perl hacker

      I'm beginning to think that Vroom should place a "search merlyn's columns" box next to the "search cpan" box. That way we can find the answers to our questions more easily and quickly. :~)

      Stuffy
      That's my story, and I'm sticking to it, unless I'm wrong in which case I will probably change it ;~)