PixelRat has asked for the wisdom of the Perl Monks concerning the following question:

This is my first perl script and it's supposed to write all image tags/links to a logfile. The only thing that doesn't work is the log. First run works, the second run writes the correct info but does it twice and so on ...
#!/usr/bin/perl use IO::Handle; # Prints URL: and uses the input as STDIN. print "URL: "; INFO: while ($url = <STDIN>) { print "\n"; open(TMP, "> source"); print TMP `lynx -source $url`; print "\n"; print "Connecting to target.\n"; open(TEXT, "< source"); while (<TEXT>) { if ($_ =~ /^.* src=([^>]*)>.*$/ixg) { $text = $text . $1 . "\n"; } } TEXT->autoflush(1); close TEXT; print "Extracted image tags if any. \n"; $now = localtime; $check = `date -I`; open (LOG, ">> $check-spook.log"); print LOG "-------------------------\n", $url, $now, "\n-------------- +-----------\n"; print LOG $text . "\n"; print "Log has been written. \n"; print "Sleeping. \n"; unlink source; sleep 30; redo INFO; }

Edit: chipmunk 2001-11-23

Replies are listed 'Best First'.
Re: Strange Log
by chromatic (Archbishop) on Nov 23, 2001 at 04:56 UTC
    It's only doing what you ask it to do:

    $text = $text . $1 . "\n";

    Since $text is effectively a global variable, it will always get longer. If you were to do something more like this, you wouldn't have this trouble:
    while (1) { # get webpage my $text; while (<TEXT>) { if (/src=([^>]*)>*$/i) { $text .= "$1\n"; } # continue as normal sleep 30; } }
    This ensures that $text is cleared on every iteration of the outer loop.

    Stylewise, there are more Perlish ways to grab webpages. You might like the LWP library, especially LWP::Simple. It really is as simple as the name says. Also, don't forget to check the results of your open calls. They may fail.

    Update: Removed spurious forward whack in regex. There's no need for the anchors, greedy match anythings or the ignore whitespace and match all flags. They don't really hurt anything, though.

      thanks, it worked, but i had to change it a bit to:
      my $text; while (<TEXT>) { if ($_ =~ /^.* src=([^>]*)>.*$/ixg) { $text .= "$1\n"; } }
      when i ran your example i got a syntax error near "/) "
(ar0n) Re: Strange Log
by ar0n (Priest) on Nov 23, 2001 at 06:44 UTC

    It being your first script, I won't beat you over the head with the CPAN bat. However, using regular expressions to parse HTML is generally considered a no-no. It'd probably be better to use HTML::Parser or, my personal favorite, HTML::TokeParser:

    use HTML::TokeParser; # ... my $p = HTML::TokeParser->new($source); while (my $tag = $p->get_tag("img")) { push @src, $tag->[1]{src}; } # ... foreach my $src (@src) { print LOG $src, "\n"; } # ...

    For more information, see the docs on HTML::TokeParser.

    Also, crazyinsomniac twisted my arm to mention HTML::TokeParser Tutorial. There, I said it.

    [ ar0n ]

      ...beat you over the head with the CPAN bat

      A better geek metaphor has never been displayed on my screen! Thanks for that, ar0n!

      Or maybe it's just the beer.