Strange Log

PixelRat has asked for the wisdom of the Perl Monks concerning the following question:

This is my first perl script and it's supposed to write all image tags/links to a logfile. The only thing that doesn't work is the log. First run works, the second run writes the correct info but does it twice and so on ...

#!/usr/bin/perl

use IO::Handle;

# Prints URL: and uses the input as STDIN. 
print "URL: ";                  
INFO: while ($url = <STDIN>) {
print "\n";

open(TMP, "> source");                                
print TMP `lynx -source $url`;
print "\n";

print "Connecting to target.\n";

open(TEXT, "< source");                               
while (<TEXT>) {
        if ($_ =~ /^.* src=([^>]*)>.*$/ixg)    {
                $text = $text . $1 . "\n";
        }
}
TEXT->autoflush(1);
close TEXT;

print "Extracted image tags if any. \n";

$now = localtime;
$check = `date -I`;

open (LOG, ">> $check-spook.log");
print LOG "-------------------------\n", $url, $now, "\n--------------
+-----------\n";
print LOG $text . "\n";

print "Log has been written. \n";
print "Sleeping. \n";

unlink source;

sleep 30;

redo INFO;

}
[download]

Edit: chipmunk 2001-11-23

Comment on Strange Log Download Code

Replies are listed 'Best First'.
Re: Strange Log by chromatic (Archbishop) on Nov 23, 2001 at 04:56 UTC
It's only doing what you ask it to do: `$text = $text . $1 . "\n";` Since $text is effectively a global variable, it will always get longer. If you were to do something more like this, you wouldn't have this trouble: `while (1) { # get webpage my $text; while (<TEXT>) { if (/src=([^>])>$/i) { $text .= "$1\n"; } # continue as normal sleep 30; } }` [download] This ensures that $text is cleared on every iteration of the outer loop. Stylewise, there are more Perlish ways to grab webpages. You might like the LWP library, especially LWP::Simple. It really is as simple as the name says. Also, don't forget to check the results of your open calls. They may fail. Update: Removed spurious forward whack in regex. There's no need for the anchors, greedy match anythings or the ignore whitespace and match all flags. They don't really hurt anything, though.	[reply] [d/l] [select]
Re: Re: Strange Log by PixelRat (Sexton) on Nov 25, 2001 at 03:46 UTC
thanks, it worked, but i had to change it a bit to: `my $text; while (<TEXT>) { if ($_ =~ /^.* src=([^>])>.$/ixg) { $text .= "$1\n"; } }` [download] when i ran your example i got a syntax error near "/) "	[reply] [d/l]
(ar0n) Re: Strange Log by ar0n (Priest) on Nov 23, 2001 at 06:44 UTC
It being your first script, I won't beat you over the head with the CPAN bat. However, using regular expressions to parse HTML is generally considered a no-no. It'd probably be better to use HTML::Parser or, my personal favorite, HTML::TokeParser: `use HTML::TokeParser; # ... my $p = HTML::TokeParser->new($source); while (my $tag = $p->get_tag("img")) { push @src, $tag->[1]{src}; } # ... foreach my $src (@src) { print LOG $src, "\n"; } # ...` [download] For more information, see the docs on HTML::TokeParser. Also, crazyinsomniac twisted my arm to mention HTML::TokeParser Tutorial. There, I said it. [ ar0n ]	[reply] [d/l]
Re: (ar0n) Re: Strange Log by joealba (Hermit) on Nov 23, 2001 at 10:17 UTC
...beat you over the head with the CPAN bat A better geek metaphor has never been displayed on my screen! Thanks for that, ar0n! Or maybe it's just the beer.	[reply]