amearse has asked for the wisdom of the Perl Monks concerning the following question:

Howdy, I'm trying to clean up this code so that the output will be more usefull. Here is the current output:
Date: Tuesday, August 14, 2001 01:57 PM Nasty website: <A HREF="http://www.fde.com/myaccount/mymail.asp?eID= a1542199776964498e">http://www.fde.com/myaccount/mymail.asp?eID=a15219 +897 76964498e</A> Date: Tue, 14 Aug 01 03:53:06 Pacific Daylight Time <div align=left><FONT face=verdana size=1>To change or remove your sub +scription click here -- <a href="<A HREF="http://www.fde.com/myaccount/mymail.as +p?eID =a1542189776964498e">http://www.fde.com/myaccount/mymail.asp?eID=a1521 +989 776964498e</A>">
Here is the Desired output:
Date: Tuesday, August 14, 2001 01:57 PM http://www.fde.com/myaccount/mymail.asp?eID=a15421989776964498e Date: Tue, 14 Aug 01 03:53:06 Pacific Daylight Time http://www.fde.com/myaccount/mymail.asp? eID=a15421989776964498e
And last but not least, here is the Code.
#!/usr/bin/perl -w use strict; # What file to use my $goober = "rfl.txt"; # Open File and pop it into an array open GOOB, $goober or die "Can't open $goober"; my @goob=<GOOB>; close(GOOB); # Make a flag to prevent dupes and start loop my $flag = 0; foreach (@goob) { if (/Date:/){ #grab and print the date print; $flag = 0; } if (s#(http://[\w./=\?-]+)#<A HREF=\"$1\">$1</A>#sgi and !$flag an +d (/myaccount/)) { # grab the url print; $flag = 1; } }
As always, any advice will be much appreciated. Peace, amearse

Replies are listed 'Best First'.
Re: Goober (better variable names)
by grinder (Bishop) on Aug 30, 2001 at 02:39 UTC

    First off, choose better variables names. $goober may strike you as really funny right now, but in the long run you are going to come back to the code and either a) cringe or b) wonder what purpose it serves.

    Even more insidious is the variable $flag. Okay, so it's a boolean. So what? What are you tracking with it... the fact that you have matched an URL? In that case, give it a better name, such as $seen_url. Get the name right, and you can get rid of the comment.

    You are testing to see whether open works correctly or not, although you should add $! into the error string somewhere, so you have an idea of why it went wrong when it does go wrong.

    A terribly common mistake made by beginners is to open a file, slurp the whole damned mess into an array, and then process the array. This consumes prolifligate amounts of memory, and when you hit gigabyte files you can bring you system to its knees. The canonical way to proceed is as follows:

    open IN, $file or die "Cannot open $file for input: $!\n"; while( <IN> ) { # do your if-flag-two-step } close IN;

    Finally, back to the beginning again, this is best written as

    my $file = shift || 'rlf.txt';

    That is, if you don't specify the name of a file, it will use rlf.txt by default, otherwise it will use whatever you specify on the command line. Much more flexible.

    Hope this helps.

    --
    g r i n d e r
Re: Goober
by George_Sherston (Vicar) on Aug 30, 2001 at 02:46 UTC
    I think the main thing is that this regex
    s#(http://[\w./=\?-]+)#<A HREF=\"$1\">$1</A>#
    is (1) inserting anchor tags that you don't want and (2) spitting out a repeat of the URL - as you see $1 appears twice in the replace string. (BTW, and sorry if I'm telling you something you probably know anyway, but the "#" is here used as a delimiter instead of a "/". It performs the same function, but makes the thing easier to manage because the search string uses so many "/"s.)

    I think this should do more what you want:
    if (#(http://[\w./=\?-]+)#sgi and !$flag and (/myaccount/))
    (You don't need to do any substitution because you just want to print out the URL; you only need the match to check you've got the right line. You could probably even slim down the regexes a bit more by combining them into one.)

    § George Sherston