daviddhall has asked for the wisdom of the Perl Monks concerning the following question:

Ok, probably a somewhat dumb question but... Say for instance: (Changed it so it would print nicely)
$line = 'a HREF="../../main.shtml" a HREF="index0002.shtml" Next Page';
How do I get the final HREF?

When I try:
if ($line =~ /a HREF=\"(.*?)\" Next Page/) { print $1; }
It returns most of the line. Any cute, easy way to tell it to only return the LAST match?

Edit 2001-03-06 by tye: Added <code> tags.

Re: Regular Expression Matching
by danger (Priest) on Mar 07, 2001 at 11:37 UTC

    Well, you seem to have a mistaken notion of how non-greedy matching operates (but that's not a rare problem). Consider using a negated character class like: /HREF="([^"]+)"/ instead (assuming you won't find an escaped double-quote in your href).

    Then, one way is to precede your pattern with .* and let greediness and backtracking take care of ensuring your match is the last one on the line:

    my $line = q|a HREF="../../main.shtml" a HREF="index0002.shtml" Next Page|;
    if ($line =~ /.*HREF="([^"]+)"/) {
        print "$1\n";
    }

    Alternatively, you can wrap the match operator in parens (to put it into list context) and use the /g modifier to find all the matches, and then index just the final element in the return list:

    my $line = q|a HREF="../../main.shtml" a HREF="index0002.shtml" Next Page|;
    if (my $link = ($line =~ /HREF="([^"]+)"/g)[-1]) {
        print "$link\n";
    }
(Ovid) Re: Regular Expression Matching
by Ovid (Cardinal) on Mar 07, 2001 at 16:04 UTC
    Be very careful about using regular expressions to parse HTML. What if they use single quotes around attributes? What if they drop the quotes altogether? Your regex could fail.
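
    As a rough illustration of that fragility, here is a sketch of a pattern that tolerates double-quoted, single-quoted, and unquoted values. The sample markup and the @links array are made up for this example, and it still isn't a real parser (it ignores comments, scripts, and attributes split across lines), so treat it as a stopgap rather than a fix:

```perl
use strict;
use warnings;

# Hypothetical sample markup with three quoting styles.
my $html = q{<a href="one.html">1</a> <a HREF='two.html'>2</a> <a href=three.html>3</a>};

my @links;
# One alternative per quoting style; /g in scalar context walks the
# string match by match, and /i accepts both href and HREF.
while ($html =~ /\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi) {
    # Only the group belonging to the alternative that matched is defined.
    push @links, defined $1 ? $1 : defined $2 ? $2 : $3;
}

print "$_\n" for @links;    # one.html, two.html, three.html
```

    Each alternative gets its own capture group, so only one of $1, $2, or $3 is defined per match. For anything beyond simple, well-controlled input, a proper HTML parsing module such as HTML::Parser from CPAN is likely the safer choice.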

    danger pointed out the benefit of using a negated character class. This is not only more precise, it can have huge performance benefits. This node can give you a good background on this.

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

      Thanks everyone! The negated character class is definitely my solution. Ovid, I looked at the discussion you sent and I think I (for the most part) understood it. However, I'm confused about what the "non-backreferencing parentheses", (?:...), are for. I tried to find more info on them and came up a little empty-handed. Thanks for your help!
        Occasionally, you'll find a need to write a complicated regular expression, but you want to be able to group elements of it without capturing them to a dollar/number ($1, $2, etc.) variable. For example, imagine a simple log file in this format:
        line number: action filename
        A typical section of the log may have data as follows:
        9248: OPEN perl.doc
        9249: DELETE incriminating_evidence.txt
        9250: EDIT autoexec.bat
        Ignoring the over-simplicity of this example, what if you wanted to write a logfile analyzer that just extracts the records for files that have been deleted or edited? One way, though perhaps not the best way, to do that would be the following:
        while (<>) {
            if ( /^(\d+):\s(?:EDIT|DELETE)\s(.*)$/ ) {
                $results{ $1 } = $2;
            }
        }
        What the (?:xxx) does is allow me to group that alternation without capturing the value. It's useful in that it is faster than capturing the value and there's no sense in capturing data if I really don't need it (though I'd probably want to know if a file was edited or deleted).

        Also, note that I do have a dot star at the end. This is appropriate in this case because it's doing exactly what I wanted it to do: slurp up the rest of the line.

        Also, in case you weren't aware: a regular expression without a binding operator ('=~' or '!~') automatically matches against $_, as in the above example.
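
        To make the difference concrete, here is a small self-contained sketch that runs both the non-capturing and the capturing form over the sample log lines. The @log array and the %actions hash are names invented for this illustration; the loop sets $_ for each line, so the bare patterns match against it just like in the while (<>) version:

```perl
use strict;
use warnings;

# The sample log records from above, held in an array for the demo.
my @log = (
    '9248: OPEN perl.doc',
    '9249: DELETE incriminating_evidence.txt',
    '9250: EDIT autoexec.bat',
);

my (%results, %actions);
for (@log) {
    # Non-capturing group: nothing is saved for the action,
    # so the filename is still $2.
    if ( /^(\d+):\s(?:EDIT|DELETE)\s(.*)$/ ) {
        $results{$1} = $2;
    }
    # Capturing group: the action becomes $2 and the filename moves to $3.
    if ( /^(\d+):\s(EDIT|DELETE)\s(.*)$/ ) {
        $actions{$1} = "$2 $3";
    }
}

print "$_ => $results{$_}\n" for sort keys %results;
print "$_ => $actions{$_}\n" for sort keys %actions;
```

        The OPEN record is skipped by both patterns. Switching a group between (?:...) and (...) renumbers every capture to its right, which is one more reason to capture only what you intend to use.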

        Cheers,
        Ovid

(tye)Re: Regular Expression Matching
by tye (Sage) on Mar 07, 2001 at 11:38 UTC
    /a HREF="([^"]*)" Next Page/

    As you found, the regex prefers an earlier match, even if the non-greedy *? is forced to be a bit greedy. You are lucky that you have a simple delimiter.
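
    A quick way to see this for yourself is to run the original pattern and the negated-class pattern side by side on the sample line. This is only a sketch; the $lazy and $negated variable names are invented for the comparison:

```perl
use strict;
use warnings;

my $line = q|a HREF="../../main.shtml" a HREF="index0002.shtml" Next Page|;

# Non-greedy: the match still begins at the FIRST "a HREF=", and .*?
# is stretched until '" Next Page' can finally follow it.
my ($lazy) = $line =~ /a HREF="(.*?)" Next Page/;

# Negated class: [^"]* cannot cross a double quote, so the attempt at
# the first HREF fails and the engine moves on to the second one.
my ($negated) = $line =~ /a HREF="([^"]*)" Next Page/;

print "lazy:    $lazy\n";     # ../../main.shtml" a HREF="index0002.shtml
print "negated: $negated\n";  # index0002.shtml
```

    Non-greedy only controls how far a quantifier extends once a starting position has been chosen; it has no say over where the match starts.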

            - tye (but my friends call me "Tye")
Re: Regular Expression Matching
by I0 (Priest) on Mar 07, 2001 at 11:35 UTC
    if( $line =~ /.*a HREF=\"(.*?)\"/s ){ print $1; }