Elijah has asked for the wisdom of the Perl Monks concerning the following question:

I have been messing around with some logic on a function and have run into a snag. I know that in regexp pattern matching the script searches a string for a certain pattern. If the pattern is found, the entire string containing that pattern (separated by whitespace) is returned and not just the matched pattern.

I need a way to return just the matched pattern from the whole matched string. I cannot think of a way to separate the pattern from its string.

Here is an example program that I test against my own site.

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my($target, $response);

    if (!$ARGV[0]) {
        print "Enter launch site! (don't forgot to include the \"http://\")\n";
        exit(0);
    }

    while (1) {
        exit(0) if(($target) && ($target eq $target));
        $target = $ARGV[0] unless($target);
        $response = get($target) || die "Cannot get page!\n";
        my @results = split(/ /, $response);
        foreach (@results) {
            if ($_ =~ m/(http\:\/\/).+/) {
                print "URL found: ".$_,"\n";
            }
        }
    }

Any help is appreciated.

Replies are listed 'Best First'.
Re: Extract substring from string with no whitespace using regexp?
by mildside (Friar) on Feb 26, 2004 at 03:23 UTC
    Use $1, $2 etc.

    After a match, $1 will contain the part of the match in the first parentheses, and $2 will contain the part of the match in the second parentheses and so on.

    Cheers!

      ...where the parentheses are numbered by the position of the left parenthesis, so "abc" =~ /((a)b(c))/ will set $1 to "abc", $2 to "a", and $3 to "c".

      If you use a regex that has parentheses in list context, the substrings are returned in a list, and can be used directly:

          my ($abc, $a, $c) = "abc" =~ /((a)b(c))/;

      See perlop for more information about m// and s///, what they return, and how flags like //g affect them.
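
Applied to the original question, the numbering rule above might look like the following minimal sketch. The sample string, the variable names, and the character classes used to end the URL are assumptions for illustration, not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical sample input; in the original program this would be one
# whitespace-separated chunk of the fetched page.
my $chunk = '<a href="http://www.example.com/index.html">home</a>';

# Groups are numbered by the position of their opening parenthesis:
# group 1 is the whole URL, group 2 is only the host name.
my ($url, $host);
if ($chunk =~ m{(http://([^/"\s]+)[^"\s]*)}) {
    ($url, $host) = ($1, $2);
    print "URL found: $url\n";
    print "Host: $host\n";
}

# The same match in list context returns the captures directly.
my ($url2, $host2) = $chunk =~ m{(http://([^/"\s]+)[^"\s]*)};
```

Only the captured text is printed, not the whole chunk that happened to contain it.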

Re: Extract substring from string with no whitespace using regexp?
by graff (Chancellor) on Feb 26, 2004 at 06:03 UTC
    This is not directly related to your question: I'm wondering if I'm missing something... Was there some special purpose to be served by this part of your code:
        if (!$ARGV[0]) {
            print "Enter launch site! (don't forgot to include the \"http://\")\n";
            exit(0);
        }
        while (1) {
            exit(0) if(($target) && ($target eq $target));
            $target = $ARGV[0] unless($target);
            ...

    Or is it just meant to be equivalent to the following?

        my $Usage = "$0 http://target.url\n";
        die $Usage unless ( @ARGV == 1 and $ARGV[0] =~ m{^http://} );
        $target = shift;
        ...

    The point is that you don't need a while loop, especially not one that suggests iteration will continue indefinitely until some condition is met -- personally, I find the latter version more appropriate (and easier). Also, I can't imagine any situation where "$target eq $target" could ever be false (given that $target is a simple scalar string), so I don't understand why you would test that.

    As another aside: some users might be grateful if it didn't insist that they type "http://" (and why not support "ftp://" as well) -- normal browsers have this flexibility:

        die $Usage unless (@ARGV == 1 and
                           $ARGV[0] =~ m{^((?:http|ftp)://)?(\w*.*)}i and $2);
        $target = ( $1 ) ? shift
                : ( substr($2,0,3) eq "ftp" ? "ftp://$2" : "http://$2" );
        ...

    This version uses a grouping expression (?:http|ftp) which does not assign the matched region to a "capture variable" ($1, $2, etc); but if such a region is followed by "://", then that whole string (including the slashes) is assigned to $1. But even when nothing matches that first part of the regex, the remainder of the string (if any, and if it begins with something that matches \w) is always captured into $2. The assignment to $target is the whole arg if both captures were non-empty; otherwise, a suitable prefix is applied to $2 -- lots of ftp servers are using "ftp." as the first part of the actual host name, whereas web servers now answer to "site.dom" as well as "www.site.dom".
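
The same idea can be wrapped in a small function, which makes it easier to try out on a few inputs. This is only a sketch: normalize_target is a hypothetical name, and the pattern here requires the remainder to start with a \w character, which follows the description above rather than any tested original.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): returns the argument with a
# scheme attached, or undef if nothing usable was given.
sub normalize_target {
    my ($arg) = @_;
    return unless defined $arg
        and $arg =~ m{^((?:http|ftp)://)?(\w.*)}i;
    my ($scheme, $rest) = ($1, $2);
    return $arg if $scheme;                   # scheme already present
    return lc(substr($rest, 0, 3)) eq 'ftp'   # guess from the host name
        ? "ftp://$rest"
        : "http://$rest";
}

print normalize_target($_), "\n"
    for qw(http://www.example.com ftp.example.com www.example.com);
```

Guessing the scheme from the first three characters of the host name is a heuristic; a host called "ftpmirror.example.com" that only serves HTTP would be guessed wrong.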

    The output of "perldoc perlre" is very long, but very rewarding to the careful reader. Browsing through that man page is time well spent.

Re: Extract substring from string with no whitespace using regexp?
by Abigail-II (Bishop) on Feb 26, 2004 at 10:30 UTC
    I need a way to return just the matched pattern from the whole matched string.
    Use $&! It's ok in your program because searching for matches is the main thing you are doing. Otherwise, move up the closing paren (currently, the parens in the regex serve no purpose at all) to the end, and use $1. Or remove the parens, add a /g, and capture the return value in list context. Or use Regexp::Common and /$RE{URI}{HTTP}{-keep}/ and $1.

    Abigail
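
A minimal sketch of the first three options side by side (the Regexp::Common variant is left out because that module is not part of core Perl). The sample text and variable names are made up, and \S+ stands in for .+ because this sketch does not split the input on spaces first.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the fetched page.
my $page = 'see http://www.example.com/a and http://www.example.org/b today';

# Option 1: $& holds the text matched by the last successful match.
my $first_with_amp;
if ($page =~ m{http://\S+}) {
    $first_with_amp = $&;
}

# Option 2: parentheses around the whole URL, then read $1.
my $first_with_capture;
if ($page =~ m{(http://\S+)}) {
    $first_with_capture = $1;
}

# Option 3: no parentheses, /g, list context: every match is returned.
my @all = $page =~ m{http://\S+}g;

print "URL found: $_\n" for @all;
```

Options 1 and 2 give only the first URL per match; option 3 collects all of them in one go.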

coding standards
by TomDLux (Vicar) on Feb 26, 2004 at 18:02 UTC

    You have ...

        foreach (@results) {
            if ($_ =~ m/(http\:\/\/).+/) {
                print "URL found: ".$_,"\n";
            }
        }

    The reason the "default variable", $_, exists is to simplify chunks of code which all operate on the same variable by eliminating explicit reference to the variable. Admittedly, some people question whether this is simplification and clarification, or obfuscation, but that's a separate discussion thread. As far as your code is concerned, I suggest it is pointless to assign to the default variable in the loop construct, and then explicitly state the variable within it.

    Either make use of the way the various functions use the default variable:

        foreach (@results) {
            if (/(http\:\/\/).+/) {
                print "URL found: ";
                print;
                print "\n";
            }
        }

    Or else use a variable and give it a meaningful name. This is the more robust option, as it lets you use other constructs that set the default variable without worrying about the outer value being trashed and, more importantly, it conveys meaning to readers.

        foreach my $url (@results) {
            if ($url =~ m/(http\:\/\/).+/) {
                print "URL found: " . $url, "\n";
            }
        }
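
To illustrate the "outer value being trashed" point, here is a small sketch. The arrays and the in-memory file are invented for the demonstration; the behaviour shown is that while (<$fh>) assigns to $_ without localizing it, and inside a foreach without a named variable $_ is an alias for the current array element.

```perl
use strict;
use warnings;

my @with_default = ('http://a.example', 'http://b.example');
my @with_named   = ('http://a.example', 'http://b.example');
my $data = "line 1\nline 2\n";   # stand-in for some file being read

# With $_ as the loop variable, the inner while (<$fh>) assigns to the
# same $_, which is aliased to the current array element.
foreach (@with_default) {
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    while (<$fh>) { }            # reads lines into $_, ends with $_ undefined
    close $fh;
}

# With a named lexical loop variable, the inner loop's $_ is unrelated.
foreach my $url (@with_named) {
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    while (<$fh>) { }
    close $fh;
}

my $lost = grep { !defined $_ } @with_default;
print "elements lost with \$_: $lost\n";
print "elements kept with \$url: @with_named\n";
```

In the first loop the array elements themselves are overwritten, which is the kind of surprise a named loop variable avoids.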

    P.S.: Back in the 70's, people dropped optional spaces from their BASIC programs, since it made it possible to fit more code into the 8K or 16K available RAM. Now that your CPU has more cache than that, use the spaces you need to make that print statement easily readable.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA