Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

hi, I'm trying to extract two numbers from a line, separated by "to" as well as whitespace, words etc. This code works fine:

    #!/usr/bin/perl
    while(<>){
        print if m!(\d+)to(\d+)!i;
        print "$1\n";
        print "$2\n";
    }

but when I add (.+) the variables ($1, $2) aren't displayed the same:

    #!/usr/bin/perl
    while(<>){
        print if m!(\d+).+to.+(\d+)!i;
        print "$1\n";
        print "$2\n";
    }

The code worked fine with just one (\s) whitespace character between the number and "to", but with the (.+) it truncates the variable or something. Please help me monks, thanks in advance. Jono

Replies are listed 'Best First'.
Re: $1 and regex
by dws (Chancellor) on Aug 27, 2002 at 07:39 UTC
    Consider how

      m!(\d+).+to.+(\d+)!i

    matches against

      123to456

    Since you're using greedy matching, each + will match as much as it can while still letting the overall match succeed. Each .+ must also consume at least one character, so the first .+ takes the "3" away from $1 and the second takes "45" away from $2. The regexp matches like this:

      (12)3to45(6)
       $1      $2

    That's not what you intend. A first cut at fixing this is to rewrite the regex so that it won't gobble up extra digits on either side of "to":

      m!(\d+)\D+to\D+(\d+)!i;

    This will match target strings that have non-digit substrings surrounding the "to", but won't match "123to456", since there are no non-digit characters surrounding "to". If that's a problem, you can take the regex a step further, and write

      m!(\d+)\D*?to\D*?(\d+)!i;

    which will accept zero or more non-digit characters on either side of "to".
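To see the backtracking concretely, here is a minimal sketch that runs the three patterns from this reply against two sample strings. The helper name `captures` and the sample inputs are assumptions made for illustration; they are not part of the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns ($1, $2) on a match, or an empty list otherwise.
sub captures {
    my ($re, $str) = @_;
    return $str =~ $re ? ($1, $2) : ();
}

# The three patterns discussed above, in the order they appear.
my @patterns = (
    [ 'greedy .+' => qr/(\d+).+to.+(\d+)/i     ],
    [ '\D+'       => qr/(\d+)\D+to\D+(\d+)/i   ],
    [ '\D*?'      => qr/(\d+)\D*?to\D*?(\d+)/i ],
);

# Assumed sample inputs: one with no separators, one with spaces around "to".
for my $str ('123to456', '123 to 456') {
    for my $p (@patterns) {
        my ($name, $re) = @$p;
        my @got = captures($re, $str);
        printf "%-10s %-12s %s\n", $name, "'$str'",
            @got ? "\$1=$got[0] \$2=$got[1]" : 'no match';
    }
}
```

Against 123to456 the greedy .+ pattern should report $1=12 and $2=6, as in the breakdown above, the \D+ pattern should report no match, and the \D*? pattern should report both numbers in full.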

      m!(\d+)\D*?to\D*?(\d+)!i;

      Why not just:

      m!(\d+)\D*to\D*(\d+)!i
      ?
      There's no reason to make the \D*s non-greedy, is there?

      -sauoq
      "My two cents aren't worth a dime.";
      
        There's no reason to make the \D*s non-greedy, is there?

        When in doubt, make your regexes non-greedy. You'll stay out of a lot of trouble that way.

Re: $1 and regex
by tmiklas (Hermit) on Aug 27, 2002 at 07:22 UTC
    If I understand you correctly, your input line looks like '3256 nothing special 356216'. In this case it's enough to match digits, non-digits and digits again...

      #!/usr/bin/perl
      while (<>) {
          print if m/(\d+)\D+(\d+)/;
          print "$1\n$2\n";
      }

    The output is:

      tm@norad:~$ ./regex_test
      123 testing... testing 456
      123 testing... testing 456
      123
      456

    UPDATE: after reading next replies...
    1. .+ isn't the best notation - it shouldn't catch the digits following (\d+), but it's not especially safe IMO, and above all it makes the code more difficult to understand
    2. this 'to', whitespace or other words you mentioned in your problem are covered by \D+, which IMvHO is more universal because it catches anything except digits

    Greetz, Tom.
Re: $1 and regex
by Django (Pilgrim) on Aug 27, 2002 at 07:25 UTC
    Maybe you should try ".*?", the fewest possible anythings:
    while (<>) { / (\d+) .*? to .*? (\d+) /x and print "$1\n$2\n"; }
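A small sketch comparing the lazy .*? pattern from the reply above with the \D* form suggested elsewhere in this thread. The helper name `pair` and the sample inputs are assumptions for illustration. Both forms return the full numbers on simple input, but they can differ when another number sits between the first number and "to", because .*? is allowed to cross digits while \D* is not.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply a pattern and return ($1, $2), or an empty list.
sub pair {
    my ($re, $str) = @_;
    return $str =~ $re ? ($1, $2) : ();
}

my $lazy_dot  = qr/ (\d+) .*? to .*? (\d+) /x;   # fewest possible anythings
my $non_digit = qr/ (\d+) \D* to \D* (\d+) /x;   # only non-digits around "to"

# Assumed sample inputs; the last one has an extra number before "to".
for my $str ('123to456', '123 up to 456', '12 apples 34 to 56') {
    my @lazy = pair($lazy_dot,  $str);
    my @nond = pair($non_digit, $str);
    print "'$str'\n";
    print "  .*? : @lazy\n";
    print "  \\D* : @nond\n";
}
```

On the last sample the lazy pattern should capture 12 and 56, while the \D* pattern should capture 34 and 56. Which one is right depends on what the input lines really look like.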
Re: $1 and regex
by jmcnamara (Monsignor) on Aug 27, 2002 at 07:31 UTC

    The quantifier + matches "one or more times" and is greedy. Therefore the second .+ will match everything after "to" up to the last digit, leaving only that digit for $2. Try something like this instead:

      m/(\d+)\s+to\s+(\d+)/i;

    --
    John.

      While the + quantifier is often described as matching "one or more times", this description conflicts with one's inner hacker. As it is greedy, surely it should be written as something along the lines of "matching more or one times".

      tlhf
      Just trying to make things legible.

Re: $1 and regex
by hotshot (Prior) on Aug 27, 2002 at 07:23 UTC
    The '.+' is greedy and takes the digits too. If you have only whitespace after the 'to', then use '\s+' instead of '.+' (the '.' matches any character, while '\s' matches only whitespace).
    If you may have non-whitespace after the 'to', then try:

      while(<>){
          print if (/^(\d+).+to.+(\d+)$/i);
          print "$1\n";
          print "$2\n";
      }


    Hotshot
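The anchored pattern in the reply above can be checked with a short sketch. The sample line '123 to 456' and the helper name `try_match` are assumptions, since the original poster's data is not shown. The anchors pin the numbers to the start and end of the line, but they do not make .+ any less greedy, so it is worth checking what $2 ends up holding compared with a \D+ variant.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns ($1, $2) on a match, or an empty list otherwise.
sub try_match {
    my ($re, $line) = @_;
    return $line =~ $re ? ($1, $2) : ();
}

my $line = '123 to 456';    # assumed sample input

# Same anchors in both patterns; only the separator part differs.
my @dot_plus  = try_match(qr/^(\d+).+to.+(\d+)$/i,   $line);
my @non_digit = try_match(qr/^(\d+)\D+to\D+(\d+)$/i, $line);

print "anchored .+  : @dot_plus\n";
print "anchored \\D+ : @non_digit\n";
```

With this input the anchored .+ version should still give 123 and 6, because the second .+ takes as many characters as it can and leaves one digit for $2, while the \D+ version should give 123 and 456.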
Re: $1 and regex
by NaSe77 (Monk) on Aug 27, 2002 at 07:27 UTC
    It is quite clear that $1 and $2 are not displayed the same (since the regexes are different...) but you might want to use:

      while(<>){
          print if m!(\d+)\D*to\D*(\d+)!i;
          print "$1\n";
          print "$2\n";
      }
    i hope i got your question right ...

    ----
    NaSe
    :x