using regex to capture a string and an array

by blackadder (Hermit)
on Nov 01, 2009 at 12:33 UTC

blackadder has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have strings like these:
uk1sxve01205.gfjgjf5.fdhd5
usasxve513.gfdhf4.hgfd4
I am trying to capture the first 3 characters and all the digits up to the first dot from the left, with each digit as a separate array element. So I put this bit of code together:
my ($Site_Code, @RS) = ($String=~ /^(...)....(\d+)/g);
But @RS only contains one element! I expected that (\d+) with /g would place each digit in @RS as its own array element.

Enlightenment, please.


Replies are listed 'Best First'.
Re: using regex to capture a string and an array
by jettero (Monsignor) on Nov 01, 2009 at 13:33 UTC
    That /g is making the whole regex try to match again. I don't think it's helping you here. Try the non-greedy anything match instead and get specific about your delimiter. You *said* what you wanted. It looks like this:
    my ($site, $digits) = $String =~ m{
        ^(.{3})  # the site code
        .+?      # noise, as little as possible though
        (\d+)    # the digits (keepers)
        \.       # the delimiter
    }x;


      Once you've got the digits as a string, you can turn that into an array thus:
      my @RS = split "", $digits;
      (This is included in the suggestion below from BioLion, but it's a bit buried in the code, so I thought I'd post it by itself. No votes required.)
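
      Putting the match and the split together, a minimal sketch might look like the following. The sub name site_and_digits is hypothetical (it does not appear in the thread), and the sample strings are the ones from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is illustrative, not from the thread).
# Returns the 3-character site code followed by the individual digits
# that precede the first dot, or an empty list if the string does not match.
sub site_and_digits {
    my ($string) = @_;
    my ($site, $digits) = $string =~ m{
        ^(.{3})   # the site code
        .+?       # noise, as little as possible
        (\d+)     # the run of digits
        \.        # the delimiter
    }x or return;
    return ($site, split //, $digits);
}

for my $string (qw(uk1sxve01205.gfjgjf5.fdhd5 usasxve513.gfdhf4.hgfd4)) {
    my ($site, @rs) = site_and_digits($string);
    print "$site: @rs\n";
}
```

      Returning an empty list on a failed match lets the caller test the result in list assignment before using it.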

      use JAPH;
      print JAPH::asString();

Re: using regex to capture a string and an array
by BioLion (Curate) on Nov 01, 2009 at 13:53 UTC

    Check out perlre and the bit about greedy matching (in fact the whole thing is worth a read). The (\d+) will capture one *or more* digits, and so will capture all the digits in one slurp. Also, using the dot character is risky: it matches any character (except a newline, by default) and can lead to unexpected matches.
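
    To illustrate the greedy point, here is a small sketch (the variable names are made up for this example) comparing (\d+) with (\d) under /g in list context:

```perl
use strict;
use warnings;

my $digits = '01205';

# \d+ is greedy: a single match swallows the whole run of digits,
# so only one capture comes back.
my @runs = $digits =~ /(\d+)/g;

# \d matches exactly one digit; /g repeats the match,
# so there is one capture per digit.
my @singles = $digits =~ /(\d)/g;

print scalar(@runs),    " element(s): @runs\n";     # 1 element(s): 01205
print scalar(@singles), " element(s): @singles\n";  # 5 element(s): 0 1 2 0 5
```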

    Capturing an unknown number of things with regexes is difficult (cf. a known number of elements of unknown length), so I would suggest keeping it simple and adding an intermediate step:

    use strict;
    use warnings;

    while (<DATA>){
        my $input = $_;
        if ( ## are you sure the format is correct?
            $input =~ m/^(\w{3}) ## match 3 alphanumerics at the start
                        [^\d]*   ## non digits in the middle
                        (\d+)    ## capture all the digits before
                        \.       ## an actual dot
                       /x
            ){
            my $site_code = $1;
            my @rs = split '', $2; ## split the digits up into an array
            print "Input : $input\n\$site_code : \'$site_code\'\n\@rs :\n\t",
              +(join "\n\t", @rs), "\n";
        }
        else{
            ## ... process alternately?
            print "input \'$input\' cannot be processed.\n";
        }
    }
    __DATA__
    uk1sxve01205.gfjgjf5.fdhd5
    usasxve513.gfdhf4.hgfd4
    how_did_this_get_here?
    Just a something something...
Re: using regex to capture a string and an array
by AnomalousMonk (Archbishop) on Nov 01, 2009 at 18:38 UTC
    The following does the job with a single regex (well, almost), but is perhaps a bit too cute to live in production code:

    >perl -wMstrict -le
    "for my $String (@ARGV) {
       my ($Site_Code, @RS) =
         grep defined, $String =~ m{ \A (...) .{4} | \G (\d) }xmsg ;
       local $\" = q{' '};
       print qq{site code: '$Site_Code' digits: '@RS'};
       }
    " uk1sxve01205.gfjgjf5.fdhd5 usasxve513.gfdhf4.hgfd4
    site code: 'uk1' digits: '0' '1' '2' '0' '5'
    site code: 'usa' digits: '5' '1' '3'
    As you will see if you eliminate the grep statement, the regex produces a rain of undefined elements.
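
    A short sketch of where those undefined elements come from, using the same regex on one of the sample strings (the variable names are illustrative): every successful match returns both capture groups, and the group belonging to the alternative that did not match is undef.

```perl
use strict;
use warnings;

my $string = 'usasxve513.gfdhf4.hgfd4';

# Without grep: each match contributes ($1, $2), one of which is undef.
my @raw = $string =~ m{ \A (...) .{4} | \G (\d) }xmsg;

# Show the raw list with undef made visible.
print join(' ', map { defined $_ ? "'$_'" : 'undef' } @raw), "\n";

# With grep: only the defined captures survive.
my ($site_code, @rs) = grep { defined $_ } @raw;
print "site code: '$site_code' digits: '@rs'\n";
```

    The \G anchor also stops the digit collection at the first non-digit, because no later match can start where the previous one ended.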

    I think I would prefer an approach more in line with those given in other replies:

    1. Extract the decimal digits as a single string. This allows you to be very specific about what you want.
    2. If you are interested in the individual digit characters, split the string of digits to an array to get them.
Re: using regex to capture a string and an array
by ikegami (Patriarch) on Nov 01, 2009 at 18:36 UTC

    /g indicates the match should be performed repeatedly, so

    /^(...)....(\d+)/g

    is basically

    /
       ^(...)....(\d+)
       (?: (?s:.*?) ^(...)....(\d+)
       (?: (?s:.*?) ^(...)....(\d+)
       (?: (?s:.*?) ^(...)....(\d+)
       etc
       )?)?)?
    /xg

    (But with less ability to backtrack)
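
    As a rough illustration of what repeated matching returns in list context: the captures of every successful match are concatenated into one list. With the original regex, ^ can only match at the start of the string, so only the first attempt succeeds. The /m variant and the two-line string below are assumptions added purely for contrast; they are not from the thread.

```perl
use strict;
use warnings;

my $one_line = 'uk1sxve01205.gfjgjf5.fdhd5';

# ^ matches only at the start of the string here, so the second attempt
# fails and the list holds just the two captures of the first match.
my @single = $one_line =~ /^(...)....(\d+)/g;
print scalar(@single), ": @single\n";   # 2: uk1 01205

# With /m, ^ also matches at the start of each line, so /g finds a match
# per line and returns two captures for each of them.
my $two_lines = "uk1sxve01205.gfjgjf5.fdhd5\nusasxve513.gfdhf4.hgfd4\n";
my @multi = $two_lines =~ /^(...)....(\d+)/mg;
print scalar(@multi), ": @multi\n";     # 4: uk1 01205 usa 513
```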