jwking has asked for the wisdom of the Perl Monks concerning the following question:

Good day, I am looking for a regular expression to extract a number from a string, beginning with 5163 and followed by 8 digits. Ex: if the string is "abcedf 5163 1234 5678" or "1234516323943293", then the regular expression should extract "5163 1234 5678" from the first string and "516323943293" from the second. I can easily set the regular expression to begin with 5163, but I am having problems removing all characters prior to the 5163. Any ideas will be highly appreciated. Thanks, jwking

Replies are listed 'Best First'.
Re: How to write a regular expression to extract a number from an embedded string?
by davido (Cardinal) on Jul 28, 2004 at 17:33 UTC

    The solution isn't all that simple. You mention that you want to match '5163', followed by eight numeric digits. Fine. Your example, however, shows that you also want to allow for spaces embedded within the digits, and while you do want to preserve the spaces, you don't want those spaces to count as part of the eight digits. In other words, you want eight numeric digits following 5163, plus any embedded whitespace.

    You could break it down into smaller problems and tackle it like this:

    use strict;
    use warnings;

    while ( my $string = <DATA> ) {
        my $found = '';
        my $count = 0;
        chomp $string;
        if ( $string =~ /(5163)/g ) {
            $found .= $1;
            while ( $string =~ /([\d ])/g and $count < 8 ) {
                $found .= $1;
                if ( $1 =~ /\d/ ) {
                    $count++;
                }
            }
        }
        print +( $count == 8 )
            ? "$string\t=>\t$found\n"
            : "$string\t\tFAILED:\ttoo few digits matched\n";
    }

    __DATA__
    abcdef 5163 1234 5678
    1234516323493293
    12345163234932934567890
    12345163234567

    This produces output as follows:

    abcdef 5163 1234 5678    =>  5163 1234 5678
    1234516323493293         =>  516323493293
    12345163234932934567890  =>  516323493293
    12345163234567               FAILED: too few digits matched

    If I read your question right, that's what you're looking for. It doesn't check for naughty things like "123451632123abcdef45678". Given that string, it will silently drop the embedded abcdef. If you want to allow for any character to appear (not just spaces) embedded within the number, then substitute the ([\d ]) character class with (\d|.) and be sure to use the /s modifier on the regexp.


    Dave

Re: How to write a regular expression to extract a number from an embedded string?
by diotalevi (Canon) on Jul 28, 2004 at 16:41 UTC

    Whoops. I misread the question. This finds the key tag "5163" and then finds the next eight numbers with optional interspersed whitespace.

    my ($number) = $str =~ /5163((?:\s*\d){8})/;

    # Extract all numbers and whitespace including punctuation frequently significant to numbers
    my $numbers = join ' ', $str =~ /([\d .+-]+)/g;
    # Less whitespace
    $numbers =~ s/(\s)\1+/$1/g;
Re: How to write a regular expression to extract a number from an embedded string?
by Skeeve (Parson) on Jul 28, 2004 at 17:58 UTC
    I think it's as easy as this:
    $_ = "abcedf 5163 1234 5678 or 1234516323943293";
    print "$1\n" while /(5163(?:\s*\d){8})/g;

      You're making the same mistake everyone else is making. He wants eight numeric digits after 5163, and wants to preserve whitespace in addition to the eight numeric digits. You're going to end up returning fewer than the eight numeric digits if the input string contains whitespace. That's not what his question asks.


      Dave

Re: How to write a regular expression to extract a number from an embedded string?
by Eimi Metamorphoumai (Deacon) on Jul 28, 2004 at 17:57 UTC
    As others have said, your specifications are pretty incomplete. Here's another one that will preserve whitespace, but doesn't allow anything but whitespace between the numbers.
    my $numbers = $string =~ /(5163(?:\s*\d){8})/;
    If you want to allow anything in there (not just space, but letters or punctuation) you could use
    my $numbers = $string =~ /(5163(?:\D*\d){8})/;
    It's all about figuring out exactly what you want.

      This example fails because you're counting whitespace as part of the eight digits. Look again at the example the OP gave. He wants eight numeric digits after 5163, with the possibility of embedded whitespace. Your example would match "51631       8". That's not what he was asking for. His example output includes "5163 1234 5678". Your solution would provide "5163 1234 56" because it is counting spaces as part of the eight digits.


      Dave

        No, it isn't. I'm allowing (optional) whitespace before the digits, but it's not counting. Look again at the part that says
        (?:\s*\d){8}
        That is, exactly 8 occurrences of (any number of spaces followed by) a single digit. What my code did do wrong is that it assigned $numbers to the result of the check in scalar context, not in list context. But
        my ($numbers) = $string =~ /(5163(?:\s*\d){8})/;
        will work. If you don't believe me, test it out.
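
A minimal sketch for testing that claim. The sub name `extract_number` and the third sample string are made up for illustration; the pattern is the one quoted above. It also shows what the scalar-context assignment returns.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "5163" plus the next eight digits,
# keeping any whitespace between them, or undef if nothing matches.
sub extract_number {
    my ($string) = @_;
    # The parentheses around $number force list context, so the
    # assignment receives the captured text rather than a true/false flag.
    my ($number) = $string =~ /(5163(?:\s*\d){8})/;
    return $number;
}

for my $string ( 'abcedf 5163 1234 5678', '1234516323943293', '12345163234567' ) {
    my $number = extract_number($string);
    printf "%-24s => %s\n", $string, defined $number ? "'$number'" : 'no match';
}

# Scalar-context assignment yields the success flag, not the capture.
my $flag = 'abcedf 5163 1234 5678' =~ /(5163(?:\s*\d){8})/;
print "scalar context: $flag\n";
```

The first two strings are the OP's examples; the third has only six digits after 5163, so it should report no match.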
Re: How to write a regular expression to extract a number from an embedded string?
by pbeckingham (Parson) on Jul 28, 2004 at 16:42 UTC

    How about this?

    my $string = 'abcdef 5163 1234 5678';
    my ($extracted) = $string =~ /(5163(?:\d{8}|\s\d{4}\s\d{4}))/;
    Update: modified code. Thanks davido

      This solution makes the assumption that if spaces occur, they will always occur at four-digit intervals. The OP didn't say that was the case. While it may be the case, we're not sure.
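
To make the difference concrete, here is a rough comparison of the fixed-interval pattern with the `(?:\s*\d){8}` pattern suggested elsewhere in this thread. The sub names and the second test string are invented for the example, and it assumes list-context assignment so that the captured text is returned.

```perl
use strict;
use warnings;

# Hypothetical helper: spaces allowed only at four-digit intervals.
sub fixed_groups {
    my ($number) = $_[0] =~ /(5163(?:\d{8}|\s\d{4}\s\d{4}))/;
    return $number;
}

# Hypothetical helper: optional whitespace before any of the eight digits.
sub any_spacing {
    my ($number) = $_[0] =~ /(5163(?:\s*\d){8})/;
    return $number;
}

for my $string ( '5163 1234 5678', '5163 12 34 56 78' ) {
    my $fixed = fixed_groups($string);
    my $any   = any_spacing($string);
    printf "%-18s fixed: %-18s any: %s\n",
        $string,
        defined $fixed ? "'$fixed'" : 'no match',
        defined $any   ? "'$any'"   : 'no match';
}
```

Both patterns agree on the OP's sample; only the second one accepts digits grouped in pairs.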

      Your use of tr/// is ineffective, at any rate. It's doing nothing. If evaluated in scalar context it would return the number of spaces found, but if you intended to delete spaces, you need to add the /d modifier. However, that would be contrary to what the OP was asking. In his example output, whitespace is preserved.


      Dave
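
The tr/// code being discussed is no longer shown in the updated post, so the following is only an illustrative sketch of the behaviour described above, with a made-up sample string: without /d, tr/// with an empty replacement list counts matching characters and leaves the string alone; with /d it deletes them.

```perl
use strict;
use warnings;

my $string = '5163 1234 5678';

# No /d and an empty replacement list: nothing is changed, and the
# return value is the number of characters that matched.
my $spaces = ( $string =~ tr/ // );

# With /d the matched characters are deleted (done on a copy here,
# so the original keeps its whitespace).
( my $stripped = $string ) =~ tr/ //d;

print "spaces counted: $spaces\n";
print "original:       '$string'\n";
print "stripped copy:  '$stripped'\n";
```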

Re: How to write a regular expression to extract a number from an embedded string?
by wfsp (Abbot) on Jul 28, 2004 at 16:48 UTC
    Here's my go.
    #!/bin/perl5
    use strict;
    use warnings;

    my @array = (
        'abcedf 5163 1234 5678',
        '1234516323943293'
    );

    for ( @array ){
        /(5163.*)/;
        print "$1\n";
    }
    produces:
    5163 1234 5678
    516323943293

    Update: Whoops. It looks as though I didn't read the question properly. See pbeckingham's comment below.

      Please be aware that your suggestion of using

      .*
      will greedily include everything to the end of the string. The OP specifically states 8 digits, with possible whitespace breaking the digits into groups of 4.

      While your suggestion works on your test data, it will not work on certain other data.
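
As a small illustration of data where it breaks, the sketch below uses an invented input string with text after the number and compares the greedy pattern with the bounded `(?:\s*\d){8}` pattern suggested elsewhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical input: the number is followed by unrelated text.
my $string = 'card 5163 1234 5678 expires 08/04';

# Greedy: .* keeps going to the end of the string.
my ($greedy) = $string =~ /(5163.*)/;

# Bounded: stop after eight digits, allowing whitespace between them.
my ($bounded) = $string =~ /(5163(?:\s*\d){8})/;

print "greedy:  '$greedy'\n";
print "bounded: '$bounded'\n";
```

On input like this the greedy version returns the trailing text as well, while the bounded version stops at the eighth digit.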