Win has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I'm having problems getting a Perl regex to pick out the figures in the following Excel spreadsheet record (saved as a text file).
00CEFA0001 0.973694291 0.013140314 0 0 0 0 0 0 0 0 0.003278308 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.006569697 0 0 0.00331739 0 0 0 0

My efforts have been along the lines of :
if ($line =~ /^0.{9}(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[ +0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20 +})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9 +]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})( +\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1 +,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[ +0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20 +})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9 +]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})( +\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1 +,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})(\t[0-9]{1,20})/) +{

Re: Regex problem
by ikegami (Patriarch) on Oct 05, 2005 at 16:05 UTC

    Maybe the following would be more useful:

    my @fields = split("\t", $_, -1);

    Then you can perform checks on individual fields, if you want. It'll be more readable and maintainable.
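The split-then-check idea above could be sketched roughly as follows. This is only a sketch under assumptions: `parse_record` is a hypothetical helper name, and the number pattern (unsigned integer or decimal) is a guess at what the spreadsheet export contains.

```perl
use strict;
use warnings;

# Hypothetical helper: split one tab-separated record and validate each field.
# Returns the list of numeric fields, or an empty list if the record is malformed.
sub parse_record {
    my ($line) = @_;
    chomp $line;

    # The -1 limit keeps trailing empty fields instead of silently dropping them.
    my ($id, @values) = split /\t/, $line, -1;

    # Same ID check as the original regex: a 0 followed by nine more characters.
    return unless defined $id && $id =~ /^0.{9}\z/;

    # Assumed field format: unsigned integer or decimal, e.g. 0 or 0.973694291
    for my $v (@values) {
        return unless $v =~ /^\d+(?:\.\d+)?\z/;
    }
    return @values;
}

my @values = parse_record("00CEFA0001\t0.973694291\t0.013140314\t0\t0");
print scalar(@values), " fields\n";    # prints "4 fields"
```

If the record must have exactly 49 values, a check such as `@values == 49` after the call would cover that.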

Re: Regex problem
by ides (Deacon) on Oct 05, 2005 at 16:13 UTC
Re: Regex problem
by Perl Mouse (Chaplain) on Oct 05, 2005 at 16:10 UTC
    You're not clear about what you want or what your problem is, but to me it looks like you're checking whether the line starts with a 0 and then, from character 10 onwards, whether it contains only digits and tabs, with no more than 20 digits in a row and no two tabs in a row. Can't you just do something like:
    if ($line =~ /^0/) { my $end = substr ($line, 10); unless ($end =~ /[^0-9\t]/ || $end =~ /\t\t/ || $end =~ /[0-9]{21}/) { .... } }
    Perl --((8:>*
Re: Regex problem
by Skeeve (Parson) on Oct 05, 2005 at 16:27 UTC
    if ($line =~ /^0.{9}(\t\d{1,20}){49}/) {
    This will do the same as your lengthy regex, except that it doesn't capture all the individual numbers (a quantified group keeps only its last match).

    OTOH, neither of our REs will match the line at all, because [0-9] and \d don't match the decimal points in values such as 0.973694291. I think this one will serve the purpose better:
    if (/^0[0-9a-fA-F]{9}(?:\t[\d\.]{1,20}){49}/) {
    Okay: this will also match things like IP addresses, as it doesn't take into account that a number may contain only one decimal point, but maybe that's okay. You're the only one who can tell.
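If the single-decimal-point restriction does matter, a stricter variant might look like the sketch below. The variable names `$number` and `$record` are made up for illustration, the `\z` anchor assumes the line has already been chomped, and the 20-character length cap from the pattern above is dropped for simplicity.

```perl
use strict;
use warnings;

# One field: digits, optionally followed by a single decimal point and more digits.
my $number = qr/\d+(?:\.\d+)?/;

# Assumed record layout: a 10-character hex ID starting with 0,
# then exactly 49 tab-separated numbers and nothing else.
my $record = qr/^0[0-9a-fA-F]{9}(?:\t$number){49}\z/;

# Synthetic test lines, built only to exercise the pattern.
my $good = join "\t", '00CEFA0001', ('0.973694291') x 49;
my $bad  = join "\t", '00CEFA0001', ('1.2.3.4') x 49;

print "good line matches\n"  if $good =~ $record;
print "bad line rejected\n"  if $bad  !~ $record;
```

Building the pattern from a named `qr//` piece keeps the field definition in one place, so it can be tightened later without touching the repetition count.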

    $\=~s;s*.*;q^|D9JYJ^^qq^\//\\\///^;ex;print
Re: Regex problem
by bioMan (Beadle) on Oct 05, 2005 at 16:43 UTC

    I see there's a lot of repetition.

    if ($line =~ /^0.{9} (\t[0-9]{1,20}) etc. etc. etc.

    Why not slurp the file into a scalar (Perl Slurp-eaze), split it at the tabs, and send the data to an array?

    my @excelData = split /\t/, $slurpedExcelFile;

    Update - Oops!

    As pointed out by ikegami, the split statement should read:

    my @excelData = split "\t", $slurpedExcelFile;

    Update - Aaaaaaaa!

    A little testing shows both "\t" and /\t/ will work with split.

    If you need to check the individual values, create a simpler regex and apply it to each array item.

    my @numbers = grep /# your regex/, @excelData;
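Here is one way the slurp, split, and grep steps might fit together. It is a sketch with made-up sample data standing in for the slurped file, and it splits on newlines as well as tabs (my assumption) so that the last field of one row is not glued to the first field of the next.

```perl
use strict;
use warnings;

# Stand-in for the slurped file contents (hypothetical sample data).
my $slurpedExcelFile = "00CEFA0001\t0.973694291\t0\n00CEFA0002\t0.5\tn/a\n";

# Split on tabs and newlines so row boundaries also separate fields.
my @excelData = split /[\t\n]/, $slurpedExcelFile;

# Keep only items that look like unsigned integers or decimals;
# the record IDs and the "n/a" entry are filtered out.
my @numbers = grep { /^\d+(?:\.\d+)?\z/ } @excelData;

print "@numbers\n";    # prints "0.973694291 0 0.5"
```

Note that flattening the whole file this way loses track of which row a value came from, so it only suits checks that don't depend on the row.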

    Mike

    "I need more cow bell!"