professa has asked for the wisdom of the Perl Monks concerning the following question:

Hi all!

I'm trying to match a slightly variable pattern and I can't see what I'm doing wrong. My pattern looks like this:

u70475_s_at<MAYBE_WHITESPACE><TAB>Mus musculus p45 NF-E2 related factor 2 NRF2 gene exon 2 to exon 5 and complete cds.<MAYBE_WHITESPACE><TAB>2326<MAYBE_TAB>727.3

The second number is optional, numbers can be float or integer.
I try to match the pattern like that:

$_[0] =~ m/([a-zA-Z0-9\-\_\/\.\,]+) \s*\t (.+) \s*\t ([0-9\.]+) \t* ([0-9\.]+)*/;

It all works fine until I get a line with two numbers. In that case the first number (which I want in $3) ends up in $2 together with the text ((.+)), and the 2nd number is in $3.
When I change the last \t* to \t, both numbers are matched the way I want ($3 and $4), but lines with only one number are not matched at all ($1-$4 undef). What's wrong here? Any hints are welcome.

Thanx, Micha

Replies are listed 'Best First'.
Re: Pattern matching with \t
by Masem (Monsignor) on Dec 10, 2001 at 17:08 UTC
    You're making the problem more complex than it needs to be, IMO. Since tab is your delimiter, use that effectively, then finish the job with simple whitespace stripping:
    my ($id, $desc, $num1, $num2) = split /\t/;
    $id   =~ s/\s+$//;    # strip trailing whitespace before the first tab
    $desc =~ s/\s+$//;    # same for the description
    If you have a single-number line, $num2 will be undefined (which evaluates to zero in numeric context). A line with two numbers will have both $num1 and $num2 set.
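
A slightly fuller sketch of the split approach, for illustration only. The helper name split_record and the sample line are assumptions, not part of the original post. It also drops the line ending first, since split leaves any \r or \n attached to the last field.

```perl
use strict;
use warnings;

# split_record is a hypothetical helper name, not from the thread.
sub split_record {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;                        # drop the line ending first
    my ($id, $desc, $num1, $num2) = split /\t/, $line;
    for ($id, $desc) {                             # trim padding before the tabs
        next unless defined;
        s/\s+\z//;
    }
    return ($id, $desc, $num1, $num2);             # $num2 is undef if absent
}

# Illustrative line with padding before the tabs and only one number.
my @fields = split_record("u70475_s_at \tMus musculus p45 \t2326\r\n");
print defined $fields[3] ? "two numbers\n" : "one number\n";
```

Testing $num2 with defined, rather than comparing it to zero, keeps a genuine value of 0 distinguishable from a missing column.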

    -----------------------------------------------------
    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
    "I can see my house from here!"
    It's not what you know, but knowing how to find it if you don't know that's important

Re: Pattern matching with \t
by Hofmator (Curate) on Dec 10, 2001 at 17:17 UTC

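    Your problem is the greedy matching of the (.+) construct, which 'eats up' too much of the string. The dot matches anything that is not an eol character, tabs included (see also Death to Dot Star!), so (.+) first runs to the end of the line and then backtracks only as far as it must for the rest of the pattern to match. As your last part (tab and 2nd number) is completely optional, the regex is satisfied as soon as the final number lands in $3, and the first number stays behind in $2.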
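
The effect can be reproduced with a short script. This is a sketch under assumptions: the sample line is shortened from the one in the question, and the /x modifier is added so that the spaces in the pattern are ignored.

```perl
use strict;
use warnings;

# Illustrative line in the format from the question (description shortened).
my $line = "u70475_s_at\tMus musculus p45 NF-E2 related factor 2\t2326\t727.3";

# The pattern from the question, with /x assumed so the spaces are ignored.
my @greedy = $line =~ m/([a-zA-Z0-9\-_\/.,]+) \s*\t (.+) \s*\t ([0-9.]+) \t* ([0-9.]+)*/x;

# (.+) backtracks only to the LAST tab that is followed by a number,
# so the first number is still part of the second capture.
printf "desc=[%s] num1=[%s] num2=[%s]\n",
    $greedy[1], $greedy[2], defined $greedy[3] ? $greedy[3] : 'undef';
```

Here the second capture holds the description, a tab and 2326; the third holds 727.3; the fourth is undefined.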

    The easiest solution - if you are sure that there is no tab character in your normal text ($2) - is to write the following:

    $_[0] =~ m/([a-zA-Z0-9\-\_\/\.\,]+) \s*\t
               ([^\t]+) \t
               ([0-9\.]+) \t?
               ([0-9\.]+)?
              /x;    # note the x modifier to allow for multiline format
    I also changed your '*' in some cases to '?', which means: match zero or one time. I think this should be the correct multiplicity.

    If the tab is allowed in the text, you have to use the non-greedy variant of '+' instead:

    $_[0] =~ m/([a-zA-Z0-9\-\_\/\.\,]+) \s*\t (.+?) \t ([0-9\.]+) \t? ([0-9\.]+)?/x;
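
To check both cases side by side, the [^\t]+ variant can be wrapped in a small function. This is only a sketch: parse_line is a hypothetical name and the two input lines are made up to follow the format in the question.

```perl
use strict;
use warnings;

# parse_line is a hypothetical helper name, not from the thread.
# In list context it returns the four captures, or an empty list on failure.
sub parse_line {
    my ($line) = @_;
    return $line =~ m/
        ([a-zA-Z0-9\-_\/.,]+) \s*\t   # identifier, optional padding, tab
        ([^\t]+)              \t      # description: everything up to the next tab
        ([0-9.]+)             \t?     # first number, optional tab
        ([0-9.]+)?                    # optional second number
    /x;
}

my @two = parse_line("u70475_s_at\tMus musculus p45\t2326\t727.3\r\n");
my @one = parse_line("u70475_s_at \tMus musculus p45 \t2326\n");

for my $rec (\@two, \@one) {
    print join('|', map { defined $_ ? $_ : 'undef' } @$rec), "\n";
}
```

Two things to note. The line ending never reaches the captures, because no group can match \r or \n after the last number. Padding before the second tab does stay in the description, since [^\t]+ also matches spaces; the non-greedy form with \s*\t in front of the number would leave it out.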

    Update: Fixed link.

    -- Hofmator

      The tabs are only used as a delimiter between the single 'cells' in the line, not in the text parts themselves.
      I tried both methods suggested here, the split method by Masem and yours, and both work fine!
      The advantage of the pattern-matching method is that $1-$4 are free of \n's and \r's, which I had to cut out by substitution afterwards when using split, but it's just a question of personal preference. ;-)
      Thanks to everyone here for the advice!

      Bye, Micha

Re: Pattern matching with \t
by frankus (Priest) on Dec 10, 2001 at 16:58 UTC
    Not tested this but:
    • Used i to indicate case insensitivity.
    • AFAIK most special chars don't need \ in character classes (the / still does, because it is the regex delimiter here).
    • Used x to allow for new lines in regex to aid readability.
    • Used ? in .+ match to make it parsimonious.
    • Used \d instead of 0-9, it means the same thing for plain ASCII input.

    $_[0] =~ m/([A-Z\d_\/.,-]+) \s*\t (.+?) \s*\t ([\d.]+) \t? ([\d.]+)?/ix;

    --

    Brother Frankus.
