gdnew has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks.
I'm very confused with regex..:@

Part of the input file looks like the lines below:
>sp|P49929|SC52_SHEEP
DITCAEPQSVRGLRRLGRKIAHGVKKYG
>gi|1771590|emb|CAA70562.1| temporin B precursor
MFTLKKSLLLLFFLGTINLSLCEEERNAEEERRDEPDERDVQVEKRLLPI

I only want to get the first two values separated by a |.
Output expected:
sp == P49929
gi == 1771590
I tried using the regex below:
##$lines contains the string to be matched
if (/^>(\w+)\|(.*)?\|.*?$line.*\n/i) {
    $flag = 1;
    print OUT "$_";
    print "$1 == $2\n";
}
It prints:
sp == P49929 (as expected)
gi == 1771590|emb|CAA70562.1 (unexpected)
Any suggestions?
Thank you in advance.

Replies are listed 'Best First'.
Re: get some part of the string using regex
by davis (Vicar) on May 17, 2002 at 09:10 UTC
    Hi,
    If I were you, I wouldn't use a regex to split the input data:
    #!/usr/bin/perl -w
    use strict;

    while (<>) {
        chomp;
        next unless ($_ =~ /\|/);    # Skip if no pipe symbol
        my @fields = split '\|', $_;
        $fields[0] =~ s/^>//;        # Remove leading angle-bracket
        print $fields[0], " == ", $fields[1], "\n";
    }
    Cheers

    davis
    Is this going out live?
    No, Homer, very few cartoons are broadcast live - it's a terrible strain on the animator's wrist
      Of course, when using split you can always limit the number of times the string gets split by adding the third parameter:
      my $foo = 'abc|def|ghi|jkl|mno';
      my @bar = split(/\|/, $foo, 3);    # the pipe must be escaped, or split breaks the string into single characters
      That is enough here, since you are only interested in the first two fields.
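
      Applied to one of the header lines from the question, the limit idea might look like the sketch below. The variable names ($header, $db, $id, $rest) are made up for illustration, and the last two lines are only there to show what an unescaped pipe does:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Header line copied from the sample data in the question.
my $header = '>gi|1771590|emb|CAA70562.1| temporin B precursor';

# A limit of 3 yields field 1, field 2, and the unsplit remainder.
my ($db, $id, $rest) = split /\|/, $header, 3;
$db =~ s/^>//;    # drop the leading angle bracket

print "$db == $id\n";

# For comparison: an unescaped '|' is an alternation of two empty
# patterns, so it matches between every pair of characters.
my @chars = split '|', 'ab|cd';
print scalar(@chars), " pieces\n";
```

      The remainder in $rest keeps its pipes untouched, so nothing is lost if the later fields are needed after all.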
Re: get some part of the string using regex
by jmcnamara (Monsignor) on May 17, 2002 at 09:11 UTC

    The problem is that your regex is "greedy": it is matching as far ahead as possible. A negated character class in the regex is one way of dealing with this:
    #!/usr/bin/perl -w
    use strict;

    my @lines = qw(
        >sp|P49929|SC52_SHEEP
        DITCAEPQSVRGLRRLGRKIAHGVKKYG
        >gi|1771590|emb|CAA70562.1| temporin B precursor
    );

    foreach (@lines) {
        if (/^>(\w+)\|([^|]+)/) {
            print "$1 == $2\n";
        }
    }

    __END__
    Prints:
    sp == P49929
    gi == 1771590

    Also, if the second field only contains word characters you could use a simpler match: /^>(\w+)\|(\w+)/
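
    To see the difference side by side, here is a small sketch that runs a greedy capture, the negated character class, and the \w+ variant against the second header line from the question. The variable names are only illustrative, and the greedy pattern is a cut-down version of the original regex rather than the exact one posted:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $header = '>gi|1771590|emb|CAA70562.1| temporin B precursor';

# Greedy: (.*) runs to the end of the string, then backs off only
# as far as the LAST pipe.
my ($greedy) = $header =~ /^>\w+\|(.*)\|/;

# Negated class: [^|]+ cannot step over a pipe, so it stops at the
# first one it meets.
my ($bounded) = $header =~ /^>\w+\|([^|]+)/;

# Word characters only: also stops at the pipe, since '|' is not \w.
my ($wordy) = $header =~ /^>\w+\|(\w+)/;

print "greedy:  $greedy\n";
print "bounded: $bounded\n";
print "wordy:   $wordy\n";
```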

    --
    John.

Re: get some part of the string using regex
by Dog and Pony (Priest) on May 17, 2002 at 09:13 UTC
    If you want a non-greedy dot-star, you need to replace (.*)? with (.*?).

    Maybe you should even consider something like \|([^|]*)\| to be sure you can't cross the boundary of the pipes (|).
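
    A rough sketch of both points, using a header line from the question. The variable names and the trailing "temporin" text in the last two patterns are assumptions picked for illustration, standing in for whatever follows the second pipe in a real pattern:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $header = '>gi|1771590|emb|CAA70562.1| temporin B precursor';

# (.*)? makes the whole group optional; the .* inside stays greedy.
my ($optional) = $header =~ /^>\w+\|(.*)?\|/;

# (.*?) makes the quantifier itself lazy: it stops at the first pipe.
my ($lazy) = $header =~ /^>\w+\|(.*?)\|/;

# A lazy match still grows when the rest of the pattern demands it,
# so here it crosses two pipes to reach "| temporin".
my ($stretched) = $header =~ /^>\w+\|(.*?)\| temporin/;

# [^|]* can never contain a pipe, so the same pattern fails outright.
my ($bounded) = $header =~ /^>\w+\|([^|]*)\| temporin/;

print "optional:  $optional\n";
print "lazy:      $lazy\n";
print "stretched: $stretched\n";
print "bounded:   ", (defined $bounded ? $bounded : '(no match)'), "\n";
```

    A failed match is often easier to notice than a capture that silently swallowed extra fields, which is one argument for the negated class.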


    You have moved into a dark place.
    It is pitch black. You are likely to be eaten by a grue.
      Thanks...
      You saved me...