ssharma has asked for the wisdom of the Perl Monks concerning the following question:

Hello PerlMonks, I am new to Perl and am trying to parse some reports. I am trying to get the organism's name from the description. It works when the field looks like this: ATP synthase F1, alpha subunit [Acidimicrobium ferrooxidans DSM 10331]. I am using this: if ($desc =~ /\[(.*?)\]/) { $org = $1; } But it does not work when the field is a little more complicated: enoyl-[acyl-carrier-protein] reductase [NADH] 2 [Silicibacter sp. TrichCH4B]. I would really appreciate your help. Thanks, Shalabh

Re: regular expression
by NetWallah (Canon) on Mar 16, 2012 at 15:32 UTC
    Assuming the "name" is the LAST entry inside square brackets, you can use something like this:
    perl -e '(@x)=$ARGV[0]=~/\[(.*?)\]/g; print qq|$x[$#x];\n|' "enoyl-[acyl-carrier-protein] reductase [NADH] 2 [Silicibacter sp. TrichCH4B]. "
    #--output---
    Silicibacter sp. TrichCH4B;
    Also, the regular expression is probably better written as:
    /\[([^\]]+)\]/
    (But that assumes matching "]" are guaranteed.)

                 All great truths begin as blasphemies.
                       ― George Bernard Shaw, writer, Nobel laureate (1856-1950)

      As a side note, NetWallah's solution calls perl from the command line -- if you wanted to use it in your code, do this:

      $org = ''; $org = $1 while $desc =~ /\[(.*?)\]/g;
      or
      $org = $desc =~ s/\[(.*?)\]//g ? $1 : '';

      The regex matches every bracketed group, but only the last capture is kept. The first example uses m//g in scalar context, where each call finds just the next match (the /g modifier merely remembers where the previous match ended), so it has to run in a loop.

      The second option does not need the loop, but it deletes every bracketed group from $desc, so only use it if you no longer need the original string.
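A third, non-destructive option is to match in list context, which returns every capture at once. The following is a minimal sketch under the same assumption that the organism is the last bracketed group; the sub name last_bracketed is made up for illustration.

```perl
use strict;
use warnings;

# Return the contents of the last [...] group, or undef if there is none.
# Assumes brackets are not nested, as in the sample descriptions.
sub last_bracketed {
    my ($desc) = @_;
    my @groups = $desc =~ /\[([^\]]+)\]/g;   # list context: all captures
    return @groups ? $groups[-1] : undef;
}

for my $desc (
    'ATP synthase F1, alpha subunit [Acidimicrobium ferrooxidans DSM 10331]',
    'enoyl-[acyl-carrier-protein] reductase [NADH] 2 [Silicibacter sp. TrichCH4B]',
) {
    my $org = last_bracketed($desc);
    print defined $org ? $org : '(no organism found)', "\n";
}
```

Because $desc is only read, the original string is still available afterwards.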

Re: regular expression
by muppetjones (Novice) on Mar 16, 2012 at 18:36 UTC

    Assuming the data you want is the organism information, and assuming that this information is always found in the last bracket:

    $desc =~ /.+\[(.+)\]\s*$ (?{ $org = $1 || ''; })/x;

    Breaking it down:
    /.+ greedily consumes everything up to the last opening bracket,
    \[(.+)\] captures everything within that final pair of brackets,
    \s*$ makes sure we're at the end of the line,
    (?{ $org = $1 || ''; }) stores the data if any was found (|| rather than or, because or binds more loosely than = and would never apply the '' default), and
    /x tells the regex to ignore whitespace in the pattern.

    However, if you're doing this in a loop, I'd recommend the following:

    my ($enz, $org);
    my $rx_enz_org = qr/ (.+)\[(.+)\]\s*$ (?{ $enz = $1 || ''; $org = $2 || ''; }) /x;
    # <begin $desc loop>
    $desc =~ /$rx_enz_org/;
    # <end loop>

    qr// precompiles the regex, giving an extra bit of speed. Also, the regex saves both the enzyme and organism name (if you don't need the enzyme, then just use the previous example).

    Hope this helps!
    P.S. I haven't tested this, but I think it should work for you.
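The same idea can also be written without the embedded (?{ }) block, by assigning the captures in list context. This is only a sketch: the sub name split_enz_org and the sample list are invented for illustration, and the pattern assumes the organism is the final bracketed group on the line.

```perl
use strict;
use warnings;

# Compiled once, reused for every description.
my $rx_enz_org = qr/^(.+?)\s*\[([^\]]+)\]\s*$/;

# Return (enzyme, organism), or an empty list if the line does not fit.
sub split_enz_org {
    my ($desc) = @_;
    my ($enz, $org) = $desc =~ $rx_enz_org or return;
    return ($enz, $org);
}

my @descriptions = (
    'ATP synthase F1, alpha subunit [Acidimicrobium ferrooxidans DSM 10331]',
    'enoyl-[acyl-carrier-protein] reductase [NADH] 2 [Silicibacter sp. TrichCH4B]',
);

for my $desc (@descriptions) {
    my ($enz, $org) = split_enz_org($desc) or next;   # skip lines that do not match
    print "enzyme:   $enz\norganism: $org\n";
}
```

The lazy (.+?) gives way until the rest of the pattern can reach the end of the string, so earlier bracketed groups stay part of the enzyme name.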

      Some comments:
      1. /.+ is not needed once the capture can no longer run across brackets (see point 2). You have anchored your regex at the end of the string, so there is no need to check if there is anything before the part you are looking for.
      2. \[(.+)\] only works if there is no other ] inside the captured text. The capture is greedy, so without the leading .+ it would run from the first [ all the way to the last ]. A non-greedy \[(.+?)\] does not help here, because the $ anchor still forces the match to stretch from the first [ to the final ]. I would have written this as \[([^\]]+)\] (everything up to the first closing bracket), which cannot cross a ].
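A quick way to compare these variants is to run each one against the sample line from the question. The sketch below keeps the end-of-string anchor in every pattern; the helper capture_with and the labels are invented for illustration.

```perl
use strict;
use warnings;

my $desc = 'enoyl-[acyl-carrier-protein] reductase [NADH] 2 [Silicibacter sp. TrichCH4B]';

# Every pattern keeps the end-of-string anchor; only the capture strategy differs.
my @variants = (
    [ 'leading .+, greedy capture', qr/.+\[(.+)\]\s*$/   ],
    [ 'greedy capture only',        qr/\[(.+)\]\s*$/     ],
    [ 'non-greedy capture',         qr/\[(.+?)\]\s*$/    ],
    [ 'negated character class',    qr/\[([^\]]+)\]\s*$/ ],
);

# Return the first capture group, or undef when the pattern does not match.
sub capture_with {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

for my $v (@variants) {
    my ($label, $re) = @$v;
    my $got = capture_with($re, $desc);
    printf "%-28s => %s\n", $label, defined $got ? $got : '(no match)';
}
```

On this input the first and last patterns capture only the organism, while the two middle ones capture everything from the first [ up to the final ].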

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
Re: regular expression
by moritz (Cardinal) on Mar 16, 2012 at 15:24 UTC
Re: regular expression
by locked_user sundialsvc4 (Abbot) on Mar 16, 2012 at 19:30 UTC

    Also, as a matter of course, if you know that this is any kind of standard file-format, check for any CPAN modules (or Regexp::Common entries) that might already have a more thorough and complete parsing solution than anything you would otherwise have to cobble together on your own.

    “Failing that... simplify.”   If the string is complicated but consists of obvious “pieces,” try an algorithm that first, say, splits the string into pieces, then deals further with the particular piece(s) that you need.   “Cleverness,” otherwise known as “Perl golf,” is both difficult to troubleshoot the first time, and nearly impossible to maintain forevermore.   It looks like chicken-scratches ... it is chicken-scratches ...

    One last thought is:   spend a few additional CPU nanoseconds to check the pieces for whatever you can assert to be true, and die if anything is inconsistent.   (Split didn’t find exactly, say, 5 pieces?   DIE!   And so on.)   When dealing with a messy file-parsing situation and a very big file, the only actor on this stage who is in a position to verify that this is not Garbage-In is the computer program itself.   If you put those kinds of tests in, it lets you say, “the program ran to completion without error, therefore it is now very likely that the file (and the program) did not contain any of these errors, and the results obtained are therefore much more likely to be correct and reliable.”
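As a rough illustration of the "take the piece you need, then check it" advice, here is a minimal sketch that avoids a regex for the extraction itself. The sub name organism_from and the specific checks are assumptions, not a known format rule for these reports.

```perl
use strict;
use warnings;

# Pull out the trailing "[organism]" piece with plain string functions,
# and die as soon as the line does not look the way we expect.
sub organism_from {
    my ($desc) = @_;          # works on a copy; the caller's string is untouched
    $desc =~ s/\s+\z//;       # ignore trailing whitespace
    die "Empty description\n" unless length $desc;
    die "Description does not end with ']': $desc\n"
        unless substr($desc, -1) eq ']';

    my $open = rindex($desc, '[');   # position of the last opening bracket
    die "No '[' found in: $desc\n" if $open < 0;

    my $org = substr($desc, $open + 1, length($desc) - $open - 2);
    die "Empty organism name in: $desc\n" unless length $org;
    die "Stray ']' inside organism name in: $desc\n" if index($org, ']') >= 0;
    return $org;
}

print organism_from('enoyl-[acyl-carrier-protein] reductase [NADH] 2 [Silicibacter sp. TrichCH4B]'), "\n";

# A malformed line stops the run instead of silently yielding garbage.
eval { organism_from('ATP synthase F1, alpha subunit'); 1 }
    or print "rejected: $@";
```

In a real run you would normally let the die propagate; the eval here only shows what the failure looks like.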