Hena has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I'm currently parsing an OMIM database. This is a largish text file which can be found from here. The main bit that I'm intrested in is this kind of rows
*FIELD* RF
1. Grier, R. E.; Farrington, F. H.; Kendig, R.; Mamunes, P.: Autosomal
dominant inheritance of the Aarskog syndrome. Am. J. Med. Genet. 15:
39-46, 1983.

2. Teebi, A. S.; Rucquoi, J. K.; Meyn, M. S.: Aarskog syndrome: report
of a family with review and discussion of nosology. Am. J. Med. Genet. 46:
501-509, 1993.

3. Welch, J. P.: Elucidation of a 'new' pleiotropic connective tissue
disorder. Birth Defects Orig. Art. Ser. X(10): 138-146, 1974.

*FIELD* CS
From that I'm trying to convert to this
Am. J. Med. Genet.[JO] AND 1983[DP] AND 15[VI] AND 39[PG] Birth Defects Orig Artic Ser.[JO] AND 1974[DP] AND 10[VI] 138[PG] Am. J. Med. Genet.[JO] AND 1993[DP] AND 46[VI] AND 501[PG]
My script is
#!/usr/bin/perl # # parse the pubmed links to searchable format # use warnings; use strict; use File::Basename; sub parse_pub ($) { my $string = shift @_; local $_; if ($string =~ m/^.+?:.+?\. (.+)+$/) { $_ = $1; if (m/(.+?) (\d+): (\d+)-\d+, (\d+)./) { return "${1}[JO] AND ${4}[DP] AND ${2}[VI] AND ${3}[PG]"; } else { return undef; } } return; } my $omimf = shift @ARGV || "-"; open (INF,"$omimf") or die "Unable to open '$omimf': $!"; my $within = 0; # within field area my $key = ""; # current type my $i = 1; # line number my $space = 0; # was last line space my $extra = ""; # entries are in multiple lines while (<INF>) { chomp; s/\r$//; !m/\*FIELD\* RF/ && !$within && next; if (m/\*FIELD\*/ && $within) { $within = 0; exit; } elsif (m/\*FIELD\* RF/) { $within = 1; } else { if (!$_) { $space = 1; } else { $space = 0; } if ($space) { chop ($extra); if ($extra =~ m/^.+?:.+?\. (.+)+$/) { $extra = $1; # print "$string,$1\n"; if ($extra =~ m/(.+?) (\d+): (\d+)-\d+, (\d+)./) { print "${1}[JO] AND ${4}[DP] AND ${2}[VI] AND ${3}[PG]\n"; } else { # return undef; } } # if ($key = parse_pub($extra)) { # # print "$key\n"; # } else { # print "$extra\n"; # } } else { $extra .= "$_ "; } } # main else } exit;
The problem is that with the subroutine the parsing will take a very long time (had it running to 450mins or so and it failed for external reasons). Without using sub it whas taken 2.5 hours to process 1.7M rows from 2.4M rows. I though about using qr//, but the camel book suggest that will help when using variables within regex and I don't have them. So is there a way (or multiple) to speed this up?

UPDATE: Found an nasty problem there. I had forgotten to clear one field. Which seemed to cause a cascading problem.
This place where I clear the text found actually was not clearing all fields :).
if (m/\*FIELD\*/ && $within) { $within = 0; }
should be
if (m/\*FIELD\*/ && $within) { $within = 0; $extra = ""; }
Thanks to those who read this anyways.

Replies are listed 'Best First'.
Re: Regex, loops and subs
by tirwhan (Abbot) on Nov 15, 2005 at 10:28 UTC

    Try replacing all the match-everything dots in your regular expressions by negative character classes. So for example, instead of

    m/^.+?:.+?\. (.+)+$/
    do
    m/^[^:]+:[^.]+\. (.+)$/

    (what's with the plus-sign at the end anyway, you're basically saying "match one or many of any character one or many times"?)


    Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
      Is the negative character class faster? I had it before, but thought that using '.' would be slightly faster (though some times this feels like micro optimisation). Since its not a user defined list as such. It might be easier to read though.

        Yes, because it avoids the useless look-aheads and backtracks on failed matches.


        Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
Re: Regex, loops and subs
by GrandFather (Saint) on Nov 15, 2005 at 10:28 UTC

    Note that the result I get from running your code as supplied (except that I changed the read loop to read from __DATA__) is:

    Am. J. Med. Genet.[JO] AND 1983[DP] AND 15[VI] AND 39[PG] Teebi, A. S.; Rucquoi, J. K.; Meyn, M. S.: Aarskog syndrome: report of + a family with review and discussion of nosology. Am. J. Med. Genet.[ +JO] AND 1993[DP] AND 46[VI] AND 501[PG] Am. J. Med. Genet.[JO] AND 1993[DP] AND 46[VI] AND 501[PG]

    which is slightly different to the sample you provided. I suspect that that is not important, but you should confirm that that is the case.


    Perl is Huffman encoded by design.
      Yes. There is a parsing error. However my main problem is the speed. That error should be sufficiently easy to fix. I only pointed out the format that I wanted.
Re: Regex, loops and subs
by GrandFather (Saint) on Nov 15, 2005 at 11:51 UTC

    I've run out of time tonight, but the following may offer some ideas:

    #!/usr/bin/perl # # parse the pubmed links to searchable format # use warnings; use strict; use File::Basename; my $within = 0; # within field area my $space = 0; # was last line space local $/ = '*FIELD* RF'; <DATA>; #Skip prefix stuff $/ = "\n\n"; #Read a "record" at a time while (<DATA>) { chomp; exit if m/\*FIELD\*/; tr/\n/ /d; my $record = $_; if ($record =~ m/^[^:]+:.+?\. (.+)$/) { my $extra = $1; if ($extra =~ m/(.+?) (\d+): (\d+)-\d+, (\d+)./) { print "${1}[JO] AND ${4}[DP] AND ${2}[VI] AND ${3}[PG]\n"; } } }

    Prints:

    Am. J. Med. Genet.[JO] AND 1983[DP] AND 15[VI] AND 39[PG] Am. J. Med. Genet.[JO] AND 1993[DP] AND 46[VI] AND 501[PG]

    Perl is Huffman encoded by design.
Re: Regex, loops and subs
by Tanktalus (Canon) on Nov 15, 2005 at 16:33 UTC

    I wonder if a better way wouldn't be to use something from CPAN. There seem to be a number of omim-related modules there.