
Hello Monks, I'm currently parsing an OMIM database. This is a largish text file, which can be found here. The part I'm interested in is rows like these:

*FIELD* RF
1. Grier, R. E.; Farrington, F. H.; Kendig, R.; Mamunes, P.: Autosomal
dominant inheritance of the Aarskog syndrome. Am. J. Med. Genet. 15:
39-46, 1983.

2. Teebi, A. S.; Rucquoi, J. K.; Meyn, M. S.: Aarskog syndrome: report
of a family with review and discussion of nosology. Am. J. Med. Genet. 46:
501-509, 1993.

3. Welch, J. P.: Elucidation of a 'new' pleiotropic connective tissue
disorder. Birth Defects Orig. Art. Ser. X(10): 138-146, 1974.

*FIELD* CS

I'm trying to convert that into this:

Am. J. Med. Genet.[JO] AND 1983[DP] AND 15[VI] AND 39[PG]
Birth Defects Orig Artic Ser.[JO] AND 1974[DP] AND 10[VI] AND 138[PG]
Am. J. Med. Genet.[JO] AND 1993[DP] AND 46[VI] AND 501[PG]

My script is

#!/usr/bin/perl
#
# parse the pubmed links to searchable format
#
use warnings;
use strict;
use File::Basename;

sub parse_pub ($) {
  my $string = shift @_;
  local $_;

  if ($string =~ m/^.+?:.+?\. (.+)+$/) {
    $_ = $1;
    if (m/(.+?) (\d+): (\d+)-\d+, (\d+)./) {
      return "${1}[JO] AND ${4}[DP] AND ${2}[VI] AND ${3}[PG]";
    } else {
      return undef;
    }
  }
  return;
}

my $omimf = shift @ARGV || "-";

open (INF,"$omimf") or die "Unable to open '$omimf': $!";

my $within = 0;        # within field area
my $key = "";        # current type
my $i = 1;        # line number
my $space = 0;        # was last line space
my $extra = "";        # entries are in multiple lines

while (<INF>) {
  chomp;
  s/\r$//;
  !m/\*FIELD\* RF/ && !$within && next;
  if (m/\*FIELD\*/ && $within) {
    $within = 0;
    exit;
  } elsif (m/\*FIELD\* RF/) {
    $within = 1;
  } else {
    if (!$_) {
      $space = 1;
    } else {
      $space = 0;
    }
    if ($space) {
      chop ($extra);
      if ($extra =~ m/^.+?:.+?\. (.+)+$/) {
        $extra = $1;
        # print "$string,$1\n";
        if ($extra =~ m/(.+?) (\d+): (\d+)-\d+, (\d+)./) {
          print "${1}[JO] AND ${4}[DP] AND ${2}[VI] AND ${3}[PG]\n";
        } else {
          # return undef;
        }
      }
#       if ($key = parse_pub($extra)) {
#         # print "$key\n";
#       } else {
#         print "$extra\n";
#       }
    } else {
      $extra .= "$_ ";
    }
  } # main else
}

exit;

The problem is that with the subroutine the parsing takes a very long time (I had it running for about 450 minutes before it failed for external reasons). Without the sub it took 2.5 hours to process 1.7M of the 2.4M rows. I thought about using qr//, but the Camel book suggests that it helps when interpolating variables into a regex, and I don't have any. So is there a way (or several) to speed this up?
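For reference, here is a minimal sketch of the parsing step on its own, assuming each reference has already been joined onto a single line. The name `parse_ref` is hypothetical. The `(.+)+` group in the script matches the same text as `(.+)`, so the sketch drops the extra `+`; nested quantifiers like that are a common source of heavy backtracking in general, although whether it matters for this data was not measured. The sketch also anchors the second pattern, and it does not handle volumes written like `X(10)`.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one reference (already on a single line) into
# a query string, or return nothing if the citation part does not look
# like "Journal VOL: FIRST-LAST, YEAR."
sub parse_ref {
    my ($ref) = @_;

    # Skip "N. Authors:" up to the first colon, then the title up to the
    # first ". " that follows it; what remains is the citation.
    my ($citation) = $ref =~ /^[^:]+:.+?\. (.+)$/
        or return;

    # Journal name, volume, first page, year. A volume such as "X(10)"
    # is not matched by this sketch.
    my ($journal, $volume, $page, $year) =
        $citation =~ /^(.+?) (\d+): (\d+)-\d+, (\d{4})\.?\s*$/
        or return;

    return "${journal}[JO] AND ${year}[DP] AND ${volume}[VI] AND ${page}[PG]";
}
```

Called on the first sample reference above, joined into one line, this returns the first line of the wanted output; the third sample reference yields nothing because of its `X(10)` volume.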

UPDATE: Found a nasty problem there. I had forgotten to clear one variable, which seemed to cause a cascading problem.

The place where I reset the state at the end of a field was not actually clearing everything :).

  if (m/\*FIELD\*/ && $within) {
    $within = 0;
  }

should be

  if (m/\*FIELD\*/ && $within) {
    $within = 0;
    $extra = "";
  }
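The same idea can be taken one step further by emptying the buffer after every reference, not only at the end of the field, so that no regex ever sees more than one reference. What follows is only a sketch under assumptions: `collect_refs` is a hypothetical name, and it assumes that field headers always start a line as `*FIELD* XX` and that references are separated by blank lines, as in the sample above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the references found in every "*FIELD* RF"
# block, one joined line per reference. $fh is any readable filehandle.
sub collect_refs {
    my ($fh) = @_;
    my @refs;
    my $within = 0;    # true while inside a *FIELD* RF block
    my @parts;         # lines of the reference being assembled

    # Store the assembled reference (if any) and empty the buffer.
    my $flush = sub {
        push @refs, join ' ', @parts if @parts;
        @parts = ();
    };

    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;            # strip LF or CRLF line endings
        if ($line =~ /^\*FIELD\* (\w+)/) {
            $flush->();                  # a field header ends the current reference
            $within = $1 eq 'RF';
        }
        elsif ($within) {
            if ($line =~ /\S/) { push @parts, $line }
            else               { $flush->() }    # blank line ends a reference
        }
    }
    $flush->();                          # the input may end inside a reference
    return @refs;
}
```

Each returned string can then be handed to the parsing step one at a time, which keeps the work per reference roughly constant however long the file is.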

Thanks to those who read this anyways.

In reply to Regex, loops and subs by Hena
