comment on

Hello Monks

Again I'm parsing strings that are used in reference. I'm currently facing a problem of how to make regex continue to find another match as I've used anchoring both caret and dollar sign.

The input data is like this

110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's
anemia: anomaly of enzyme passage through the nuclear membrane? Anomalous
intracellular distribution of topoisomerase activity in placental
extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155,
1981.

And I'm trying to seperate the journal name, which in this case is 'Hum. Genet.'.

Now the script I used to read input in file is below

#!/usr/bin/perl
#
# parse publications strings
#
use warnings;
use strict;
use File::Basename;
use Data::Dumper;
# use open IN => ":any", OUT=> ":utf8";

my $TITLE = 'title';
my $YEAR = 'year';
my $START_PAGE = 'start_page';
my $END_PAGE = 'end_page';
my $JOURNAL = 'journal';
my $TYPE = 'type';
my $AUTHORS = 'authors';
my $VOLUME = 'volume';

sub parse_pub2 ($) {
  my $string = shift @_;
  local $_;
  my %ret = ();

  # use re "debug";
  # pos($string) = 0;
  # if ($string =~ m/^\d+\.([^:]+): ?((?:[^.]+\([^\)]+\)[^.?!]+|[^.?!]
++)[.?!]) ?(\([\w.]+\))? (.+)$/i) {
  # if ($string =~ m/^([^:]+): (.+?[.?!]) (\(\w+\.?\) )?([A-Z](?=\w*[.
+ ]).+)$/i) {
  # while ($string =~ m/\G^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[
+A-Z]\w+[. ])([A-Z].+)$/g) {
  while ($string =~ m/^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[A-Z]
+\w+[. ])([A-Z].+)$/g) {
    my $authors = $1;
    $ret{$TITLE} = $2;
    if ($4) {
      $ret{$TYPE} = $3;
      $_ = $4;
    } else {
      $_ = $3;
    }
    if (m/^([^:]+) ([\w()]+): (\d+)-(\d+), (\d+)\./) {
      $ret{$JOURNAL}=$1;
      $ret{$VOLUME}=$2;
      $ret{$START_PAGE}=$3;
      $ret{$END_PAGE}=$4;
      $ret{$YEAR}=$5;
      my @array = split (/ /,$ret{$JOURNAL});
      # last if (10 > scalar(@{[split (/ /,$ret{$JOURNAL})]}));
      last if (scalar(@array) < 10);
    } else {
      $ret{$JOURNAL}=$_;
      last;
    }
  }
  return %ret;
}

my $omimf = shift @ARGV || "-";

open (INF,"$omimf") or die "Unable to open '$omimf': $!";

my $i = 1;        # line number
my $space = 0;        # was last line space
my $extra = "";        # some entries are in multiple lines

while (<INF>) {
  m/^#/ && next;
  chomp;
  s/\r$//;
  if (!$_) {
    $space = 1;
  } else {
    $space = 0;
  }
  if ($space && $extra) {
    chop ($extra);
    print "$extra\n";
    my %pub = parse_pub2($extra);
    #print Dumper(%pub);
    if (defined($pub{$VOLUME}) && defined($pub{$YEAR}) &&
        defined($pub{$START_PAGE}) && defined($pub{$JOURNAL})) {
      #print  "${pub{$JOURNAL}}[JO] AND ${pub{$YEAR}}[DP] AND ",
      #           "${pub{$VOLUME}}[VI] AND ${pub{$START_PAGE}}[PG]\n";
      print "J:$pub{$JOURNAL}\n\n";
    } else {
      #print "NOT PMID($i): ";
      #foreach (sort keys %pub) { printf "$_ -> %s,",defined($pub{$_})
+ ? $pub{$_} : "undef" }
      #print "\n";
    }
    $extra = "";
  } else {
    $extra .= "$_ ";
  }
} #continue { printf ("\r%d", $i++); } print "\n";
continue { $i++ }
exit;
[download]

Heres the same but with input as scalar to eliminate some extra code bits (as requested).

#!/usr/bin/perl
#
# parse publications strings
#
use warnings;
use strict;
use Data::Dumper;

my $TITLE = 'title';
my $YEAR = 'year';
my $START_PAGE = 'start_page';
my $END_PAGE = 'end_page';
my $JOURNAL = 'journal';
my $TYPE = 'type';
my $AUTHORS = 'authors';
my $VOLUME = 'volume';

sub parse_pub ($) {
  my $string = shift @_;
  local $_;
  my %ret = ();

  # pos($string) = 0;
  # if ($string =~ m/^\d+\.([^:]+): ?((?:[^.]+\([^\)]+\)[^.?!]+|[^.?!]
++)[.?!]) ?(\([\w.]+\))? (.+)$/i) {
  # if ($string =~ m/^([^:]+): (.+?[.?!]) (\(\w+\.?\) )?([A-Z](?=\w*[.
+ ]).+)$/i) {
  # while ($string =~ m/\G^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[
+A-Z]\w+[. ])([A-Z].+)$/g) {
  while ($string =~ m/^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[A-Z]
+\w+[. ])([A-Z].+)$/g) {
    my $authors = $1;
    $ret{$TITLE} = $2;
    if ($4) {
      $ret{$TYPE} = $3;
      $_ = $4;
    } else {
      $_ = $3;
    }
    if (m/^([^:]+) ([\w()]+): (\d+)-(\d+), (\d+)\./) {
      $ret{$JOURNAL}=$1;
      $ret{$VOLUME}=$2;
      $ret{$START_PAGE}=$3;
      $ret{$END_PAGE}=$4;
      $ret{$YEAR}=$5;
      my @array = split (/ /,$ret{$JOURNAL});
      # last if (10 > scalar(@{[split (/ /,$ret{$JOURNAL})]}));
      last if (scalar(@array) < 10);
    } else {
      last;
    }
  }
  return %ret;
}

my $line = "110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fa
+nconi's anemia: anomaly of enzyme passage through the nuclear membran
+e? Anomalous intracellular distribution of topoisomerase activity in 
+placental extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149
+-155, 1981.";

print "$line\n";
my %pub = parse_pub($line);
#print Dumper(%pub);
print "J:$pub{$JOURNAL}\n\n";
exit;
[download]

This prints out

110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's anem
+ia: anomaly of enzyme passage through the nuclear membrane? Anomalous
+ intracellular distribution of topoisomerase activity in placental ex
+tracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155, 1981.
J:Anomalous intracellular distribution of topoisomerase activity in pl
+acental extracts in a case of Fanconi's anemia. Hum. Genet.
[download]

What I would want it to print out is

110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's anem
+ia: anomaly of enzyme passage through the nuclear membrane? Anomalous
+ intracellular distribution of topoisomerase activity in placental ex
+tracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155, 1981.
J:Hum. Genet.
[download]

Now obviously as I've set it to find shortest string ending with one of '.!?' the match is correct. However later I test if the journal name (comes from second pattern match) is longer than 10 words. If it is 10 or more, I would like it to continue from where it left off and try to find more. I tried with while(m//g) {}, but no luck. Also using \G I failed. So any idea how to do this?

Thanks.

In reply to Enforcing growth of regex by Hena

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.