Hello Monks

Again I'm parsing strings that are used in reference. I'm currently facing a problem of how to make regex continue to find another match as I've used anchoring both caret and dollar sign.

The input data is like this
110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's
anemia: anomaly of enzyme passage through the nuclear membrane? Anomalous
intracellular distribution of topoisomerase activity in placental
extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155,
1981.
And I'm trying to seperate the journal name, which in this case is 'Hum. Genet.'.

Now the script I used to read input in file is below
#!/usr/bin/perl # # parse publications strings # use warnings; use strict; use File::Basename; use Data::Dumper; # use open IN => ":any", OUT=> ":utf8"; my $TITLE = 'title'; my $YEAR = 'year'; my $START_PAGE = 'start_page'; my $END_PAGE = 'end_page'; my $JOURNAL = 'journal'; my $TYPE = 'type'; my $AUTHORS = 'authors'; my $VOLUME = 'volume'; sub parse_pub2 ($) { my $string = shift @_; local $_; my %ret = (); # use re "debug"; # pos($string) = 0; # if ($string =~ m/^\d+\.([^:]+): ?((?:[^.]+\([^\)]+\)[^.?!]+|[^.?!] ++)[.?!]) ?(\([\w.]+\))? (.+)$/i) { # if ($string =~ m/^([^:]+): (.+?[.?!]) (\(\w+\.?\) )?([A-Z](?=\w*[. + ]).+)$/i) { # while ($string =~ m/\G^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[ +A-Z]\w+[. ])([A-Z].+)$/g) { while ($string =~ m/^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[A-Z] +\w+[. ])([A-Z].+)$/g) { my $authors = $1; $ret{$TITLE} = $2; if ($4) { $ret{$TYPE} = $3; $_ = $4; } else { $_ = $3; } if (m/^([^:]+) ([\w()]+): (\d+)-(\d+), (\d+)\./) { $ret{$JOURNAL}=$1; $ret{$VOLUME}=$2; $ret{$START_PAGE}=$3; $ret{$END_PAGE}=$4; $ret{$YEAR}=$5; my @array = split (/ /,$ret{$JOURNAL}); # last if (10 > scalar(@{[split (/ /,$ret{$JOURNAL})]})); last if (scalar(@array) < 10); } else { $ret{$JOURNAL}=$_; last; } } return %ret; } my $omimf = shift @ARGV || "-"; open (INF,"$omimf") or die "Unable to open '$omimf': $!"; my $i = 1; # line number my $space = 0; # was last line space my $extra = ""; # some entries are in multiple lines while (<INF>) { m/^#/ && next; chomp; s/\r$//; if (!$_) { $space = 1; } else { $space = 0; } if ($space && $extra) { chop ($extra); print "$extra\n"; my %pub = parse_pub2($extra); #print Dumper(%pub); if (defined($pub{$VOLUME}) && defined($pub{$YEAR}) && defined($pub{$START_PAGE}) && defined($pub{$JOURNAL})) { #print "${pub{$JOURNAL}}[JO] AND ${pub{$YEAR}}[DP] AND ", # "${pub{$VOLUME}}[VI] AND ${pub{$START_PAGE}}[PG]\n"; print "J:$pub{$JOURNAL}\n\n"; } else { #print "NOT PMID($i): "; #foreach (sort keys %pub) { printf "$_ -> %s,",defined($pub{$_}) + ? $pub{$_} : "undef" } #print "\n"; } $extra = ""; } else { $extra .= "$_ "; } } #continue { printf ("\r%d", $i++); } print "\n"; continue { $i++ } exit;
Heres the same but with input as scalar to eliminate some extra code bits (as requested).
#!/usr/bin/perl # # parse publications strings # use warnings; use strict; use Data::Dumper; my $TITLE = 'title'; my $YEAR = 'year'; my $START_PAGE = 'start_page'; my $END_PAGE = 'end_page'; my $JOURNAL = 'journal'; my $TYPE = 'type'; my $AUTHORS = 'authors'; my $VOLUME = 'volume'; sub parse_pub ($) { my $string = shift @_; local $_; my %ret = (); # pos($string) = 0; # if ($string =~ m/^\d+\.([^:]+): ?((?:[^.]+\([^\)]+\)[^.?!]+|[^.?!] ++)[.?!]) ?(\([\w.]+\))? (.+)$/i) { # if ($string =~ m/^([^:]+): (.+?[.?!]) (\(\w+\.?\) )?([A-Z](?=\w*[. + ]).+)$/i) { # while ($string =~ m/\G^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[ +A-Z]\w+[. ])([A-Z].+)$/g) { while ($string =~ m/^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[A-Z] +\w+[. ])([A-Z].+)$/g) { my $authors = $1; $ret{$TITLE} = $2; if ($4) { $ret{$TYPE} = $3; $_ = $4; } else { $_ = $3; } if (m/^([^:]+) ([\w()]+): (\d+)-(\d+), (\d+)\./) { $ret{$JOURNAL}=$1; $ret{$VOLUME}=$2; $ret{$START_PAGE}=$3; $ret{$END_PAGE}=$4; $ret{$YEAR}=$5; my @array = split (/ /,$ret{$JOURNAL}); # last if (10 > scalar(@{[split (/ /,$ret{$JOURNAL})]})); last if (scalar(@array) < 10); } else { last; } } return %ret; } my $line = "110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fa +nconi's anemia: anomaly of enzyme passage through the nuclear membran +e? Anomalous intracellular distribution of topoisomerase activity in +placental extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149 +-155, 1981."; print "$line\n"; my %pub = parse_pub($line); #print Dumper(%pub); print "J:$pub{$JOURNAL}\n\n"; exit;
This prints out
110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's anem +ia: anomaly of enzyme passage through the nuclear membrane? Anomalous + intracellular distribution of topoisomerase activity in placental ex +tracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155, 1981. J:Anomalous intracellular distribution of topoisomerase activity in pl +acental extracts in a case of Fanconi's anemia. Hum. Genet.
What I would want it to print out is
110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's anem +ia: anomaly of enzyme passage through the nuclear membrane? Anomalous + intracellular distribution of topoisomerase activity in placental ex +tracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155, 1981. J:Hum. Genet.
Now obviously as I've set it to find shortest string ending with one of '.!?' the match is correct. However later I test if the journal name (comes from second pattern match) is longer than 10 words. If it is 10 or more, I would like it to continue from where it left off and try to find more. I tried with while(m//g) {}, but no luck. Also using \G I failed. So any idea how to do this?

Thanks.

In reply to Enforcing growth of regex by Hena

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.