in reply to Working with regexes

# Single-quoted heredoc so that "@md" in the e-mail address is not interpolated.
my $text=<<'TEXT';
ABETALIPOPROTEINEMIA - Vitamin A, Vitamin E - In a study of 10 subjects who were 3-25 years of age (mean age 14.6 years) who were diagnosed with abetalipoproteinemia during their first year of life and from then on received 100 mg of vitamin E/kg and 10,000-15,000 IU/day of vitamin A compared with 10 age-matched control subjects, levels of plasma carbonyls did not differ significantly between patients and controls. The lag phase of plasma oxidizability was 28.03 minutes in the treated subjects compared with 24.0 minutes in the healthy subjects. Cyclic voltammetry showed a peak potential of 330 mV in all the samples studied, which suggests that the same antioxidants were present in the plasma of the patients and the control subjects. The anodic current of the samples, which measures the concentrations of hydrophilic low-molecular-weight antioxidants, was 5.227 versus 5.38 uA in the patients and control subjects, respectively. These data suggest that enhanced oxidative stress is not apparent in the plasma of abetalipoproteinemia patients receiving long-term supplementation with vitamin A and E. It is noted that the neurologic and ophthalmologic symptoms of abetalipoproteinemia are believed to be, in part, caused by alpha- tocopherol deficiency. "Oxidative Stress in Abetalipoproteinemia Patients Receiving Long-Term Vitamin E and Vitamin A Supplementation," Granot E, Kohen R, Am J Clin Nutr, 2004;79:226- 230. (Address: Esther Granot, E-mail: essst@md.huji.ac.il) 41401
TEXT

if ( $text =~ /([A-Z]+) *- *([^-]+) *- *([^"]+)"([^"]+)".+?([0-9]+)$/sm ) {
    print "CATEGORY $1\n\n";
    print "KEYWORDS $2\n\n";
    print "Summar text: $3\n\n";
    print "Reference: $4\n\n";
    print "Id: $5\n\n";
}

Re^2: Working with regexes
by Anonymous Monk on Jan 06, 2005 at 21:32 UTC
    Thanks.

    For some reason though, I can't get it to work. Here's the code I'm using.

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $input_data = "real.txt";
    my $output_data = "output.txt";

    open(FILE, $input_data) or die "Cannot open file $input_data because: $!";
    my @text = <FILE>;
    close(FILE);

    foreach my $line (@text) {
        if ( $line =~ /([A-Z]+) *- *([^-]+) *- *([^"]+)"([^"]+)".+?([0-9]+)$/sm ) {
            print "CATEGORY $1\n\n";
            print "KEYWORDS $2\n\n";
            print "Summar text: $3\n\n";
            print "Reference: $4\n\n";
            print "Id: $5\n\n";
        }
        else {
            print "none";
        }
    }

    Here are the first few text blocks of my log file:
    ABETALIPOPROTEINEMIA ABETALIPOPROTEINEMIA - Vitamin A, Vitamin E - In a study of 10 subjects who were 3-25 years of age (mean age 14.6 years) who were diagnosed with abetalipoproteinemia during their first year of life and from then on received 100 mg of vitamin E/kg and 10,000-15,000 IU/day of vitamin A compared with 10 age-matched control subjects, levels of plasma carbonyls did not differ significantly between patients and controls. The lag phase of plasma oxidizability was 28.03 minutes in the treated subjects compared with 24.0 minutes in the healthy subjects. Cyclic voltammetry showed a peak potential of 330 mV in all the samples studied, which suggests that the same antioxidants were present in the plasma of the patients and the control subjects. The anodic current of the samples, which measures the concentrations of hydrophilic low-molecular-weight antioxidants, was 5.227 versus 5.38 uA in the patients and control subjects, respectively. These data suggest that enhanced oxidative stress is not apparent in the plasma of abetalipoproteinemia patients receiving long-term supplementation with vitamin A and E. It is noted that the neurologic and ophthalmologic symptoms of abetalipoproteinemia are believed to be, in part, caused by alpha- tocopherol deficiency. "Oxidative Stress in Abetalipoproteinemia Patients Receiving Long-Term Vitamin E and Vitamin A Supplementation," Granot E, Kohen R, Am J Clin Nutr, 2004;79:226- 230. (Address: Esther Granot, E-mail: etgranot@md.huji.ac.il) 41401 ABORTION ABORTION - Spontaneous, Caffeine - In a review of 15 epidemiologic studies, all of which suffered from important methodologic limitations, most reported a positive association between maternal caffeine intake and the risk of spontaneous abortion. However, the authors conclude that the evidence must be considered equivocal, since these biases would tend to overestimate any association. 
"Maternal Caffeine Consumption and Spontaneous Abortion: A Review of the Epidemiologic Evidence," Signorello LB, McLaughlin JK, Epidemiology, March 2004;15(2):229-239. (Address: Joseph K. McLaughlin, E-mail: jkm@iei.ws) 41684
    Any idea what I'm doing wrong?
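      Well, you can't take a regex that is supposed to match across multiple lines and apply it to single lines, one after another. That won't ever work: your loop tests one line at a time, so the pattern never sees a whole record at once.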

      this will do the job:
      #!/usr/bin/perl
      use warnings;
      use strict;

      my $infile = "input.txt";
      my $outfile = "output.txt";
      my $text;

      {
          local $/=undef; #set line separator locally to undef to read in the file as a whole
          open(FILE, $infile) or die "died opening $infile for input: $!\n";
          $text = <FILE>;
          close(FILE);
      }

      open OUT, ">$outfile" or die "died opening $outfile for output: $!\n";

      #repeatedly match searched text, while deleting found occurrences
      #to avoid endless loop
      while ( $text =~ s/([A-Z]+) *- *([^-]+) *- *([^"]+)"([^"]+)".+?([0-9]+)$//sm ) {
          print OUT "CATEGORY $1\n\n",
                    "KEYWORDS $2\n\n",
                    "Summar text: $3\n\n",
                    "Reference: $4\n\n",
                    "Id: $5\n\n";
      }

      close OUT;

      I suggest reading perlre.
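
      To illustrate why the line-by-line loop prints nothing useful, here is a minimal sketch. The two-line record and the helper count_matches are made up for the demonstration (they are not from the real log file); the regex is the one used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record shaped like the log file: the quoted
# reference title sits on the second line.
my $record = join "\n",
    'ABORTION - Spontaneous, Caffeine - In a review of 15 studies.',
    '"Maternal Caffeine Consumption," Epidemiology, 2004. 41684',
    '';

my $re = qr/([A-Z]+) *- *([^-]+) *- *([^"]+)"([^"]+)".+?([0-9]+)$/sm;

# Count how many of the given strings match the pattern.
sub count_matches {
    my @chunks = @_;
    return scalar grep { $_ =~ $re } @chunks;
}

# Line by line: no single line contains both the dashes and the quoted title.
my @lines = split /^/, $record;
print "matches, line by line: ", count_matches(@lines), "\n";

# Whole record in one string: the pattern can cross the newline.
print "matches, whole text:   ", count_matches($record), "\n";
```

      The first count stays at zero however many lines the file has, which is why the text has to be read into a single string before matching.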

      Update:

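      This one is better because it uses \G together with /g, so each match resumes where the previous one ended. That avoids unnecessary backtracking and leaves $text unmodified.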
      #!/usr/bin/perl
      use warnings;
      use strict;

      my $infile = "input.txt";
      my $outfile = "output.txt";
      my $text;

      {
          local $/=undef; #set line separator locally to undef to read in the file as a whole
          open(FILE, $infile) or die "died opening $infile for input: $!\n";
          $text = <FILE>;
          close(FILE);
      }

      open OUT, ">$outfile" or die "died opening $outfile for output: $!\n";

      #repeatedly match searched text, using \G
      #to avoid endless loop
      while ( $text =~ m/\G.*?([A-Z]+) *- *([^-]+) *- *([^"]+)"([^"]+)".+?([0-9]+)$/gsm ) {
          print OUT "CATEGORY $1\n\n",
                    "KEYWORDS $2\n\n",
                    "Summar text: $3\n\n",
                    "Reference: $4\n\n",
                    "Id: $5\n\n";
      }

      close OUT;
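
      If the pattern gets hard to read, one option is to spell it out with /x and collect the fields into hashes. This is only a sketch: the names $record_re and parse_records and the sample data are hypothetical, and the loop relies on plain m//g in scalar context instead of an explicit \G.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as above, spelled out with the x flag. Under x a
# literal space has to be written as [ ], and a hash sign starts a comment.
my $record_re = qr/
    ([A-Z]+)   [ ]* - [ ]*   # category, for example ABORTION
    ([^-]+)    [ ]* - [ ]*   # keywords (must not contain a hyphen)
    ([^"]+)                  # summary text, up to the opening quote
    "([^"]+)"                # quoted reference title
    .+?                      # authors, journal, address
    ([0-9]+) $               # record id at the end of a line
/xsm;

# Return one hash reference per record found in the text.
sub parse_records {
    my ($text) = @_;
    my @records;
    # In scalar context, m//g resumes after the previous match,
    # so the text is scanned once and never modified.
    while ( $text =~ /$record_re/g ) {
        push @records, {
            category  => $1,
            keywords  => $2,
            summary   => $3,
            reference => $4,
            id        => $5,
        };
    }
    return @records;
}

# Hypothetical sample data in the shape of the log file.
my $sample = join "\n",
    'ABORTION - Spontaneous, Caffeine - In a review of 15 studies.',
    '"Maternal Caffeine Consumption," Epidemiology, 2004. 41684',
    'ACNE - Zinc - A second, invented record.',
    '"Some Title," Some Journal, 2004. 41700',
    '';

for my $r ( parse_records($sample) ) {
    print "Id $r->{id}: $r->{category}\n";
}
```

      Note that the keywords capture keeps any trailing space before the second dash, exactly as in the original pattern, so you may want to trim it before printing.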