Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a huge file, approximately 2MB, of text I need to break apart for ad campaigns.

You can see a sample of it below. There are blocks of text that need to be broken into {CATEGORY} - {keywords} - {summary text} {"reference"} {summary number}.

The category is ALWAYS in capital letters and is separated from the keywords by a hyphen. This is where it gets tricky, though. Sometimes the hyphen has a space before it, sometimes just one after it, and sometimes a space both before AND after it. Either way, a hyphen with one or two spaces around it separates the CATEGORY from the keywords. Hyphens that appear in the normal text will NEVER have spaces before or after them.
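One way to express that separator rule, as a rough sketch only: treat a hyphen as a field separator when it has whitespace on at least one side, and split at most twice so that anything after the second separator stays in one piece. The sub name `split_record` and the sample string are made up for illustration, and header lines are not handled here.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record into category, keywords and the rest.
# A hyphen counts as a separator only with whitespace on at least one side.
sub split_record {
    my ($record) = @_;
    # Limit of 3: only the first two separators are used, so a later
    # spaced hyphen in the summary cannot create extra fields.
    my ($category, $keywords, $rest) = split /\s+-\s*|-\s+/, $record, 3;
    return ($category, $keywords, $rest);
}

my @parts = split_record('ABORTION -Spontaneous, Caffeine- In a review of age-matched subjects');
print join(' | ', @parts), "\n";
```

The limit matters for data like the sample, where the summary contains "alpha- tocopherol" with a space after the hyphen.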

The summary text stops when it reaches quoted words (which are the reference), and the last part of each block is a whole number, which is the reference number.

I need a regex that will break apart each of these and store them into their own variable. Can someone help me with a working regex to do this?

One concern: between some (not all) of the blocks of text that need parsing there are header lines, as shown in the sample below. The first "ABETALIPOPROTEINEMIA" means nothing to us; it's just a category header. Would it be easier to read through the file once with a regex that removes all header lines, then read it again with the regex that breaks each block into its parts?
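A second read of the file is probably not needed: once the text is in a single string, the headers can be dropped in memory with one substitution. This is a sketch under a loud assumption, namely that a header sits on a line of its own and contains nothing but capital letters and spaces. The sub name `strip_headers` is hypothetical.

```perl
use strict;
use warnings;

# Assumption: a header is a whole line of capital letters and spaces only.
# A record line continues with a hyphen and mixed-case text, so it is kept.
sub strip_headers {
    my ($text) = @_;
    $text =~ s/^[A-Z][A-Z ]*\n//mg;    # /m lets ^ match at every line start
    return $text;
}

my $sample = "ABORTION\nABORTION - Spontaneous, Caffeine - In a review of 15 studies.\n";
print strip_headers($sample);
```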

Any help would be much appreciated, regexes aren't my strong point.

ABETALIPOPROTEINEMIA ABETALIPOPROTEINEMIA - Vitamin A, Vitamin E - In a study of 10 subjects who were 3-25 years of age (mean age 14.6 years) who were diagnosed with abetalipoproteinemia during their first year of life and from then on received 100 mg of vitamin E/kg and 10,000-15,000 IU/day of vitamin A compared with 10 age-matched control subjects, levels of plasma carbonyls did not differ significantly between patients and controls. The lag phase of plasma oxidizability was 28.03 minutes in the treated subjects compared with 24.0 minutes in the healthy subjects. Cyclic voltammetry showed a peak potential of 330 mV in all the samples studied, which suggests that the same antioxidants were present in the plasma of the patients and the control subjects. The anodic current of the samples, which measures the concentrations of hydrophilic low-molecular-weight antioxidants, was 5.227 versus 5.38 uA in the patients and control subjects, respectively. These data suggest that enhanced oxidative stress is not apparent in the plasma of abetalipoproteinemia patients receiving long-term supplementation with vitamin A and E. It is noted that the neurologic and ophthalmologic symptoms of abetalipoproteinemia are believed to be, in part, caused by alpha- tocopherol deficiency. "Oxidative Stress in Abetalipoproteinemia Patients Receiving Long-Term Vitamin E and Vitamin A Supplementation," Granot E, Kohen R, Am J Clin Nutr, 2004;79:226- 230. (Address: Esther Granot, E-mail: essst@md.huji.ac.il) 41401
Would be broken into:
CATEGORY: ABETALIPOPROTEINEMIA
Keywords: Vitamin A, Vitamin E
Summary text: In a study of 10 subjects who were 3-25 years of age (mean age 14.6 years) who were diagnosed with abetalipoproteinemia during their first year of life and from then on received 100 mg of vitamin E/kg and 10,000-15,000 IU/day of vitamin A compared with 10 age-matched control subjects, levels of plasma carbonyls did not differ significantly between patients and controls. The lag phase of plasma oxidizability was 28.03 minutes in the treated subjects compared with 24.0 minutes in the healthy subjects. Cyclic voltammetry showed a peak potential of 330 mV in all the samples studied, which suggests that the same antioxidants were present in the plasma of the patients and the control subjects. The anodic current of the samples, which measures the concentrations of hydrophilic low-molecular-weight antioxidants, was 5.227 versus 5.38 uA in the patients and control subjects, respectively. These data suggest that enhanced oxidative stress is not apparent in the plasma of abetalipoproteinemia patients receiving long-term supplementation with vitamin A and E. It is noted that the neurologic and ophthalmologic symptoms of abetalipoproteinemia are believed to be, in part, caused by alpha- tocopherol deficiency.
Reference: "Oxidative Stress in Abetalipoproteinemia Patients Receiving Long-Term Vitamin E and Vitamin A Supplementation,"
Id: 41401

Re: Working with regexes
by Anonymous Monk on Jan 06, 2005 at 19:57 UTC
    Oh, and by the way, the ID number may or may not be on its own line. So I guess that makes it a lot more difficult.
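    This may matter less than it seems once a whole record is held in one string, because `\s` matches a newline as readily as a space. A small sketch, with a hypothetical sub name `trailing_id` and made-up record endings:

```perl
use strict;
use warnings;

# Hypothetical helper: pull the trailing reference number off one record.
sub trailing_id {
    my ($record) = @_;
    # \s matches spaces and newlines alike; \z anchors at the very end.
    return $record =~ /\s(\d+)\s*\z/ ? $1 : undef;
}

# The id follows a space in the first record and a newline in the second.
for my $record ("(Address: ...) 41401\n", "(Address: ...)\n41684\n") {
    my $id = trailing_id($record);
    print defined $id ? "Id: $id\n" : "no id\n";
}
```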
Re: Working with regexes
by holli (Abbot) on Jan 06, 2005 at 20:11 UTC
    # Single-quoted heredoc, so the @ in the e-mail address is not interpolated.
    my $text = <<'TEXT';
    ABETALIPOPROTEINEMIA - Vitamin A, Vitamin E - In a study of 10 subjects who were 3-25 years of age (mean age 14.6 years) who were diagnosed with abetalipoproteinemia during their first year of life and from then on received 100 mg of vitamin E/kg and 10,000-15,000 IU/day of vitamin A compared with 10 age-matched control subjects, levels of plasma carbonyls did not differ significantly between patients and controls. The lag phase of plasma oxidizability was 28.03 minutes in the treated subjects compared with 24.0 minutes in the healthy subjects. Cyclic voltammetry showed a peak potential of 330 mV in all the samples studied, which suggests that the same antioxidants were present in the plasma of the patients and the control subjects. The anodic current of the samples, which measures the concentrations of hydrophilic low-molecular-weight antioxidants, was 5.227 versus 5.38 uA in the patients and control subjects, respectively. These data suggest that enhanced oxidative stress is not apparent in the plasma of abetalipoproteinemia patients receiving long-term supplementation with vitamin A and E. It is noted that the neurologic and ophthalmologic symptoms of abetalipoproteinemia are believed to be, in part, caused by alpha- tocopherol deficiency. "Oxidative Stress in Abetalipoproteinemia Patients Receiving Long-Term Vitamin E and Vitamin A Supplementation," Granot E, Kohen R, Am J Clin Nutr, 2004;79:226- 230. (Address: Esther Granot, E-mail: essst@md.huji.ac.il) 41401
    TEXT

    # Captures: category, keywords, summary text, quoted reference, trailing id.
    if ( $text =~ /([A-Z]+) *- *([^-]+) *- *([^"]+)"([^"]+)".+?([0-9]+)$/sm ) {
        print "CATEGORY $1\n\n";
        print "KEYWORDS $2\n\n";
        print "Summary text: $3\n\n";
        print "Reference: $4\n\n";
        print "Id: $5\n\n";
    }
      Thanks.

      For some reason though, I can't get it to work. Here's the code I'm using.

      #!/usr/bin/perl
      use warnings;
      use strict;

      my $input_data  = "real.txt";
      my $output_data = "output.txt";

      open(FILE, $input_data) or die "Cannot open file $input_data because: $!";
      my @text = <FILE>;
      close(FILE);

      foreach my $line (@text) {
          if ( $line =~ /([A-Z]+) *- *([^-]+) *- *([^"]+)"([^"]+)".+?([0-9]+)$/sm ) {
              print "CATEGORY $1\n\n";
              print "KEYWORDS $2\n\n";
              print "Summary text: $3\n\n";
              print "Reference: $4\n\n";
              print "Id: $5\n\n";
          }
          else {
              print "none";
          }
      }
      Here are the first few text blocks of my file:
      ABETALIPOPROTEINEMIA ABETALIPOPROTEINEMIA - Vitamin A, Vitamin E - In a study of 10 subjects who were 3-25 years of age (mean age 14.6 years) who were diagnosed with abetalipoproteinemia during their first year of life and from then on received 100 mg of vitamin E/kg and 10,000-15,000 IU/day of vitamin A compared with 10 age-matched control subjects, levels of plasma carbonyls did not differ significantly between patients and controls. The lag phase of plasma oxidizability was 28.03 minutes in the treated subjects compared with 24.0 minutes in the healthy subjects. Cyclic voltammetry showed a peak potential of 330 mV in all the samples studied, which suggests that the same antioxidants were present in the plasma of the patients and the control subjects. The anodic current of the samples, which measures the concentrations of hydrophilic low-molecular-weight antioxidants, was 5.227 versus 5.38 uA in the patients and control subjects, respectively. These data suggest that enhanced oxidative stress is not apparent in the plasma of abetalipoproteinemia patients receiving long-term supplementation with vitamin A and E. It is noted that the neurologic and ophthalmologic symptoms of abetalipoproteinemia are believed to be, in part, caused by alpha- tocopherol deficiency. "Oxidative Stress in Abetalipoproteinemia Patients Receiving Long-Term Vitamin E and Vitamin A Supplementation," Granot E, Kohen R, Am J Clin Nutr, 2004;79:226- 230. (Address: Esther Granot, E-mail: etgranot@md.huji.ac.il) 41401 ABORTION ABORTION - Spontaneous, Caffeine - In a review of 15 epidemiologic studies, all of which suffered from important methodologic limitations, most reported a positive association between maternal caffeine intake and the risk of spontaneous abortion. However, the authors conclude that the evidence must be considered equivocal, since these biases would tend to overestimate any association. 
"Maternal Caffeine Consumption and Spontaneous Abortion: A Review of the Epidemiologic Evidence," Signorello LB, McLaughlin JK, Epidemiology, March 2004;15(2):229-239. (Address: Joseph K. McLaughlin, E-mail: jkm@iei.ws) 41684
      Any idea what I'm doing wrong?
        Well, you just can't take a regex that is supposed to match across multiple lines and apply it to single lines one at a time. That won't ever work.
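        To illustrate the point with a made-up two-line record (not taken from the real file): the same pattern finds nothing when tried against each line separately, but matches once the lines are joined into one string.

```perl
use strict;
use warnings;

# Made-up record: the quoted reference and the id sit on the second line.
my @lines = (
    "ABORTION - Spontaneous, Caffeine - In a review of 15 studies.\n",
    "\"Maternal Caffeine Consumption,\" Epidemiology, 2004. 41684\n",
);

my $re = qr/([A-Z]+) *- *([^-]+) *- *([^"]+)"([^"]+)".+?([0-9]+)$/sm;

# Count how many single lines match on their own.
my $line_hits = grep { $_ =~ $re } @lines;

# Join the lines first, as reading the whole file into one string would.
my $whole_hit = join('', @lines) =~ $re ? 1 : 0;

print "per-line matches: $line_hits\n";
print "whole-text match: $whole_hit\n";
```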

        this will do the job:
        #!/usr/bin/perl
        use warnings;
        use strict;

        my $infile  = "input.txt";
        my $outfile = "output.txt";
        my $text;

        {
            local $/ = undef;    # set line separator locally to undef to read in the file as a whole
            open(FILE, $infile) or die "died opening $infile for input: $!\n";
            $text = <FILE>;
            close(FILE);
        }

        open OUT, ">$outfile" or die "died opening $outfile for output: $!\n";

        # repeatedly match searched text, while deleting found occurrences
        # to avoid endless loop
        while ( $text =~ s/([A-Z]+) *- *([^-]+) *- *([^"]+)"([^"]+)".+?([0-9]+)$//sm ) {
            print OUT "CATEGORY $1\n\n",
                      "KEYWORDS $2\n\n",
                      "Summary text: $3\n\n",
                      "Reference: $4\n\n",
                      "Id: $5\n\n";
        }

        close OUT;
        i suggest reading perlre

        Update:

        this one is better because it uses \G to avoid unnecessary backtracking
        #!/usr/bin/perl
        use warnings;
        use strict;

        my $infile  = "input.txt";
        my $outfile = "output.txt";
        my $text;

        {
            local $/ = undef;    # set line separator locally to undef to read in the file as a whole
            open(FILE, $infile) or die "died opening $infile for input: $!\n";
            $text = <FILE>;
            close(FILE);
        }

        open OUT, ">$outfile" or die "died opening $outfile for output: $!\n";

        # repeatedly match searched text, using \G
        # to avoid endless loop
        while ( $text =~ m/\G.*?([A-Z]+) *- *([^-]+) *- *([^"]+)"([^"]+)".+?([0-9]+)$/gsm ) {
            print OUT "CATEGORY $1\n\n",
                      "KEYWORDS $2\n\n",
                      "Summary text: $3\n\n",
                      "Reference: $4\n\n",
                      "Id: $5\n\n";
        }

        close OUT;
Re: Working with regexes
by jbrugger (Parson) on Jan 06, 2005 at 22:19 UTC
    did you try to chomp the line before parsing?