Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
A sample of the file is shown below. There are blocks of text that need to be broken into {CATEGORY} - {keywords} - {summary text} {"reference"} {summary number}.
The category is ALWAYS in capital letters and is separated from the keywords by a hyphen. This is where it gets tricky, though: sometimes the hyphen has a space before it, sometimes just one after it, and sometimes a space before AND after it. In every case, a hyphen with a space on at least one side separates the CATEGORY from the keywords. Hyphens that appear in the normal text will NEVER have spaces before or after them.
The summary text ends where the quoted words (the reference) begin. Last but not least, each block ends with a whole number, which is the reference number (the {summary number} above).
I need a regex that will break each block apart and store each piece in its own variable. Can someone help me with a working regex to do this?
One concern: between some (not all) of the blocks of text that need parsing are header lines, as shown in the sample below. The first "ABETALIPOPROTEINEMIA" means nothing to us; it's just a category header. Would it be easier to read through the file once, apply a regex that removes all header lines, then read the file again with the regex that breaks things apart into their proper groups?
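A second pass may not be necessary. As a rough sketch, assuming a header line contains nothing but capital letters and whitespace (which is a guess based on the single header in the sample), header lines could be filtered out while the file is being read. The helper names `is_header_line` and `drop_header_lines` are made up for illustration, and the record line used here is a shortened stand-in for the real data.

```perl
use strict;
use warnings;

# Assumption: a header line holds only capital letters and whitespace,
# whereas a real record always continues with a hyphen-separated field.
sub is_header_line {
    my ($line) = @_;
    return $line =~ /^\s*[A-Z][A-Z\s]*$/ ? 1 : 0;
}

# Keep every line that does not look like a bare category header.
sub drop_header_lines {
    my (@lines) = @_;
    return grep { !is_header_line($_) } @lines;
}

# Shortened stand-in for two lines of the input file.
my @lines = (
    "ABETALIPOPROTEINEMIA\n",
    "ABETALIPOPROTEINEMIA - Vitamin A, Vitamin E - In a study of 10 subjects\n",
);

my @kept = drop_header_lines(@lines);
print scalar(@kept), " line(s) kept\n";
```

If a category name can contain digits or punctuation, the character class would need widening; the sample alone does not say.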
Any help would be much appreciated; regexes aren't my strong point.
ABETALIPOPROTEINEMIA
ABETALIPOPROTEINEMIA - Vitamin A, Vitamin E - In a study of 10 subjects who were 3-25 years of age (mean age 14.6 years) who were diagnosed with abetalipoproteinemia during their first year of life and from then on received 100 mg of vitamin E/kg and 10,000-15,000 IU/day of vitamin A compared with 10 age-matched control subjects, levels of plasma carbonyls did not differ significantly between patients and controls. The lag phase of plasma oxidizability was 28.03 minutes in the treated subjects compared with 24.0 minutes in the healthy subjects. Cyclic voltammetry showed a peak potential of 330 mV in all the samples studied, which suggests that the same antioxidants were present in the plasma of the patients and the control subjects. The anodic current of the samples, which measures the concentrations of hydrophilic low-molecular-weight antioxidants, was 5.227 versus 5.38 uA in the patients and control subjects, respectively. These data suggest that enhanced oxidative stress is not apparent in the plasma of abetalipoproteinemia patients receiving long-term supplementation with vitamin A and E. It is noted that the neurologic and ophthalmologic symptoms of abetalipoproteinemia are believed to be, in part, caused by alpha- tocopherol deficiency. "Oxidative Stress in Abetalipoproteinemia Patients Receiving Long-Term Vitamin E and Vitamin A Supplementation," Granot E, Kohen R, Am J Clin Nutr, 2004;79:226- 230. (Address: Esther Granot, E-mail: essst@md.huji.ac.il) 41401

Would be broken into:

CATEGORY: ABETALIPOPROTEINEMIA
Keywords: Vitamin A, Vitamin E
Summary text: In a study of 10 subjects who were 3-25 years of age (mean age 14.6 years) who were diagnosed with abetalipoproteinemia during their first year of life and from then on received 100 mg of vitamin E/kg and 10,000-15,000 IU/day of vitamin A compared with 10 age-matched control subjects, levels of plasma carbonyls did not differ significantly between patients and controls. The lag phase of plasma oxidizability was 28.03 minutes in the treated subjects compared with 24.0 minutes in the healthy subjects. Cyclic voltammetry showed a peak potential of 330 mV in all the samples studied, which suggests that the same antioxidants were present in the plasma of the patients and the control subjects. The anodic current of the samples, which measures the concentrations of hydrophilic low-molecular-weight antioxidants, was 5.227 versus 5.38 uA in the patients and control subjects, respectively. These data suggest that enhanced oxidative stress is not apparent in the plasma of abetalipoproteinemia patients receiving long-term supplementation with vitamin A and E. It is noted that the neurologic and ophthalmologic symptoms of abetalipoproteinemia are believed to be, in part, caused by alpha- tocopherol deficiency.
Reference: "Oxidative Stress in Abetalipoproteinemia Patients Receiving Long-Term Vitamin E and Vitamin A Supplementation,"
Id: 41401
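The rules described above can be sketched as a single regex. This is only one possible reading of the format, not a tested solution for the whole file. It assumes that the category is all capitals, that a field-separating hyphen has whitespace on at least one side, that the first double-quoted string after the keywords is the reference, and that the block ends with a whole number. The name `parse_record` and the hash keys are hypothetical, and the sample string is a shortened copy of the record above.

```perl
use strict;
use warnings;

# Hypothetical helper: split one block into its five parts.
# Returns a hash reference, or nothing if the block does not match.
sub parse_record {
    my ($record) = @_;

    # A separator is a hyphen with whitespace before it, after it, or both.
    my $sep = qr/(?:\s+-\s*|-\s+)/;

    return unless $record =~ m{
        ^\s*
        ([A-Z][A-Z\s]*?)   # 1: category, capitals only
        $sep
        (.+?)              # 2: keywords, up to the next separator
        $sep
        (.+?)              # 3: summary text, up to the first quote
        \s*
        ("[^"]*")          # 4: quoted reference title
        .*?                #    citation details, not captured
        (\d+)              # 5: trailing whole number
        \s*\z
    }xs;

    return {
        category  => $1,
        keywords  => $2,
        summary   => $3,
        reference => $4,
        id        => $5,
    };
}

# Shortened copy of the sample record (summary text cut down).
my $sample = 'ABETALIPOPROTEINEMIA - Vitamin A, Vitamin E - '
    . 'In a study of 10 subjects who were 3-25 years of age. '
    . '"Oxidative Stress in Abetalipoproteinemia Patients Receiving '
    . 'Long-Term Vitamin E and Vitamin A Supplementation," '
    . 'Granot E, Kohen R, Am J Clin Nutr, 2004;79:226- 230. 41401';

my $rec = parse_record($sample) or die "record did not match\n";
print "$_: $rec->{$_}\n" for qw(category keywords summary reference id);
```

The `/s` flag lets `.` cross newlines, so a block that spans several lines can be matched as one string. A summary that itself contains a double quote, or keywords containing a spaced hyphen, would break these assumptions and need a different approach.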
Replies are listed 'Best First'.

Re: Working with regexes
by Anonymous Monk on Jan 06, 2005 at 19:57 UTC

Re: Working with regexes
by holli (Abbot) on Jan 06, 2005 at 20:11 UTC
by Anonymous Monk on Jan 06, 2005 at 21:32 UTC
by holli (Abbot) on Jan 06, 2005 at 22:25 UTC

Re: Working with regexes
by jbrugger (Parson) on Jan 06, 2005 at 22:19 UTC