I have a huge file, approximately 2MB, of text I need to break apart for ad campaigns.

You can see a sample of it below. There are blocks of text that need to be broken into {CATEGORY} - {keywords} - {summary text} {"reference"} {summary number}.

The category is ALWAYS in capital letters and is separated from the keywords by a hyphen. This is where it gets tricky, though. Sometimes the hyphen has a space before it, sometimes just one after it, and sometimes a space both before AND after it. Either way, a hyphen with a space on at least one side separates the CATEGORY from the keywords. Hyphens that appear in the normal text will NEVER have spaces before or after them.
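A minimal sketch of that separator rule, written in Python only for illustration (the pattern itself should carry over to Perl nearly unchanged). The names `SEPARATOR` and `split_fields` are made up for this example. It assumes the same kind of spaced hyphen also sits between the keywords and the summary, as the layout above suggests. Note that the sample further down contains "alpha- tocopherol", a hyphen followed by a space inside the normal text, so the sketch only splits on the first two separators and leaves the rest alone:

```python
import re

# Assumption: a separator is a hyphen with whitespace on at least one side.
SEPARATOR = re.compile(r"\s+-\s*|-\s+")

def split_fields(block):
    """Split a block into [category, keywords, rest] on the first two separators."""
    # maxsplit=2 keeps later spaced hyphens (e.g. "alpha- tocopherol") untouched.
    return [part.strip() for part in SEPARATOR.split(block, maxsplit=2)]

if __name__ == "__main__":
    demo = "ABETALIPOPROTEINEMIA -Vitamin A, Vitamin E- In a study of subjects 3-25 years of age"
    for field in split_fields(demo):
        print(field)
```

Limiting the split to two separators is a design choice, not something the data guarantees; a category or keyword list that itself contains a spaced hyphen would still be split in the wrong place.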

The summary text stops at the first quoted words (which are the reference), and last but not least, a whole number at the very end of each block is the reference number.
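The rules so far could be combined into one pattern roughly like the following. This is a sketch in Python, not a tested solution for the whole 2MB file. The names `BLOCK` and `parse_block` are hypothetical, and the sketch assumes each block has already been isolated as one string with any header text removed, that the reference is the first double-quoted run, and that the number is the last thing in the block:

```python
import re

BLOCK = re.compile(
    r'''
    ^\s*
    (?P<category>[A-Z][^a-z]*?)    # capitals only, as short as possible
    (?:\s+-\s*|-\s+)               # hyphen with a space on at least one side
    (?P<keywords>.+?)              # up to the next spaced hyphen
    (?:\s+-\s*|-\s+)
    (?P<summary>.+?)               # up to the first double quote
    \s*
    (?P<reference>"[^"]*")         # the quoted reference title
    .*?                            # authors, journal, address (not captured)
    (?P<number>\d+)\s*$            # whole number at the very end of the block
    ''',
    re.VERBOSE | re.DOTALL,
)

def parse_block(block):
    """Return the five fields as a dict, or None if the block does not fit."""
    match = BLOCK.match(block)
    return match.groupdict() if match else None

if __name__ == "__main__":
    demo = ('ABETALIPOPROTEINEMIA - Vitamin A, Vitamin E - In a study of 10 '
            'subjects who were 3-25 years of age. "Oxidative Stress in '
            'Abetalipoproteinemia Patients," Granot E, Kohen R, 2004;79:226- 230. 41401')
    fields = parse_block(demo)
    for name in ("category", "keywords", "summary", "reference", "number"):
        print(name, "=>", fields[name])
```

Named groups keep each part in its own variable. A summary that itself contains a double quote would end early under this pattern, so blocks that return `None` or look truncated are worth checking by hand.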

I need a regex that will break each block apart and store each part in its own variable. Can someone help me with a working regex to do this?

One concern of mine: between some (not all) of the blocks of text that need parsing there are header text lines, as shown in the sample below. The first "ABETALIPOPROTEINEMIA" means nothing to us; it's just a category header. Would it be easier to read through the file once and apply a regex that removes all header lines, then read the file again with the regex that breaks things apart into their proper groups?
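One possibility is that a second pass is not needed: header lines could be skipped while the file is read. The sketch below (Python, with the hypothetical helpers `is_header` and `drop_headers`) assumes each header sits on a line of its own, is entirely in capitals, and contains no spaced hyphen. That is an assumption about the real file; in the sample as pasted here the header runs straight into the block.

```python
import re

# Same separator assumption as for the fields: a hyphen with a space on at least one side.
SEPARATOR = re.compile(r"\s+-\s*|-\s+")

def is_header(line):
    """Guess whether a line is a bare header: capitals only and no spaced hyphen."""
    text = line.strip()
    if not text or not any(ch.isalpha() for ch in text):
        return False  # blank lines and pure numbers are not headers
    return text == text.upper() and not SEPARATOR.search(text)

def drop_headers(lines):
    """Yield every line that does not look like a header, in a single pass."""
    for line in lines:
        if not is_header(line):
            yield line

if __name__ == "__main__":
    demo = [
        "ABETALIPOPROTEINEMIA\n",
        'ABETALIPOPROTEINEMIA - Vitamin A, Vitamin E - Some summary. "A Title," 41401\n',
    ]
    for kept in drop_headers(demo):
        print(kept, end="")
```

Because `drop_headers` is a generator, it can wrap an open file handle directly, so the 2MB file is read once and never has to be held in memory twice.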

Any help would be much appreciated; regexes aren't my strong point.

ABETALIPOPROTEINEMIA ABETALIPOPROTEINEMIA - Vitamin A, Vitamin E - In a study of 10 subjects who were 3-25 years of age (mean age 14.6 years) who were diagnosed with abetalipoproteinemia during their first year of life and from then on received 100 mg of vitamin E/kg and 10,000-15,000 IU/day of vitamin A compared with 10 age-matched control subjects, levels of plasma carbonyls did not differ significantly between patients and controls. The lag phase of plasma oxidizability was 28.03 minutes in the treated subjects compared with 24.0 minutes in the healthy subjects. Cyclic voltammetry showed a peak potential of 330 mV in all the samples studied, which suggests that the same antioxidants were present in the plasma of the patients and the control subjects. The anodic current of the samples, which measures the concentrations of hydrophilic low-molecular-weight antioxidants, was 5.227 versus 5.38 uA in the patients and control subjects, respectively. These data suggest that enhanced oxidative stress is not apparent in the plasma of abetalipoproteinemia patients receiving long-term supplementation with vitamin A and E. It is noted that the neurologic and ophthalmologic symptoms of abetalipoproteinemia are believed to be, in part, caused by alpha- tocopherol deficiency. "Oxidative Stress in Abetalipoproteinemia Patients Receiving Long-Term Vitamin E and Vitamin A Supplementation," Granot E, Kohen R, Am J Clin Nutr, 2004;79:226- 230. (Address: Esther Granot, E-mail: essst@md.huji.ac.il) 41401
Would be broken into:
CATEGORY: ABETALIPOPROTEINEMIA
Keywords: Vitamin A, Vitamin E
Summary text: In a study of 10 subjects who were 3-25 years of age (mean age 14.6 years) who were diagnosed with abetalipoproteinemia during their first year of life and from then on received 100 mg of vitamin E/kg and 10,000-15,000 IU/day of vitamin A compared with 10 age-matched control subjects, levels of plasma carbonyls did not differ significantly between patients and controls. The lag phase of plasma oxidizability was 28.03 minutes in the treated subjects compared with 24.0 minutes in the healthy subjects. Cyclic voltammetry showed a peak potential of 330 mV in all the samples studied, which suggests that the same antioxidants were present in the plasma of the patients and the control subjects. The anodic current of the samples, which measures the concentrations of hydrophilic low-molecular-weight antioxidants, was 5.227 versus 5.38 uA in the patients and control subjects, respectively. These data suggest that enhanced oxidative stress is not apparent in the plasma of abetalipoproteinemia patients receiving long-term supplementation with vitamin A and E. It is noted that the neurologic and ophthalmologic symptoms of abetalipoproteinemia are believed to be, in part, caused by alpha- tocopherol deficiency.
Reference: "Oxidative Stress in Abetalipoproteinemia Patients Receiving Long-Term Vitamin E and Vitamin A Supplementation,"
Id: 41401

In reply to Working with regexes by Anonymous Monk
