in reply to Need help parsing ambiguously formatted data

Dear Geniuses, Gurus, Wizards and other Wise Ones....

I'm trying to process **an array of** data lines that don't always follow the rules. Sometimes one line is split into two, sometimes two lines are "conjoined." Here's a sample:

1. "Microsoft Corporation - DirectShow "
2. "Version 6.4.05.0809 * "
3. "Microsoft Corporation - Internet Server Version "
4. "4.02.0720 * Microsoft Corporation - Internet Explorer "
5. "Version 5.00.2014.200 * "
6. "Microsoft Corporation - Windows Installer - Version 2.0.2 * "
7. "Excel Viewer Version 8.0 * Connectivity Version 2.10.2309 * "

I have code which handles the first two lines (split), and code which handles the last line (conjoined). Where I'm having trouble is with lines three, four, and five. Line three is split, its tail is spliced to the front of line four, which is then split, with its tail as line five. IOW, line four contains the tail of line three and the head of line five.

Does anyone know of a data parsing module that could make sense of this jumble? The required output for the above lines would be:

1. "Microsoft Corporation - DirectShow Version 6.4.05.0809 * "
2. "Microsoft Corporation - Internet Server Version 4.02.0720 * "
3. "Microsoft Corporation - Internet Explorer Version 5.00.2014.200 * "
4. "Microsoft Corporation - Windows Installer - Version 2.0.2 * "
5. "Excel Viewer Version 8.0 * "
6. "Connectivity Version 2.10.2309 * "

But what I actually end up with is:

1. "Microsoft Corporation - DirectShow Version 6.4.05.0809 * "
2. "Microsoft Corporation - Internet Server Version 4.02.0720 * "
3. "Microsoft Corporation - Internet Explorer "
4. "Version 5.00.2014.200 * "
5. "Microsoft Corporation - Windows Installer - Version 2.0.2 * "
6. "Excel Viewer Version 8.0 * "
7. "Connectivity Version 2.10.2309 * "

As you can see, the actual end-of-record marker is " * ", not the line break. I can't change the code that generates the data.
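To make that rule concrete, here is a minimal sketch of one possible approach: ignore the line breaks entirely, glue the array into one string, and cut it again at each " * ". The variable names (`@lines`, `$blob`, `@records`) are made up for illustration, and the sketch assumes that " * " never occurs inside a product name or version string.

```perl
use strict;
use warnings;

# Sample input copied from the post; @lines is a hypothetical variable name.
my @lines = (
    "Microsoft Corporation - DirectShow ",
    "Version 6.4.05.0809 * ",
    "Microsoft Corporation - Internet Server Version ",
    "4.02.0720 * Microsoft Corporation - Internet Explorer ",
    "Version 5.00.2014.200 * ",
    "Microsoft Corporation - Windows Installer - Version 2.0.2 * ",
    "Excel Viewer Version 8.0 * Connectivity Version 2.10.2309 * ",
);

# Undo the arbitrary line breaks by gluing everything into one string.
# Every sample line already ends in a space, so no extra separator is added.
my $blob = join '', @lines;

# Cut at the real end-of-record marker, drop empty pieces,
# then put the marker back so the output matches the required format.
my @records = map { "$_ * " } grep { length } split / \* /, $blob;

print qq{"$_"\n} for @records;
```

With the sample data this yields six records, one per product, regardless of where the original lines were split or conjoined. If the real data can contain lines that do not end in a space, the `join` would need a separator and some whitespace cleanup.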

Thanks!

Re: Parsing, corrected (see ** . . . **)
by Ctrl-z (Friar) on Dec 01, 2004 at 21:27 UTC
    have you tried setting $INPUT_RECORD_SEPARATOR to "*" ?

    edit: wow, this thread's confusing! Nothing to see here, move along...
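For what it's worth, the record-separator idea can still be tried when the data is already in an array. The sketch below is only an illustration under a few assumptions: it uses lines three to five from the question, the names `@lines`, `$blob` and `@records` are made up, and it sets the separator to " * " (with the surrounding spaces) rather than a bare "*" so that each record keeps its trailing space and the next one does not start with a stray space. It reads the joined string through an in-memory filehandle, which needs Perl 5.8 or later.

```perl
use strict;
use warnings;

# Lines three to five from the question; @lines is a hypothetical name.
my @lines = (
    "Microsoft Corporation - Internet Server Version ",
    "4.02.0720 * Microsoft Corporation - Internet Explorer ",
    "Version 5.00.2014.200 * ",
);

# Join the array into one string and open a read handle on that string.
my $blob = join '', @lines;
open my $fh, '<', \$blob or die "Cannot open in-memory handle: $!";

my @records;
{
    # $/ is $INPUT_RECORD_SEPARATOR under "use English".
    # Localised so the change does not leak into other code.
    local $/ = " * ";
    @records = <$fh>;    # each read stops right after the next " * "
}
close $fh;

print qq{"$_"\n} for @records;
```

Because the reads stop after the separator, every record comes back with its " * " still attached, which is the form the question asks for.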



    time was, I could move my arms like a bird and...