Dismas has asked for the wisdom of the Perl Monks concerning the following question:

Dear Geniuses, Gurus, Wizards and other Wise Ones....

I'm trying to process some data lines that don't always follow the rules. Sometimes one line is split into two, sometimes two lines are "conjoined." Here's a sample:

1. "Microsoft Corporation - DirectShow "
2. "Version 6.4.05.0809 * "
3. "Microsoft Corporation - Internet Server Version "
4. "4.02.0720 * Microsoft Corporation - Internet Explorer "
5. "Version 5.00.2014.200 * "
6. "Microsoft Corporation - Windows Installer - Version 2.0.2 * "
7. "Excel Viewer Version 8.0 * Connectivity Version 2.10.2309 * "

I have code which handles the first two lines (split), and code which handles the last line (conjoined). Where I'm having trouble is with lines three, four, and five. Line three is split, its tail is spliced to the front of line four, which is then split, with its tail as line five. IOW, line four contains the tail of line three and the head of line five.

Does anyone know of a data parsing module that could make sense of this jumble? The required output for the above lines would be:

1. "Microsoft Corporation - DirectShow Version 6.4.05.0809 * "
2. "Microsoft Corporation - Internet Server Version 4.02.0720 * "
3. "Microsoft Corporation - Internet Explorer Version 5.00.2014.200 * "
4. "Microsoft Corporation - Windows Installer - Version 2.0.2 * "
5. "Excel Viewer Version 8.0 * "
6. "Connectivity Version 2.10.2309 * "

But what I actually end up with is:

1. "Microsoft Corporation - DirectShow Version 6.4.05.0809 * "
2. "Microsoft Corporation - Internet Server Version 4.02.0720 * "
3. "Microsoft Corporation - Internet Explorer "
4. "Version 5.00.2014.200 * "
5. "Microsoft Corporation - Windows Installer - Version 2.0.2 * "
6. "Excel Viewer Version 8.0 * "
7. "Connectivity Version 2.10.2309 * "

As you can see, the signal value for the ACTUAL end of a line is " * ". I can't change the code that generates the data.

Thanks!

Edit by castaway - Retitle from "Data Parsing"


Replies are listed 'Best First'.
Re: Need help parsing ambiguously formatted data
by dragonchild (Archbishop) on Dec 01, 2004 at 20:50 UTC
    Why not change $/ (not $\ ... I mix them up) to be *, so that instead of reading a line terminated by \n, Perl reads a record terminated by *? Then you remove the \n and print it back out ...
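A minimal sketch of the record-separator idea, under some assumptions: the data can be read through a filehandle (an in-memory one stands in for the real source here), every physical line ends with a space before the newline as in the quoted sample, and the separator is the full " * " rather than a bare *. The names $text, $fh and @records are illustrative only.

```perl
use strict;
use warnings;

# Sample input, rebuilt from the lines quoted in the question.
# Assumption: each physical line ends with one space, then a newline.
my $text = join "\n",
    'Microsoft Corporation - DirectShow ',
    'Version 6.4.05.0809 * ',
    'Microsoft Corporation - Internet Server Version ',
    '4.02.0720 * Microsoft Corporation - Internet Explorer ',
    'Version 5.00.2014.200 * ',
    'Microsoft Corporation - Windows Installer - Version 2.0.2 * ',
    'Excel Viewer Version 8.0 * Connectivity Version 2.10.2309 * ',
    '';

# In-memory filehandle standing in for the real data source.
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

my @records;
{
    local $/ = ' * ';                 # input record separator: a record ends at " * "
    while ( my $rec = <$fh> ) {
        $rec =~ tr/\n//d;             # drop the physical line breaks
        push @records, $rec if length $rec;   # skip the empty leftover at EOF
    }
}
close $fh;

print "$_\n" for @records;
```

Using " * " rather than "*" as the separator keeps the trailing space with each record, so nothing needs trimming afterwards.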

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      Thanks Dragonchild.

      But I misled you--I get the data as an array--as I mention in my corrected post, which should be appearing any minute now.

      Maybe I should think of revising how I read the data, redefining $/ (or $\) as you suggested....?

      Anyway, thanks for the thought--I'll look at it.

      Thanks again!

      Dismas

        my @records = map { $_ . ' *' } split /\*/, join '', @messy_array;
        chop $records[-1];

        Presto!

        Update: ***Poof*** It's a dud. ;) See my update down below...


        Dave

        You could redefine how you read the data in. Or, you could

        use IO::Scalar;

        sub rework_data {
            my $x  = join '', @_;
            my $fh = IO::Scalar->new( \$x );
            my @rebuilt_data;
            {
                local $/ = '*';    # $/ is the input record separator
                @rebuilt_data = <$fh>;
            }
            return @rebuilt_data;
        }

        I love treating arrays as filehandles. :-)

        (It's $/, by the way: $/ is the input record separator, while $\ is the output record separator that print appends. *grins*)


Re: Need help parsing ambiguously formatted data
by duff (Parson) on Dec 01, 2004 at 22:04 UTC

    I see you've already had some similar answers, but here's how I'd do it assuming I understand you correctly:

    #!/usr/bin/perl

    use strict;
    use warnings;

    chomp(my @lines = <DATA>);
    @lines = split /(?<= [*] )/, join '', @lines;   # important bit here :-)
    print map { "$_\n" } @lines;

    # Note: every DATA line below ends with one trailing space, as in the question's sample.
    __DATA__
    Microsoft Corporation - DirectShow 
    Version 6.4.05.0809 * 
    Microsoft Corporation - Internet Server Version 
    4.02.0720 * Microsoft Corporation - Internet Explorer 
    Version 5.00.2014.200 * 
    Microsoft Corporation - Windows Installer - Version 2.0.2 * 
    Excel Viewer Version 8.0 * Connectivity Version 2.10.2309 * 
Re: Need help parsing ambiguously formatted data
by ikegami (Patriarch) on Dec 01, 2004 at 21:36 UTC

    How much data is there? Can you create one string from the array?

    @lines = join('', @lines) =~ /((?:(?!\s\*\s).)*\s\*\s)/g;
    or, if you don't mind losing the stars,
    @lines = split(/\s\*\s/, join('', @lines));

    Test case
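A minimal sketch of how both variants behave on the sample from the question. This is not ikegami's own test: the array name @messy and the literal sample lines are assumptions taken from the original post, with each element ending in one space as quoted there.

```perl
use strict;
use warnings;

# The seven messy lines from the question, trailing spaces included.
my @messy = (
    'Microsoft Corporation - DirectShow ',
    'Version 6.4.05.0809 * ',
    'Microsoft Corporation - Internet Server Version ',
    '4.02.0720 * Microsoft Corporation - Internet Explorer ',
    'Version 5.00.2014.200 * ',
    'Microsoft Corporation - Windows Installer - Version 2.0.2 * ',
    'Excel Viewer Version 8.0 * Connectivity Version 2.10.2309 * ',
);

my $joined = join '', @messy;

# Variant 1: match everything up to and including each " * " terminator.
my @with_stars = $joined =~ /((?:(?!\s\*\s).)*\s\*\s)/g;

# Variant 2: split on the terminator, which discards it.
my @without_stars = split /\s\*\s/, $joined;

print "With stars:\n";
print qq{  "$_"\n} for @with_stars;
print "Without stars:\n";
print qq{  "$_"\n} for @without_stars;
```

Both variants should yield six records for this sample; they differ only in whether the " * " terminator is kept.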

Parsing, corrected (see ** . . . **)
by Dismas (Acolyte) on Dec 01, 2004 at 20:55 UTC
    Dear Geniuses, Gurus, Wizards and other Wise Ones....

    I'm trying to process **an array of** data lines that don't always follow the rules. Sometimes one line is split into two, sometimes two lines are "conjoined." Here's a sample:

    1. "Microsoft Corporation - DirectShow "
    2. "Version 6.4.05.0809 * "
    3. "Microsoft Corporation - Internet Server Version "
    4. "4.02.0720 * Microsoft Corporation - Internet Explorer "
    5. "Version 5.00.2014.200 * "
    6. "Microsoft Corporation - Windows Installer - Version 2.0.2 * "
    7. "Excel Viewer Version 8.0 * Connectivity Version 2.10.2309 * "

    I have code which handles the first two lines (split), and code which handles the last line (conjoined). Where I'm having trouble is with lines three, four, and five. Line three is split, its tail is spliced to the front of line four, which is then split, with its tail as line five. IOW, line four contains the tail of line three and the head of line five.

    Does anyone know of a data parsing module that could make sense of this jumble? The required output for the above lines would be:

    1. "Microsoft Corporation - DirectShow Version 6.4.05.0809 * "
    2. "Microsoft Corporation - Internet Server Version 4.02.0720 * "
    3. "Microsoft Corporation - Internet Explorer Version 5.00.2014.200 * "
    4. "Microsoft Corporation - Windows Installer - Version 2.0.2 * "
    5. "Excel Viewer Version 8.0 * "
    6. "Connectivity Version 2.10.2309 * "

    But what I actually end up with is:

    1. "Microsoft Corporation - DirectShow Version 6.4.05.0809 * "
    2. "Microsoft Corporation - Internet Server Version 4.02.0720 * "
    3. "Microsoft Corporation - Internet Explorer "
    4. "Version 5.00.2014.200 * "
    5. "Microsoft Corporation - Windows Installer - Version 2.0.2 * "
    6. "Excel Viewer Version 8.0 * "
    7. "Connectivity Version 2.10.2309 * "

    As you can see, the signal value for the ACTUAL end of a line is " * ". I can't change the code that generates the data.

    Thanks!
      have you tried setting $INPUT_RECORD_SEPARATOR to "*" ?

      edit: wow, this thread's confusing! Nothing to see here, move along...



      time was, I could move my arms like a bird and...
Re: Need help parsing ambiguously formatted data
by Anonymous Monk on Dec 01, 2004 at 21:34 UTC

    You can change $/ or you can use split or a regex (after joining the strings, that is, if the data isn't too big).

Re: Need help parsing ambiguously formatted data
by periapt (Hermit) on Dec 02, 2004 at 13:38 UTC
    Alternately, if you are constrained to read in just a few lines at a time:

    use strict;
    use warnings;
    use diagnostics;

    my @parsedline = ();
    my $dataline   = '';
    while (<DATA>) {
        chomp;
        $dataline .= $_;
        if (/\s*[*]\s*$/) {
            @parsedline = split /\s*[*]\s*/, $dataline;
            print join("\n", @parsedline), "\n";
            @parsedline = ();
            $dataline   = '';
        }
    }
    exit;

    # Note: every DATA line below ends with one trailing space, as in the question's sample.
    __DATA__
    Microsoft Corporation - DirectShow 
    Version 6.4.05.0809 * 
    Microsoft Corporation - Internet Server Version 
    4.02.0720 * Microsoft Corporation - Internet Explorer 
    Version 5.00.2014.200 * 
    Microsoft Corporation - Windows Installer - Version 2.0.2 * 
    Excel Viewer Version 8.0 * Connectivity Version 2.10.2309 * 
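The split above drops the " * " terminators, while the required output in the question keeps them. A possible variation on the same line-at-a-time idea, sketched under the assumption that every record ends with exactly " * " (space, star, space); make_record_reader and @messy are hypothetical names introduced for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that takes one physical line at a
# time and returns whatever records that line completes (possibly none).
sub make_record_reader {
    my $buffer = '';
    return sub {
        my ($line) = @_;
        $line =~ s/[\r\n]+\z//;      # strip a trailing newline, if any
        $buffer .= $line;
        my @done;
        # Peel off each complete record, terminator included.
        push @done, $1 while $buffer =~ s/\A(.*? \* )//;
        return @done;
    };
}

# Sample lines from the question, trailing spaces included.
my @messy = (
    'Microsoft Corporation - DirectShow ',
    'Version 6.4.05.0809 * ',
    'Microsoft Corporation - Internet Server Version ',
    '4.02.0720 * Microsoft Corporation - Internet Explorer ',
    'Version 5.00.2014.200 * ',
    'Microsoft Corporation - Windows Installer - Version 2.0.2 * ',
    'Excel Viewer Version 8.0 * Connectivity Version 2.10.2309 * ',
);

my $feed    = make_record_reader();
my @records = map { $feed->($_) } @messy;

print "$_\n" for @records;
```

Only the unfinished tail of the current record is held in the buffer, so memory use stays small even when the input is long.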

    PJ
    use strict; use warnings; use diagnostics;