awohld has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a daily log file with tens of millions of records like:
    ======================================================================
    Record: 9868943 Version: 2 Timestamp: Sat Feb 18 22:33:43 2006
    Primary (Reporting) ID: 240 Level: 2 Group: 1 Reg: no Event: 51748
    Keep: 1 ID-Node: 0x2017 Inverse: 25
    Secondary ID:
    Keep: 1 ID: 68 Inverse: 23
    Keep: 1 ID: 240 Inverse: 27
    Keep: 1 ID: 368 Inverse: 30
    ======================================================================
    Record: 9868944 Version: 2 Timestamp: Sat Feb 18 22:33:44 2006
    Primary (Reporting) ID: 67 Level: 9 Group: 0 Reg: no Event: 51749
    Keep: 1 ID-Node: 0xA087 Inverse: 55
    Secondary ID:
    Keep: 1 ID: 62 Inverse: 73
    ======================================================================

I need to get the "Primary (Reporting) ID", "Level", "Group", "ID-Node", and "Inverse" values. I also need to strip n-to-4 of the "Secondary ID" fields out of each record, which I will then insert into a DB.

I thought about using a regexp to replace all spaces with commas and the "===" lines with "\n", and writing the result to a temp file.

Then I'd iterate line by line over the temp file, splitting each CSV string into an array, and get my data elements that way, similar to what I saw at "Parsing multi-line records".

Does that sound like a good start, or should I be trying another method?

Replies are listed 'Best First'.
Re: Need Record Parsing Advice
by Samy_rio (Vicar) on Mar 02, 2006 at 09:04 UTC

    Hi, if I understood your question correctly, the following will help you.

        use strict;
        use warnings;

        local $/ = "\n=";
        while (my $line = <DATA>) {
            print "\n------------------------------\n";
            if ($line =~ m/Primary \(Reporting\) ID\s*:\s*((?:(?!Level).)*)Level\s*:\s*((?:(?!Group).)*)Group\s*:\s*((?:(?!Reg).)*)/si) {
                print "Primary (Reporting) ID : $1\nLevel : $2\nGroup : $3";
            }
            if ($line =~ m/ID-Node\s*:\s*((?:(?!Inverse).)*)Inverse\s*:\s*((?:(?!\n).)*)/si) {
                print "\nID-Node : $1\nInverse : $2\n";
            }
            print "\nSecondary ID :";
            while ($line =~ m/(?<!Primary \(Reporting\) )(?<!Secondary )ID\s*:\s*((?:(?!Inverse).)*)Inverse\s*:\s*((?:(?!\n).)*)/gsi) {
                print "\nID : $1\tInverse : $2";
            }
            print "\n------------------------------\n";
        }
        __DATA__
        ======================================================================
        Record: 9868943 Version: 2 Timestamp: Sat Feb 18 22:33:43 2006
        Primary (Reporting) ID: 240 Level: 2 Group: 1 Reg: no Event: 51748
        Keep: 1 ID-Node: 0x2017 Inverse: 25
        Secondary ID:
        Keep: 1 ID: 68 Inverse: 23
        Keep: 1 ID: 240 Inverse: 27
        Keep: 1 ID: 368 Inverse: 30
        ======================================================================
        Record: 9868944 Version: 2 Timestamp: Sat Feb 18 22:33:44 2006
        Primary (Reporting) ID: 67 Level: 9 Group: 0 Reg: no Event: 51749
        Keep: 1 ID-Node: 0xA087 Inverse: 55
        Secondary ID:
        Keep: 1 ID: 62 Inverse: 73

    Output is :

        ------------------------------
        Primary (Reporting) ID : 240
        Level : 2
        Group : 1
        ID-Node : 0x2017
        Inverse : 25
        Secondary ID :
        ID : 68    Inverse : 23
        ID : 240   Inverse : 27
        ID : 368   Inverse : 30
        ------------------------------
        ------------------------------
        Primary (Reporting) ID : 67
        Level : 9
        Group : 0
        ID-Node : 0xA087
        Inverse : 55
        Secondary ID :
        ID : 62    Inverse : 73
        ------------------------------

    Regards,
    Velusamy R.


    eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';

      I got your sample working, but I had to modify the data outputs for the 0-to-N field. The problem is that I can't get the first line of the variable field to match correctly.

      Why are "Secondary Sector Information" and "Slot" showing up in the output, and how can I get them out?
          use strict;
          use warnings;

          local $/ = "\n=";
          while (my $line = <DATA>) {
              print "\n------------------------------\n";
              if ($line =~ m/Primary \(Reporting\) Cp\s*:\s*((?:(?!Set).)*)Set\s*:\s*((?:(?!Car).)*)Car\s*:\s*((?:(?!Ref).)*)/si) {
                  print "Cp:$1 - Set:$2 - Car:$3";
              }
              if ($line =~ m/Phase\s*:\s*((?:(?!Strength).)*)Strength\s*:\s*((?:(?!\n).)*)/si) {
                  print "\nPhase:$1 - Strength:$2\n";
              }
              while ($line =~ m/(?<!Primary \(Reporting\) )(?<!Secondary Sector)Keep\s*:\s*((?:(?!offset).)*)offset\s*:\s*((?:(?!Strength).)*)Strength\s*:\s*((?:(?!Ref).)*)/gsi) {
                  print "Keep:$1 - offset:$2 - Strength:$3\n";
              }
              print "\n------------------------------\n";
          }
          __DATA__
          ====================================================================
          Record: 9851329 Version: 2 Timestamp: Sat Feb 11 22:39:43 2006
          Primary (Reporting) Cp: 113 Set: 2 Car: 1 Ref: yes Event: 9922
          Missing P:
          Keep: 1 Phase: 0x2fdf Strength: 24
          Secondary Sector Information:
          ====================================================================
          Record: 9851330 Version: 2 Timestamp: Sat Feb 11 22:39:43 2006
          Primary (Reporting) Cp: 115 Set: 1 Car: 2 Ref: yes Event: 9923
          Missing P:
          Keep: 1 Phase: 0x7d10 Strength: 31
          Secondary Sector Information:
          Slot 1: Keep: 1 offset: 391 Strength: 27 Ref: no
          Slot 2: Keep: 1 offset: 325 Strength: 38 Ref: no

      Here is what my incorrect output looks like:
          ------------------------------
          Cp:113 - Set:2 - Car:1
          Phase:0x2fdf - Strength:24
          ------------------------------
          ------------------------------
          Cp:115 - Set:1 - Car:2
          Phase:0x7d10 - Strength:31
          Keep:1 Phase: 0x7d10 Strength: 31
          Secondary Sector Information:
          Slot 1: Keep: 1 - offset:391 - Strength:27
          Keep:1 - offset:325 - Strength:38
          ------------------------------


      And it should look like this:

          ------------------------------
          Cp:113 - Set:2 - Car:1
          Phase:0x2fdf - Strength:24
          ------------------------------
          ------------------------------
          Cp:115 - Set:1 - Car:2
          Phase:0x7d10 - Strength:31
          Keep:1 Phase: 0x7d10 Strength: 31
          Keep:1 - Pn_offset:391 - Strength:27
          Keep:1 - Pn_offset:325 - Strength:38
          ------------------------------
Re: Need Record Parsing Advice
by zer (Deacon) on Mar 02, 2006 at 09:14 UTC
    You should only need regexes for this, with some joining and hashes.

    Search for the "Primary (Reporting) ID" tag, since its name is more than one word, as is "Secondary ID". Remove the spaces from those names.

    Split the input using the '===' lines as record delimiters. Split each record so that the spaces become newlines; each line should then be a pairing of name and value. Everything up to the first ':' is the name, and the remainder is the value.

    Throw it into a hash. Now you can pull out any value you need and write it to a file or whatever.
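
    The steps above can be sketched roughly as follows. This is only a minimal illustration, assuming the record layout shown in the original post; the sub name parse_record and the hash keys (PrimaryID, Secondary) are made up for the example. Because Keep, ID and Inverse repeat under "Secondary ID", the sketch keeps those entries in an array instead of letting them overwrite each other in the hash.

```perl
use strict;
use warnings;

# Hypothetical helper (name and hash keys are illustrative): turn the text
# of one record into a hash reference.
sub parse_record {
    my ($text) = @_;
    my %rec;

    # Collapse the multi-word tag so every tag is a single token.
    $text =~ s/Primary \(Reporting\) ID:/PrimaryID:/;

    # Separate the fixed part from the repeating "Secondary ID" part.
    my ($head, $tail) = split /Secondary ID:/, $text, 2;

    # The name is everything up to the ':', the value is the next token.
    # Only the first word of a multi-word value (e.g. Timestamp) is kept.
    while ($head =~ /([A-Za-z][\w-]*):\s*(\S+)/g) {
        $rec{$1} = $2;
    }

    # Collect each secondary ID/Inverse pair, if the record has any.
    $rec{Secondary} = [];
    while (defined $tail && $tail =~ /\bID:\s*(\d+)\s+Inverse:\s*(\d+)/g) {
        push @{ $rec{Secondary} }, { ID => $1, Inverse => $2 };
    }
    return \%rec;
}

# Small in-memory sample (separator shortened for the example).
my $log = join "\n",
    '==========',
    'Record: 9868944 Version: 2 Timestamp: Sat Feb 18 22:33:44 2006',
    'Primary (Reporting) ID: 67 Level: 9 Group: 0 Reg: no Event: 51749',
    'Keep: 1 ID-Node: 0xA087 Inverse: 55',
    'Secondary ID:',
    'Keep: 1 ID: 62 Inverse: 73',
    '==========',
    '';

# Split on the "===" separator lines and skip empty chunks.
for my $chunk (split /^=+\n/m, $log) {
    next unless $chunk =~ /\S/;
    my $r = parse_record($chunk);
    print join(',', @{$r}{qw(PrimaryID Level Group ID-Node Inverse)}), "\n";
    print "  secondary ID $_->{ID}, inverse $_->{Inverse}\n"
        for @{ $r->{Secondary} };
}
```

    For a file with tens of millions of records it would probably be better to read one record at a time (for instance by setting $/ as in the reply above) than to hold the whole log in a string as this sample does.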

Re: Need Record Parsing Advice
by unobe (Scribe) on Mar 03, 2006 at 08:30 UTC
    I'd use Tie::File and set the record separator. Then use some multiline regexes. I tested your data out on the following code, and it worked like a charm.
        use strict;
        use warnings;
        use Tie::File;

        my ($record, @array, %data) = (0);
        my $file = 'whatever';
        tie @array, 'Tie::File', $file, recsep => '=' x 30 . "\n";

        while ($array[$record]) {
            $record++, next unless $array[$record] =~ m/^[^=]/;
            @{$data{$record}}{'primary', 'level', 'group', 'id_node', 'inverse', 'secondaries'} =
                $array[$record] =~ m/# get primary id, level, and group
                    Primary[^:]+ \D+(\d+) \D+(\d+) \D+(\d+)
                    # then the id-node and inverse
                    (?:[^:]+:){4} \D+(\w+) \D+(\d+)
                    # skip 'Secondary ID' line
                    [^K]+
                    # and grab the rest until end of record
                    ([^=]+)
                /gxms;
            $record++;
        }

        for my $rec (keys %data) {
            for (keys %{$data{$rec}}) {
                print $_ . ' is set to ' . $data{$rec}{$_} . "\n";
            }
        }