Need Record Parsing Advice

awohld has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a daily log file with tens of millions of records like:

======================================================================
Record: 9868943   Version:  2   Timestamp: Sat Feb 18 22:33:43 2006
Primary (Reporting) ID: 240  Level: 2 Group: 1  Reg: no  Event:  51748
Keep: 1  ID-Node: 0x2017  Inverse: 25
Secondary ID:
Keep: 1   ID:  68   Inverse: 23   
Keep: 1   ID: 240   Inverse: 27   
Keep: 1   ID: 368   Inverse: 30   
======================================================================
Record: 9868944   Version:  2   Timestamp: Sat Feb 18 22:33:44 2006
Primary (Reporting) ID: 67  Level: 9 Group: 0  Reg: no  Event:  51749
Keep: 1  ID-Node: 0xA087  Inverse: 55
Secondary ID:
Keep: 1   ID:  62   Inverse: 73   
   
======================================================================

I need to get the "Primary (Reporting) ID", "Level", "Group", "ID-Node", and "Inverse" values. I also need to extract n-to-4 of the "Secondary ID" fields from each record, which I will then insert into a DB.

I thought about using a regexp to replace all spaces with a comma and the "===" lines with a "\n" and write it to a temp file.

Then I'd iterate line by line over the temp file, splitting each CSV string into an array and picking my data elements out of it, similar to what I saw at Parsing multi-line records.

Does that sound like a good start or should I be trying another method?
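
For comparison, here is a minimal sketch of reading the log one record at a time without a temp file. It assumes every record is bounded by a line made up only of "=" characters, as in the sample above. The helper name each_record and the callback interface are hypothetical, not something from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $callback once per record with an array ref
# holding that record's non-blank lines. Only one record is kept in memory.
sub each_record {
    my ($fh, $callback) = @_;
    my @lines;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^=+\s*$/) {        # a separator line closes the record
            $callback->([@lines]) if @lines;
            @lines = ();
        }
        elsif ($line =~ /\S/) {          # skip blank or whitespace-only lines
            push @lines, $line;
        }
    }
    $callback->([@lines]) if @lines;     # last record with no trailing separator
    return;
}

# Usage on a shortened in-memory copy of the sample data.
my $sample = <<'END';
======================================================================
Record: 9868943   Version:  2   Timestamp: Sat Feb 18 22:33:43 2006
Keep: 1   ID:  68   Inverse: 23
======================================================================
Record: 9868944   Version:  2   Timestamp: Sat Feb 18 22:33:44 2006
Keep: 1   ID:  62   Inverse: 73
======================================================================
END

open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my $count = 0;
each_record($fh, sub { $count++ });
print "$count records\n";
```

Because the callback sees one record at a time, memory use stays flat regardless of how many records the daily file holds; the field extraction and the DB insert would go inside the callback.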

Re: Need Record Parsing Advice by Samy_rio (Vicar) on Mar 02, 2006 at 09:04 UTC
Hi,

If I understood your question correctly, the following will help you.

use strict;
use warnings;

local $/ = "\n=";
while (my $line = <DATA>) {
    print "\n------------------------------\n";
    if ($line =~ m/Primary \(Reporting\) ID\s*:\s*((?:(?!Level).)*)Level\s*:\s*((?:(?!Group).)*)Group\s*:\s*((?:(?!Reg).)*)/si) {
        print "Primary (Reporting) ID : $1\nLevel : $2\nGroup : $3";
    }
    if ($line =~ m/ID-Node\s*:\s*((?:(?!Inverse).)*)Inverse\s*:\s*((?:(?!\n).)*)/si) {
        print "\nID-Node : $1\nInverse : $2\n";
    }
    print "\nSecondary ID :";
    while ($line =~ m/(?<!Primary \(Reporting\) )(?<!Secondary )ID\s*:\s*((?:(?!Inverse).)*)Inverse\s*:\s*((?:(?!\n).)*)/gsi) {
        print "\nID : $1\tInverse : $2";
    }
    print "\n------------------------------\n";
}
__DATA__
======================================================================
Record: 9868943   Version:  2   Timestamp: Sat Feb 18 22:33:43 2006
Primary (Reporting) ID: 240  Level: 2 Group: 1  Reg: no  Event:  51748
Keep: 1  ID-Node: 0x2017  Inverse: 25
Secondary ID:
Keep: 1   ID:  68   Inverse: 23
Keep: 1   ID: 240   Inverse: 27
Keep: 1   ID: 368   Inverse: 30
======================================================================
Record: 9868944   Version:  2   Timestamp: Sat Feb 18 22:33:44 2006
Primary (Reporting) ID: 67  Level: 9 Group: 0  Reg: no  Event:  51749
Keep: 1  ID-Node: 0xA087  Inverse: 55
Secondary ID:
Keep: 1   ID:  62   Inverse: 73

Output is:

------------------------------
Primary (Reporting) ID : 240
Level : 2
Group : 1
ID-Node : 0x2017
Inverse : 25

Secondary ID :
ID : 68    Inverse : 23
ID : 240   Inverse : 27
ID : 368   Inverse : 30
------------------------------

------------------------------
Primary (Reporting) ID : 67
Level : 9
Group : 0
ID-Node : 0xA087
Inverse : 55

Secondary ID :
ID : 62    Inverse : 73
------------------------------

Regards,
Velusamy R.

eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@\|6%,53!-9@2~j';
Re^2: Need Record Parsing Advice by awohld (Hermit) on Mar 04, 2006 at 07:01 UTC
I got your sample working, but I had to modify the data outputs for the 0-to-N field. The problem is that I can't get the first line of the variable field to match correctly. Why are "Secondary Sector Information" and "Slot" showing up in the output, and how can I get them out?

use strict;
use warnings;

local $/ = "\n=";
while (my $line = <DATA>) {
    print "\n------------------------------\n";
    if ($line =~ m/Primary \(Reporting\) Cp\s*:\s*((?:(?!Set).)*)Set\s*:\s*((?:(?!Car).)*)Car\s*:\s*((?:(?!Ref).)*)/si) {
        print "Cp:$1 - Set:$2 - Car:$3";
    }
    if ($line =~ m/Phase\s*:\s*((?:(?!Strength).)*)Strength\s*:\s*((?:(?!\n).)*)/si) {
        print "\nPhase:$1 - Strength:$2\n";
    }
    while ($line =~ m/(?<!Primary \(Reporting\) )(?<!Secondary Sector)Keep\s*:\s*((?:(?!offset).)*)offset\s*:\s*((?:(?!Strength).)*)Strength\s*:\s*((?:(?!Ref).)*)/gsi) {
        print "Keep:$1 - offset:$2 - Strength:$3\n";
    }
    print "\n------------------------------\n";
}
__DATA__
====================================================================
Record: 9851329 Version: 2 Timestamp: Sat Feb 11 22:39:43 2006
Primary (Reporting) Cp: 113 Set: 2 Car: 1 Ref: yes Event: 9922
Missing P:
Keep: 1 Phase: 0x2fdf Strength: 24
Secondary Sector Information:
====================================================================
Record: 9851330 Version: 2 Timestamp: Sat Feb 11 22:39:43 2006
Primary (Reporting) Cp: 115 Set: 1 Car: 2 Ref: yes Event: 9923
Missing P:
Keep: 1 Phase: 0x7d10 Strength: 31
Secondary Sector Information:
Slot 1: Keep: 1 offset: 391 Strength: 27 Ref: no
Slot 2: Keep: 1 offset: 325 Strength: 38 Ref: no

Here is what my incorrect output looks like:

------------------------------
Cp:113 - Set:2 - Car:1
Phase:0x2fdf - Strength:24

------------------------------

------------------------------
Cp:115 - Set:1 - Car:2
Phase:0x7d10 - Strength:31
Keep:1 Phase: 0x7d10 Strength: 31
Secondary Sector Information:
Slot 1: Keep: 1 - offset:391 - Strength:27
Keep:1 - offset:325 - Strength:38

------------------------------

And it should look like this:

------------------------------
Cp:113 - Set:2 - Car:1
Phase:0x2fdf - Strength:24

------------------------------

------------------------------
Cp:115 - Set:1 - Car:2
Phase:0x7d10 - Strength:31
Keep:1 Phase: 0x7d10 Strength: 31
Keep:1 - Pn_offset:391 - Strength:27
Keep:1 - Pn_offset:325 - Strength:38

------------------------------
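
A likely explanation, offered as a reading of the code rather than a confirmed diagnosis: the last pattern starts matching at the first "Keep" in the record, which is the one in front of "Phase". With the /s modifier the dot also matches newlines, so the first capture keeps running across lines until it reaches the first "offset" and swallows "Secondary Sector Information:" and "Slot 1: Keep: 1" on the way. The sketch below anchors each match on "Slot N:" and captures digits only. The helper name slot_fields is hypothetical, and the line breaks in the sample text are an assumption about how the real log is laid out.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one hash ref per "Slot N:" entry found in
# the text of a single record. \s+ and \s* also match newlines, so the
# pattern does not depend on where the log breaks its lines.
sub slot_fields {
    my ($record) = @_;
    my @slots;
    while ($record =~ /Slot\s+(\d+):\s*Keep:\s*(\d+)\s+offset:\s*(\d+)\s+Strength:\s*(\d+)/g) {
        push @slots, { slot => $1, keep => $2, offset => $3, strength => $4 };
    }
    return @slots;
}

# Sample record; the line layout is assumed.
my $record = <<'END';
Record: 9851330 Version: 2 Timestamp: Sat Feb 11 22:39:43 2006
Primary (Reporting) Cp: 115 Set: 1 Car: 2 Ref: yes Event: 9923
Missing P:
Keep: 1 Phase: 0x7d10 Strength: 31
Secondary Sector Information:
Slot 1: Keep: 1 offset: 391 Strength: 27 Ref: no
Slot 2: Keep: 1 offset: 325 Strength: 38 Ref: no
END

for my $s (slot_fields($record)) {
    print "Keep:$s->{keep} - offset:$s->{offset} - Strength:$s->{strength}\n";
}
```

Because the captures are restricted to \d+, the "Keep: 1 Phase: ..." line can no longer be picked up by this loop; it would still be reported by the separate Phase/Strength match.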
Re: Need Record Parsing Advice by zer (Deacon) on Mar 02, 2006 at 09:14 UTC
You should only need regexes for this, plus some joining and hashes. Search for the name of the primary reporting field, since it has a two-word tag, as does the secondary one, and remove the space. Split the list into records using the '---' lines as record delimiters. Then split each record so that every space becomes a newline; each line should now be a pairing of name and value. Search up to the first ':' and that is the name; the remainder is the value. Throw it into a hash. Now you can draw out any value you need and write it to a file or whatever.	[reply]
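
The name/value idea above can be sketched as follows. Instead of rewriting spaces into newlines first, this version scans a single line for "Name: value" pairs directly. The helper name line_to_hash is hypothetical, and the sketch assumes that names contain no ':' and values contain no spaces. That holds for the "Primary (Reporting) ID" line but not for the Timestamp field, and the repeated "Keep"/"ID"/"Inverse" lines would need one hash per line.

```perl
use strict;
use warnings;

# Hypothetical helper: turns one "Name: value  Name: value" line into a
# hash ref. Multi-word names such as "Primary (Reporting) ID" are kept whole.
sub line_to_hash {
    my ($line) = @_;
    my %field;
    while ($line =~ /([^:]+?):\s*(\S+)\s*/g) {
        my ($name, $value) = ($1, $2);
        $name =~ s/^\s+|\s+$//g;         # trim stray whitespace around the name
        $field{$name} = $value;
    }
    return \%field;
}

my $f = line_to_hash('Primary (Reporting) ID: 240  Level: 2 Group: 1  Reg: no  Event:  51748');
print "$_ => $f->{$_}\n" for sort keys %$f;
```

With the sample line this yields five keys: "Event", "Group", "Level", "Primary (Reporting) ID" and "Reg".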
Re: Need Record Parsing Advice by unobe (Scribe) on Mar 03, 2006 at 08:30 UTC
I'd use Tie::File and set the record separator. Then use some multiline regexes. I tested your data out on the following code, and it worked like a charm.

use strict;
use warnings;
use Tie::File;

my ($record, @array, %data) = (0);
my $file = 'whatever';
tie @array, 'Tie::File', $file, recsep => '=' x 30 . "\n";

while ($array[$record]) {
    $record++, next unless $array[$record] =~ m/^[^=]/;
    @{$data{$record}}{'primary', 'level', 'group', 'id_node', 'inverse', 'secondaries'}
        = $array[$record] =~ m/# get primary id, level, and group
            Primary[^:]+ \D+(\d+) \D+(\d+) \D+(\d+)
            # then the id-node and inverse
            (?:[^:]+:){4} \D+(\w+) \D+(\d+)
            # skip 'Secondary ID' line
            [^K]+
            # and grab the rest until end of record
            ([^=]+)
        /gxms;
    $record++;
}

for my $rec (keys %data) {
    for (keys %{$data{$rec}}) {
        print $_ . ' is set to ' . $data{$rec}{$_} . "\n";
    }
}