Capturing Multiple lined data with regex.

blackadder has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

While trying to improve my understanding of regexes, I am working with the following data from an HTML dump:

1  adriaanf Europe Local    _Default Different Owner For Target Machine HA050069  OYWVM1237  LN-CS/06 Technology XP DESKTOP Dell OptiPlex GX270 (0151) Pentium 4 (1 x 2793) 1023 38146 11147 N/A  N/A
 OYWVM1237  Technology LN-OY/02 VIRTUAL (OYWVH161) VMWare VMWare For Desktop Not Defined (1 x 3065) 767 20473 8209 23/11/2004 11:10:02  N/A  N/A
 N/A N/A
2  adriaanf Europe Local    _Default Different Owner For Target Machine HA050069  OYWVM1262  LN-CS/06 Technology XP DESKTOP Dell OptiPlex GX270 (0151) Pentium 4 (1 x 2793) 1023 38146 11147 N/A  N/A
 OYWVM1262  Technology LN-OY/06 VIRTUAL (OYWVH159) VMWare VMWare For Desktop Not Defined (1 x 3064) 767 20473 7800 07/12/2004 10:50:32  N/A  N/A
 N/A N/A
5  adrianst Europe Local    ER_LN_WAR Different Owner For Target Machine CW041698  OYWVM1263  LN-CW/04 Research XP DESKTOP Compaq Evo D510 (07E8h) Small Form Factor Pentium 4 (1 x 2259) 511 38154 10740 N/A  N/A
 OYWVM1263  Technology LN-OY/02 VIRTUAL (OYWVH138) VMWare VMWare For Desktop Not Defined (1 x 3065) 767 20473 7788 06/12/2004 18:24:34  N/A  N/A
 N/A N/A
6  adrianst Europe Local    ER_LN_WAR Different Owner For Target Machine CW041698  OYWVM1230  LN-CW/04 Research XP DESKTOP Compaq Evo D510 (07E8h) Small Form Factor Pentium 4 (1 x 2259) 511 38154 10740 N/A  N/A
 OYWVM1230  Technology LN-OY/06 VIRTUAL (OYWVH133) VMWare VMWare For Desktop Not Defined (1 x 3065) 767 20473 6921 06/12/2004 17:48:37  N/A  N/A
 N/A N/A

I need to split that data into records. Each record starts with a record number (1, 2, 3, ... 50000, etc.).

I am not sure how to do it, so I started with this code:

#! c:/perl/bin/perl.exe -slw
$|++;
use strict;
use vars qw/%data/;

open (LST, "$ARGV[0]") or die "\n$0 Error => $^E\n";
chomp (my @unclean = <LST>);
print"size : $#unclean";
for (@unclean)
{
    if ($_ =~ /^\d+\s+/)
    {
        print "First 1 : $_\n";
        #print "Line 2
        #print "Line 3
        #print "Line 4
        print "____________________________________\n";
    }    
}

It does grab the first line of each record, but I am not sure how to write the regex so that it grabs every line up to the next record number, where a new record begins.

I have spent some time trying all sorts of regex combinations, to no avail. I would appreciate it if any of you divine beings could inspire and guide me through this.

Thanks

UPDATE: The 'N/A's are empty field values reserved for dates. Records do not always end with N/A.

Blackadder


Re: Capturing Multiple lined data with regex. by duff (Parson) on Dec 08, 2004 at 15:25 UTC
I'm not so sure you should be doing all of the work inside a regular expression. Here are two ways to do what you're after:

Example #1

open(my $fh, "<", $ARGV[0]) or die "\n$0 Error => $^E\n";
my $data = do { local $/; <$fh> };
close $fh;
my @records = split /(?=\n\d+\s+)/, $data;
# Now each item of @records has the lines you're looking for

Example #2

open(my $fh, "<", $ARGV[0]) or die "\n$0 Error => $^E\n";
my (@records,$rec);
while (<$fh>) {
    unless (/^\d+\s+/) { $rec .= $_; next }
    push @records, $rec if $rec;
    $rec = $_
}
push @records, $rec if $rec;
close $fh;
# Now each item of @records has the lines you're looking for

Can you guess which example I like best? :-) You can use various print statements if you don't want to populate an array, of course.

duff
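For anyone who wants to try the split approach without a data file, here is a minimal sketch. It is a variant of Example #1 rather than duff's exact code: it anchors the lookahead with `^` and the `/m` modifier instead of matching `\n`, so each record starts with its number rather than a leading newline. The sample data is abbreviated and made up to mimic the shape of the dump, and it assumes continuation lines never begin with a digit.

```perl
use strict;
use warnings;

# Abbreviated, hypothetical sample in the same shape as the dump above.
my $data = <<'END';
1  adriaanf Europe Local    _Default
 OYWVM1237  Technology LN-OY/02 VIRTUAL
 N/A N/A
2  adriaanf Europe Local    _Default
 OYWVM1262  Technology LN-OY/06 VIRTUAL
 N/A N/A
5  adrianst Europe Local    ER_LN_WAR
 OYWVM1263  Technology LN-OY/02 VIRTUAL
END

# Split before every line that starts with a record number.
# /m lets ^ match at the start of each line; the lookahead consumes
# nothing, so the number stays attached to its own record.
my @records = split /^(?=\d+\s)/m, $data;

for my $rec (@records) {
    my ($id)  = $rec =~ /^(\d+)/;      # record number at the start
    my @lines = split /\n/, $rec;      # the lines belonging to this record
    printf "record %s has %d line(s)\n", $id, scalar @lines;
}
```

The last sample record deliberately has no trailing N/A line, since the split depends only on where the next record number starts.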
Re^2: Capturing Multiple lined data with regex. by blackadder (Hermit) on Dec 08, 2004 at 22:00 UTC
Cool stuff... Thanks a lot. But let me guess: Example #1, right? :-)

Blackadder
Re: Capturing Multiple lined data with regex. by conrad (Beadle) on Dec 08, 2004 at 15:30 UTC
The best thing would be to set `$/` to some kind of record separator such that your `my @unclean = <LST>` actually sucks in one record per array entry rather than one line (`$/` is the input record separator, usually `"\n"`, so `<LST>` reads in a line at a time). However, without more information it's difficult to tell what a good value would be.

It looks as if every record is separated by a line consisting of `"N/A N/A\n"`; if that's true, then try setting `$/ = "N/A N/A\n"` (`"\nN/A N/A\n"` might be safer, since it matches a line containing exactly `N/A N/A`, not one that merely ends in `N/A N/A`) and taking a look at what you get in `@unclean` (more info in `perlvar` docs).

If that works then you might also want to investigate using `/s` and/or `/m` as modifiers on your regexp (these modify how regexps process newlines; more info in `perlop` docs).

HTH
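A rough sketch of the `$/` idea, reading from an in-memory filehandle so it runs on its own. The separator string and the abbreviated sample data are assumptions: the sketch only works if every record really is followed by a line of exactly ` N/A N/A`, which the UPDATE in the question suggests is not guaranteed for the real dump. The two regexes at the end show what `/m` and `/s` change once a whole record sits in one string.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample; each record ends with a " N/A N/A" line.
my $dump = join '',
    "1  adriaanf Europe Local\n",
    " OYWVM1237  Technology LN-OY/02\n",
    " N/A N/A\n",
    "2  adriaanf Europe Local\n",
    " OYWVM1262  Technology LN-OY/06\n",
    " N/A N/A\n";

# Open the string as a file so <$fh> behaves like reading the dump.
open(my $fh, '<', \$dump) or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "\n N/A N/A\n";    # assumed record separator
    while (my $rec = <$fh>) {
        chomp $rec;               # chomp strips the current $/ string
        push @records, $rec;
    }
}
close $fh;

my @summary;
for my $rec (@records) {
    my ($id, $user) = $rec =~ /^(\d+)\s+(\S+)/;
    # /m: ^ also matches right after an embedded newline
    my ($machine) = $rec =~ /^\s+(\S+)/m;
    # /s: . also matches newlines, so one pattern can span lines
    my ($site) = $rec =~ /^\d+.*\bLN-OY\/(\d+)/s;
    push @summary, "$id $user $machine $site";
}
print "$_\n" for @summary;
```

With this sample, `@records` holds two multi-line strings, one per record, and each summary line combines fields taken from different lines of the same record.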
Re^2: Capturing Multiple lined data with regex. by blackadder (Hermit) on Dec 08, 2004 at 21:50 UTC
It all came flooding back to me! All those hours of studying! Reading this and this. Many thanks indeed.

Blackadder