blackadder has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

While trying to improve my understanding of regexes, I have the following data from an HTML dump:
    1 adriaanf Europe Local _Default Different Owner For Target Machine HA050069 OYWVM1237 LN-CS/06 Technology XP DESKTOP Dell OptiPlex GX270 (0151) Pentium 4 (1 x 2793) 1023 38146 11147 N/A N/A
    OYWVM1237 Technology LN-OY/02 VIRTUAL (OYWVH161) VMWare VMWare For Desktop Not Defined (1 x 3065) 767 20473 8209 23/11/2004 11:10:02 N/A N/A N/A N/A N/A
    2 adriaanf Europe Local _Default Different Owner For Target Machine HA050069 OYWVM1262 LN-CS/06 Technology XP DESKTOP Dell OptiPlex GX270 (0151) Pentium 4 (1 x 2793) 1023 38146 11147 N/A N/A
    OYWVM1262 Technology LN-OY/06 VIRTUAL (OYWVH159) VMWare VMWare For Desktop Not Defined (1 x 3064) 767 20473 7800 07/12/2004 10:50:32 N/A N/A N/A N/A N/A
    5 adrianst Europe Local ER_LN_WAR Different Owner For Target Machine CW041698 OYWVM1263 LN-CW/04 Research XP DESKTOP Compaq Evo D510 (07E8h) Small Form Factor Pentium 4 (1 x 2259) 511 38154 10740 N/A N/A
    OYWVM1263 Technology LN-OY/02 VIRTUAL (OYWVH138) VMWare VMWare For Desktop Not Defined (1 x 3065) 767 20473 7788 06/12/2004 18:24:34 N/A N/A N/A N/A N/A
    6 adrianst Europe Local ER_LN_WAR Different Owner For Target Machine CW041698 OYWVM1230 LN-CW/04 Research XP DESKTOP Compaq Evo D510 (07E8h) Small Form Factor Pentium 4 (1 x 2259) 511 38154 10740 N/A N/A
    OYWVM1230 Technology LN-OY/06 VIRTUAL (OYWVH133) VMWare VMWare For Desktop Not Defined (1 x 3065) 767 20473 6921 06/12/2004 17:48:37 N/A N/A N/A N/A N/A

I need to extract this data record by record. Each record starts with a record number (1, 2, 3, ... 50000, etc.).

I am not sure how to do it, so I started with this code:

    #! c:/perl/bin/perl.exe -slw
    $|++;
    use strict;
    use vars qw/%data/;

    open (LST, "$ARGV[0]") or die "\n$0 Error => $^E\n";
    chomp (my @unclean = <LST>);
    print"size : $#unclean";

    for (@unclean) {
        if ($_ =~ /^\d+\s+/) {
            print "First 1 : $_\n";
            #print "Line 2
            #print "Line 3
            #print "Line 4
            print "____________________________________\n";
        }
    }

It does grab the first line of each record, but I am not sure how to write the regex so that it grabs all the lines up to the next record number, where a new record begins.

I have spent some time trying all sorts of different regex combinations, to no avail. I would appreciate it if any of you divine beings could inspire and guide me through this.

Thanks

UPDATE: The N/A values are empty fields reserved for dates. Records do not always end with N/A.
Blackadder

Replies are listed 'Best First'.
Re: Capturing Multiple lined data with regex.
by duff (Parson) on Dec 08, 2004 at 15:25 UTC

    I'm not so sure you should be doing all of the work inside of a regular expression. Here are two ways to do what you're after:

    Example #1


        open(my $fh, "<", $ARGV[0]) or die "\n$0 Error => $^E\n";
        my $data = do { local $/; <$fh> };
        close $fh;
        my @records = split /(?=\n\d+\s+)/, $data;
        # Now each item of @records has the lines you're looking for

    Example #2

        open(my $fh, "<", $ARGV[0]) or die "\n$0 Error => $^E\n";
        my (@records,$rec);
        while (<$fh>) {
            unless (/^\d+\s+/) { $rec .= $_; next }
            push @records, $rec if $rec;
            $rec = $_
        }
        push @records, $rec if $rec;
        close $fh;
        # Now each item of @records has the lines you're looking for

    Can you guess which example I like best? :-)

    You can use various print statements if you don't want to populate an array of course.
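
    To see what the split approach of Example #1 produces, here is a minimal sketch run on a shortened, made-up sample held in a string instead of a file. The pattern is a variation rather than the one above: it anchors with ^ under /m, so records do not begin with a stray newline. It assumes that no continuation line starts with a number followed by whitespace.

```perl
use strict;
use warnings;

# Hypothetical sample: two records, each spanning two lines.
my $data = join "\n",
    '1 adriaanf Europe Local',
    'OYWVM1237 Technology N/A',
    '2 adriaanf Europe Local',
    'OYWVM1262 Technology N/A',
    '';

# With /m, ^ matches at the start of every line. The lookahead consumes
# nothing, so each record keeps its leading record number.
my @records = split /^(?=\d+\s)/m, $data;

printf "%d records\n", scalar @records;
print "---\n$_" for @records;
```

    Each element of @records then holds one whole record, trailing newline included, so the usual per-record regexes can be applied to it afterwards.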

      Kool stuff,...Thanks a lot.

      But let me guess, Example #1. right?

      :-)
      Blackadder
Re: Capturing Multiple lined data with regex.
by conrad (Beadle) on Dec 08, 2004 at 15:30 UTC

    The best thing would be to set $/ to some kind of record separator such that your my @unclean = <LST> actually sucks in one record per array entry rather than one line ($/ is the input record separator, usually "\n", so <LST> reads in a line at a time). However, without more information it's difficult to tell what a good value would be.

    It looks as if every record is separated by a line consisting of "N/A N/A\n". If that's true, try setting $/ = "N/A N/A\n" and take a look at what you get in @unclean (more info in the perlvar docs). "\nN/A N/A\n" might be safer, since it matches a line containing exactly N/A N/A rather than one that merely ends in N/A N/A. If that works, you might also want to investigate the /s and /m modifiers on your regexp, which change how regexps treat newlines (more info in the perlop docs).
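
    As a rough sketch of the idea, the following reads a made-up sample through an in-memory filehandle with $/ set to "\nN/A N/A\n". The sample text and the assumption that every record ends with such a line are hypothetical; if the real data has no reliable separator line, this approach will not split it correctly.

```perl
use strict;
use warnings;

# Hypothetical sample in which every record ends with a line reading "N/A N/A".
my $text = join "\n",
    '1 adriaanf Europe Local',
    'OYWVM1237 Technology',
    'N/A N/A',
    '2 adriaanf Europe Local',
    'OYWVM1262 Technology',
    'N/A N/A',
    '';

# Read from an in-memory filehandle so the sketch needs no input file.
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "\nN/A N/A\n";   # input record separator, normally "\n"
    @records = <$fh>;           # one record per element, not one line
    chomp @records;             # chomp strips the current value of $/
}
close $fh;

# /m lets ^ match at the start of each line inside a multi-line record.
for my $rec (@records) {
    my ($number) = $rec =~ /^(\d+)\s/m;
    print "record $number:\n$rec\n---\n";
}
```

    Using local inside a block restores $/ afterwards, so later reads in the same program go back to line-at-a-time behaviour.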

    HTH…

      It all came flooding back to me! All them hours of studying! Reading this and this.

      Many thanks indeed
      Blackadder