Boldra has asked for the wisdom of the Perl Monks concerning the following question:

Ok, first here's my data;
   record1
    field2 2345
   record2
   record3
    field1 GAGGA
    field2 7848
    field2a 5m
Each field is slightly indented, with a record always beginning at column 4.

What I'm trying to do is break this up into records, so I can process the fields. Here's the basic algorithm I'm playing with:
foreach $record ($formatted_data =~ m/(^ {3}\w.*)/mg) { &process_record($record); }
Which only returns the first line of a given record, as though my "." wasn't matching \n.

I gave up trying to split the data, because I need that first \w. I also abandoned using [\s\S] instead of ".", since that was too greedy, yet not greedy enough once I subdued it with a "?", thus: m/(^ {3}\w[\s\S]*?)/mg.
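One possible middle ground, sketched here under the assumption that the sample data above is held in a single string: a lazy quantifier at the very end of a pattern is satisfied by zero characters, so it needs something after it to stop at. A lookahead for the next record header (or the end of the string) can play that role. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Sample data from the question: records start with 3 spaces, fields with 4.
my $data = join '',
    "   record1\n",
    "    field2 2345\n",
    "   record2\n",
    "   record3\n",
    "    field1 GAGGA\n",
    "    field2 7848\n",
    "    field2a 5m\n";

# /m lets ^ match after each newline; /s lets . match a newline.
# The lazy .*? grows only until the lookahead sees the next record
# header or the end of the string, so each match is one whole record.
my @records = $data =~ m/(^ {3}\w.*?)(?=^ {3}\w|\z)/msg;

print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

The lookahead is zero-width, so the next header is left in place for the following match to consume.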

There must be a middle ground here somewhere... I can feel I'm close...

Replies are listed 'Best First'.
Re: Regex not greedy enough
by merlyn (Sage) on Nov 17, 2000 at 19:04 UTC
    @records = split /(^ {3}\w.*\n)/m, $input;
    should give you:
    "", " record1blah\n", " dataforrecord1\n moredataforrecord1\n", " record2blah\n", " dataforrecord2\n moredataforrecord2\n", ...
    You'll need to toss that first empty element... it's the part of the string leading up to your first record.
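A minimal sketch of how the list returned by that split might be paired up into records, assuming the sample data from the question. The @parts and @records names are illustrative, and the pairing loop is an addition, not part of the reply above.

```perl
use strict;
use warnings;

my $data = join '',
    "   record1\n",
    "    field2 2345\n",
    "   record2\n",
    "   record3\n",
    "    field1 GAGGA\n",
    "    field2 7848\n",
    "    field2a 5m\n";

# Because the separator pattern is in capturing parentheses, split
# returns the separators (the record header lines) between the pieces.
my @parts = split /(^ {3}\w.*\n)/m, $data;

# Drop the empty leading element (the text before the first header).
shift @parts;

# What remains alternates: header line, then the field lines under it.
my @records;
while (@parts) {
    my $header = shift @parts;
    my $body   = shift @parts;
    # split drops trailing empty fields, so a final header
    # without any field lines has no body element at all.
    $body = '' unless defined $body;
    push @records, [ $header, $body ];
}

for my $rec (@records) {
    print "HEADER: $rec->[0]";
    print "BODY:\n$rec->[1]" if length $rec->[1];
}
```

A record with no fields, such as record2 here, ends up with an empty body string.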

    -- Randal L. Schwartz, Perl hacker

      Now that is handy! I didn't know you could capture the split regex stuff that way.

      Thanks!

Re: Regex not greedy enough
by snax (Hermit) on Nov 17, 2000 at 18:16 UTC
Re: Regex not greedy enough
by japhy (Canon) on Nov 17, 2000 at 18:19 UTC
    Assuming you have the entire thing in the string, I would suggest split()ing with lookahead:
    # split RIGHT BEFORE a \n followed by 'record'
    @records = split /(?=\nrecord)/, $data;


    $monks{japhy}++ while $posting;
Re: Regex not greedy enough
by Boldra (Curate) on Nov 17, 2000 at 18:46 UTC
    Thanks for the comments;
      Snax - you're right, I was using the opposite switch to the one I meant, but I still have the same greed problems.
      Japhy - This is new to me and it looks like the kind of solution I was after. However ?= matches nothing as (?=^ {3}\w), and I can't use \n, since then I skip my first record.

    Any more ideas?
    BTW: the 'record1' string is actually the first field of the record; it could be anything beginning with a \w

    oh yeah, here's my test source:
    #!/usr/bin/perl -w
    use strict;
    my($infile,@records);
    while(<DATA>) {$infile.=$_;}
    @records = split(/(?=^ {3}\w)/,$infile); #returns whole list
    #@records = ($infile =~ m/(^ {3}\w.*?)/sg); #returns only up to \w
    print join("\n========\n",@records);
    __DATA__
       record1
        field2 2345
       record2
       record3
        field1 GAGGA
        field2 7848
        field2a 5m
      It won't match (?=^ANYTHING) at any place but the very beginning of the string unless you have the /m modifier on in the regex, which allows ^ to match after newlines.
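To illustrate the difference, here is a small sketch that runs the same lookahead split with and without /m, assuming the sample data from the original question. The array names are made up for the comparison.

```perl
use strict;
use warnings;

my $data = join '',
    "   record1\n",
    "    field2 2345\n",
    "   record2\n",
    "   record3\n",
    "    field1 GAGGA\n",
    "    field2 7848\n",
    "    field2a 5m\n";

# Without /m, ^ can only match at the very start of the string,
# so nothing is split off and the whole string comes back as one element.
my @without_m = split /(?=^ {3}\w)/, $data;

# With /m, ^ also matches after each newline, so the string is cut
# just before every line that starts with exactly 3 spaces and a \w.
my @with_m = split /(?=^ {3}\w)/m, $data;

print "without /m: ", scalar(@without_m), " element(s)\n";
print "with /m:    ", scalar(@with_m), " element(s)\n";
print join("========\n", @with_m);
```

The zero-width match at the start of the string does not produce an empty leading field, so there is nothing to discard afterwards.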

      Ohhhhhh. I didn't think you meant ALL the text was indented, I thought you meant the 'field' parts were. Well then, to make it work with such data:
      my $code;
      { local $/; $code = <DATA> }  # fast "slurping"
      @records = split /\n (?=\w)/, $code;
      for (@records) { print ">>$_<<\n"; }
      __DATA__
       japhy
        DALnet regular
        Regex Prince
       merlyn
        Perl Hacker
        O'Reilly Author
       Mark_Dominus
        IAQ Author
        ArrayHashMonster Creator


      japhy... Perl Hacker and Regex Prince
      Use japhy's suggestion and add a newline to your string:
      @records = split /(?=\nrecord)/, ("\n" . $data);
      ...that way you get the necessary first newline in the regex for the first record.

      Crude, but effective :)

      If you want to capture to the end of the line: in /m mode, ^ and $ anchor at the start and end of each line. /s only lets . match a newline; without /m, ^ matches only at the beginning of the string and $ only at the end (or just before a final newline). So maybe you did want /m. Of course, you can also just do something like these:
      while (<DATA>) {
          my ($key, $value) = /(\S+) (.*)/ or next;
          # or: my ($key, $value) = split;
          # or: (undef, $key, $value) = split(/\s+/, $_, 3);
          $hash{$key} = $value;
          # or: push(@{$hash{$key}}, $value);
      }
      Untested, but you might get some ideas from that.
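As a rough sketch of how that line-by-line idea could be applied to one already-split record (here record3 from the question, hard-coded for illustration), the fields can be collected into a hash of arrays. The $record and %fields names are hypothetical.

```perl
use strict;
use warnings;

# One record as produced by any of the splitting approaches in this thread.
my $record = join '',
    "   record3\n",
    "    field1 GAGGA\n",
    "    field2 7848\n",
    "    field2a 5m\n";

my %fields;
for my $line (split /\n/, $record) {
    # Skip lines that are not "name value" pairs, such as the header line.
    my ($key, $value) = $line =~ /^\s*(\S+)\s+(.*)/ or next;
    # A hash of arrays keeps every value if a field name repeats.
    push @{ $fields{$key} }, $value;
}

for my $key (sort keys %fields) {
    print "$key => @{ $fields{$key} }\n";
}
```

If field names never repeat within a record, a plain `$fields{$key} = $value` would be enough.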