breaking a text file into a data structure -- best way?

punkish has asked for the wisdom of the Perl Monks concerning the following question:

Update0: My best buddy tells me this class of problems is called a "state machine." Googling for "Perl state machine" returns a bunch of hits that I am now in the process of digesting. In the meantime, I look to you for help.
Update1: Seems like http://www.perl.com/pub/a/2004/09/23/fsms.html might have the answer for me.

I have a longish text file like the one below. The gutter annotations (a>, b>, c>) are not part of the text file; they are only there to aid my question.

a> some random text
   ----------------
b> 
b> a few random
b> lines
b> 
b> of more 
b> random
b> 
b> text
   ****************
c> some more 
c> 
c> random
c> text
c> 
a> some random text
   ----------------
b> 
b> a few random
b> lines
b> 
b> of more 
b> random
b> 
b> text
   ****************
c> some more 
c> 
c> random
c> text
c>

I want to split the file into an array of hashes like so

@foo = (
    {
        a => 'some random text',
        b => 
'
a few random
lines

of more 
random

text
',
        c => 
'some more 

random
text
'
    },
    {
        a => 'some random text',
        b => 
'
a few random
lines

of more 
random

text
',
        c => 
'some more 

random
text
'
    },
    # .. and so on ..
);

In other words, each hash is made up of the snippet of text starting from the line that is followed by '--------------' up to, but not including, the next line that is followed by '--------------'.

I have two questions. One, how do I do the above? I was hitting my head against a wall the entire day yesterday, so I come to you today. I have nothing to show you because everything I did was wrong. My approach was mostly to start from the beginning and go to the end, trying to keep flags for when one hash element began and when it ended, and so on. Which brings me to my second question.

What is the canonical design pattern for such a problem? I come across such problems all the time, and I always slow down in trying to solve them. A pattern that is visible to the eye becomes very difficult to program. Yesterday I had another such problem which I managed to solve, if I may say so myself, rather innovatively. The text file looked like so

   bri    red    grn    blu
     0      0      0      0
     1      0      0      0
     2      0      0      0
..
    99      0      0      0
   100      0    255    255
   101      0    250    255
   102      0    246    255
..

The above had to be converted to

CLASS
    EXPRESSION ([pixel] >= 242 AND [pixel] <= 242
    STYLE
        COLOR 200 72 127
    END
END
CLASS
    EXPRESSION ([pixel] >= 175 AND [pixel] <= 175
    STYLE
        COLOR 191 236 0
    END
END
..

That is, group the brightness values by color triplets. After struggling with it for a while with the usual, line by line, flag as you go approach, I decided to turn the color triplets into keys of a hash. The problem was solved in a couple of lines, and elegantly. Here is the code for that

my %lut;

while (<INFILE>) {

    # remove newline from end & leading whitespace,
    # then split the row on whitespace
    chomp;
    s/^\s+//;
    my @r = split /\s+/;

    # skip the header row and any other row that is not four fields
    # starting with an integer brightness value
    next unless @r == 4 && $r[0] =~ /^\d+$/;

    # create a key in lut hash using rgb vals
    push @{$lut{"$r[1].$r[2].$r[3]"}}, $r[0];
}

while (my ($k, $v) = each %lut) {
    $k =~ s/\./ /g;     # replace . in hash key with space

    # sort the brightness values numerically (a plain sort compares
    # strings, so "100" would sort before "99") to get min/max values
    my @v = sort { $a <=> $b } @$v;
    
    print OUTFILE 
       "CLASS\n" . 
       "    EXPRESSION ([pixel] >= $v[0] AND [pixel] <= $v[$#v]\n" . 
       "    STYLE\n" . 
       "        COLOR $k\n" . 
       "    END\n" . 
       "END\n";
}

I was able to solve the above because of the uniqueness requirement; otherwise it would have been the usual slog. So, is there a generic approach to this? And is there a way I can validate the output, that is, ensure that the output is what I really want, given very long input text files?

--

when small people start casting long shadows, it is time to go to bed

Re: breaking a text file into a data structure -- best way? by sierpinski (Chaplain) on Apr 09, 2010 at 14:52 UTC
One of the many answers to #1 would be: read a line and store it, then read the next line and compare it. If it matches your ----- or ** marker, save the first line as your key. Then read the next two lines. If either of them is your ----- or ** marker, save the line before it as the next key, and the lines before that as the previous hash's values.

Another possible solution: start by reading the whole file into an array, one line per entry. Find the ----- lines, split one position before each of them, and use each section to create your hash.

These might not be the best ways, but they're the first couple that come to mind.
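The second suggestion (read everything into an array, then locate the dashed lines) could be sketched roughly as follows. This is a minimal sketch, not code posted in the thread: the helper name `split_records` is hypothetical, and it assumes that separators are runs of at least four `-` or `*` characters, optionally indented, and that the whole file fits in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of lines into a list of hash refs.
sub split_records {
    my @lines = @_;
    chomp @lines;

    # Indices of the dashed separator lines (assumed: 4+ dashes).
    my @dash = grep { $lines[$_] =~ /^\s*-{4,}\s*$/ } 0 .. $#lines;

    my @records;
    for my $i (0 .. $#dash) {
        my $title_idx = $dash[$i] - 1;     # the title sits just above the dashes
        next if $title_idx < 0;

        # The body runs up to the line before the next title, or to end of input.
        my $end  = $i < $#dash ? $dash[$i + 1] - 2 : $#lines;
        my @body = @lines[ $dash[$i] + 1 .. $end ];

        # Inside the body, the first line of asterisks separates 'b' from 'c'.
        my ($star) = grep { $body[$_] =~ /^\s*\*{4,}\s*$/ } 0 .. $#body;
        my (@b, @c);
        if (defined $star) {
            @b = @body[ 0 .. $star - 1 ];
            @c = @body[ $star + 1 .. $#body ];
        }
        else {
            @b = @body;
        }

        push @records, {
            a => $lines[$title_idx],
            b => join('', map { "$_\n" } @b),
            c => join('', map { "$_\n" } @c),
        };
    }
    return @records;
}
```

With an open filehandle `$fh`, something like `my @foo = split_records(<$fh>);` would build the array. Because the array is indexed, "the line before the dashes" is simply `$dash[$i] - 1`, which sidesteps the flag juggling at the cost of holding the whole file in memory.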
Re: breaking a text file into a data structure -- best way? by rubasov (Friar) on Apr 09, 2010 at 16:59 UTC
Let me try to help.

First, analyze the structure of your input and name its parts. This input consists of lines, and each line consists of three fields: a prefix, a separator, and a text field. Consecutive lines with the same prefix form a block, and consecutive blocks form a prefix alphabet.

Second, answer this question: is the input parsable line by line, or do you have to look around (at a certain point) to decide what is what? The former typically results in more efficient programs (but it is not possible for all types of input); the latter is generally easier to code, but requires holding more of your input in memory. (I chose the line-by-line approach, storing only the previous prefix in addition to the current line.)

Then constrain yourself to go through your input line by line and ask yourself: what are the states (or state transitions) that determine what I should do?

1. at the start of a new (consecutive) prefix-block
2. in the middle of a prefix-block
3. the prefix alphabet starting over

How do these states map to relations between lines? By comparing the prefix of the current line with that of the previous line. What is the tool to express these relations between the lines? Alphabetical comparison. The mapping is (cf. the previous listing):

1. `$prefix gt $prev_prefix`
2. `$prefix eq $prev_prefix`
3. `$prefix lt $prev_prefix`

What should I do at each state transition?

1. add `$prefix => $text` to the current hash
2. append `$text` to the current `$hash{$prefix}`
3. push a new hash ref onto your array: `{ $prefix => $text }`

Now try to write it again and if you're stuck, come back and look at this: Read more... (989 Bytes)

Of course this is only one approach, but clarifying the concepts and thinking methodically about the mechanical way to solve a problem has always helped me. And in general: practice and practice more. Read books, read the code of others (don't just glance over it, but change it and understand it), read the problems of others and try to solve them without looking at the solutions posted by others.

Cheers
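The three transitions listed in this reply could be sketched as below. This is not the elided code from the post, only an illustration under the reply's own assumption that the `a>`, `b>`, `c>` prefixes are literally present in the input (the follow-up in this thread clarifies that they are not). The helper name `parse_prefixed` and the prefix pattern (one lowercase letter followed by `>`) are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: parse lines that carry a literal gutter prefix
# such as "a> ", "b> ", "c> ". Lines without a prefix are ignored.
sub parse_prefixed {
    my @lines = @_;
    my @records;
    my $prev = '';                          # prefix of the previous prefixed line

    for my $line (@lines) {
        chomp(my $copy = $line);
        next unless $copy =~ /^([a-z])> ?(.*)$/;
        my ($prefix, $text) = ($1, $2);

        if (!@records or $prefix lt $prev) {
            # The prefix alphabet starts over: begin a new record.
            push @records, { $prefix => "$text\n" };
        }
        elsif ($prefix gt $prev) {
            # A new prefix-block starts inside the current record.
            $records[-1]{$prefix} = "$text\n";
        }
        else {
            # Same prefix as before: the current block continues.
            $records[-1]{$prefix} .= "$text\n";
        }
        $prev = $prefix;
    }
    return @records;
}
```

The only state carried between lines is `$prev`, which is what makes this a line-by-line parser. One limitation of comparing prefixes alone: two adjacent records that each contain only `a>` lines would be merged, since `eq` cannot tell them apart.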
Re^2: breaking a text file into a data structure -- best way? by punkish (Priest) on Apr 10, 2010 at 00:35 UTC
Thanks for the response, but you misunderstood my task. The 'a>', 'b>', 'c>' are not really present in the text file. I included them as "line numbers" to illustrate where I wanted the text split up. In the specific case I presented, the text is split up at the line before the line that starts with '------'.

In any case, I am curious about a general approach to such problems, and at first glance, it seems that a state machine approach would help me. However, I got stuck with that as well, especially since my splitting markers are not on the line where I want to split the text, but on the line after it.

--

when small people start casting long shadows, it is time to go to bed
Re^3: breaking a text file into a data structure -- best way? by rubasov (Friar) on Apr 10, 2010 at 03:42 UTC
"but you misunderstood my task"

Indeed. For the last few days I've been doing really stupid things, sorry.
Re: breaking a text file into a data structure -- best way? by ikegami (Patriarch) on Apr 10, 2010 at 04:18 UTC
my $hdr = <>;
<>;

my @part1;
my @part2;
my $part = \@part1;
while (<>) {
    if ($_ eq "----------------\n") {
        my $next_hdr = pop(@$part);
        process_rec($hdr, \@part1, \@part2);
        $hdr = $next_hdr;
        @part1 = ();
        @part2 = ();
        $part = \@part1;
    }
    elsif ($_ eq "****************\n") {
        $part = \@part2;
    }
    else {
        push @$part, $_;
    }
}

process_rec($hdr, \@part1, \@part2);
Re: breaking a text file into a data structure -- best way? by repellent (Priest) on Apr 10, 2010 at 05:56 UTC
Here's using the `until-eof-FILEHANDLE` technique:

my @foo;
my $next_a = <DATA>;
scalar(<DATA>);

until (eof(DATA))
{
    my $a = $next_a;

    my @b;
    while (my $line = <DATA>)
    {
        last if $line =~ /^[*]{16}$/;
        push @b, $line;
    }

    my $found_next;
    my @c;
    while (my $line = <DATA>)
    {
        last if $found_next = ($line =~ /^-{16}$/);
        push @c, $line;
    }

    $next_a = pop(@c) if $found_next;

    push @foo, {
        a => $a,
        b => join("" => @b),
        c => join("" => @c),
    };
}

use Data::Dumper;
print Dumper \@foo;

__END__
title 1
----------------
a few random
lines
of more
random
text
****************
some more
random
text
title 2
----------------
cow jumped over the moon
****************
corn on the cob
lobster thermidor
Re: breaking a text file into a data structure -- best way? by rubasov (Friar) on Apr 10, 2010 at 15:51 UTC
To make amends for my stupidity yesterday, here's another approach that peeks into the next line: Read more... (1397 Bytes)

p.s.: found japhy's much better implementation of the peeking: Peek.pm
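To illustrate the peeking idea, here is a minimal sketch. It is not the elided code from this reply and does not use Peek.pm. It simply holds back one line so that a title can be recognised by the dashed line that follows it. The helper name `read_records` and the separator patterns (at least four `-` or `*` characters, optionally indented) are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: read records from a filehandle with one line of lookahead.
sub read_records {
    my ($fh) = @_;
    my @records;
    my $field = 'b';                        # which hash key is currently being filled
    my $line  = <$fh>;                      # current line

    while (defined $line) {
        my $next = <$fh>;                   # peek at the following line

        if (defined $next and $next =~ /^\s*-{4,}\s*$/) {
            # The current line is a title because dashes follow it.
            chomp(my $title = $line);
            push @records, { a => $title, b => '', c => '' };
            $field = 'b';
            $next  = <$fh>;                 # the dashes are consumed, not stored
        }
        elsif ($line =~ /^\s*\*{4,}\s*$/) {
            $field = 'c';                   # asterisks switch from 'b' to 'c'
        }
        elsif (@records) {
            $records[-1]{$field} .= $line;  # ordinary text line
        }

        $line = $next;
    }
    return @records;
}
```

Only two lines are in memory at any time, so this should scale to very long files. Text that appears before the first title is silently dropped in this sketch; whether that is acceptable depends on the real input.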