
Update0: My best buddy tells me this class of problems is called a "State Machine." Googling for "Perl state machine" returns a bunch of hits that I am now in the process of digesting. In the meantime, I look to you for help.
Update1: Seems like http://www.perl.com/pub/a/2004/09/23/fsms.html might have the answer for me.

I have a longish text file like the one below. The gutter annotation is not part of the text file; it is there only to aid my question.

a> some random text
   ----------------
b> 
b> a few random
b> lines
b> 
b> of more 
b> random
b> 
b> text
   ****************
c> some more 
c> 
c> random
c> text
c> 
a> some random text
   ----------------
b> 
b> a few random
b> lines
b> 
b> of more 
b> random
b> 
b> text
   ****************
c> some more 
c> 
c> random
c> text
c>

I want to split the file into an array of hashes like so

@foo = (
    {
        a => 'some random text',
        b => 
'
a few random
lines

of more 
random

text
',
        c => 
'some more 

random
text
'
    },
    {
        a => 'some random text',
        b => 
'
a few random
lines

of more 
random

text
',
        c => 
'some more 

random
text
'
    },
    .. and so on ..
);

In other words, each hash is made up of the snippet of text starting from the line that is followed by '--------------' up to, but not including, the next line that is followed by '--------------'.

I have two questions. One, how do I do the above? I was hitting my head against a wall all day yesterday, so I come to you today. I have nothing to show you because everything I did was wrong. My approach was mostly to start from the beginning and go to the end, trying to keep flags on when one hash element began and when it ended, and so on. Which brings me to my second question.

What is the canonical design pattern for such a problem? I come across such problems all the time, and I always slow down in trying to solve them. A pattern that is visible to the eye becomes very difficult to program. Yesterday I had another such problem which I managed to solve, if I may say so myself, rather innovatively. The text file looked like so

   bri    red    grn    blu
     0      0      0      0
     1      0      0      0
     2      0      0      0
..
    99      0      0      0
   100      0    255    255
   101      0    250    255
   102      0    246    255
..

The above had to be converted to

CLASS
    EXPRESSION ([pixel] >= 242 AND [pixel] <= 242
    STYLE
        COLOR 200 72 127
    END
END
CLASS
    EXPRESSION ([pixel] >= 175 AND [pixel] <= 175
    STYLE
        COLOR 191 236 0
    END
END
..

That is, group the brightness values by color triplets. After struggling with it for a while with the usual, line by line, flag as you go approach, I decided to turn the color triplets into keys of a hash. The problem was solved in a couple of lines, and elegantly. Here is the code for that

while (<INFILE>) {

    # remove newline from end & leading whitespace,
    # then split the row on whitespace
    # (done as separate statements: chaining them with && would skip
    # the split whenever a line has no leading whitespace)
    chomp;
    s/^\s+//;
    my @r = split /\s+/;
    
    # create a key in lut hash using rgb vals
    push @{$lut{"$r[1].$r[2].$r[3]"}}, $r[0];
}

while (my ($k, $v) = each %lut) {
    $k =~ s/\./ /g;     # replace . in hash key with space
    my @v = sort { $a <=> $b } @$v;   # sort the brightness values
                                      # numerically to get min/max values
    
    print OUTFILE 
       "CLASS\n" . 
       "    EXPRESSION ([pixel] >= $v[0] AND [pixel] <= $v[$#v]\n" . 
       "    STYLE\n" . 
       "        COLOR $k\n" . 
       "    END\n" . 
       "END\n";
}

I was able to solve the above because of the uniqueness requirement; otherwise it would have been the usual slog. So, is there a generic approach to this? And is there a way I can validate the output, that is, ensure that the output is what I really want, given very long input text files?
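For what it's worth, here is a minimal sketch of how the state-machine idea from the updates might apply to the first file. It is only a sketch under stated assumptions: parse_records is a name made up for illustration, a rule line is taken to be four or more '-' or '*' characters, and the whole file is assumed small enough to hold in memory. Blank lines are kept exactly as they appear, so the values may differ from my hand-written example above by a trailing newline.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): split the text into records
# with a small state machine. States: 'title' (just saw an "a" line),
# 'b' (collecting the b block), 'c' (collecting the c block).
# Assumption: a rule line is 4+ '-' (under a title) or 4+ '*' (between b and c).
sub parse_records {
    my ($text) = @_;
    my @lines = split /\n/, $text, -1;
    pop @lines if @lines && $lines[-1] eq '';   # artifact of the final newline

    my (@records, $state);
    for my $i (0 .. $#lines) {
        my $line = $lines[$i];
        my $next = $i < $#lines ? $lines[$i + 1] : '';

        if ($next =~ /^-{4,}\s*$/) {
            # a line followed by dashes is a title: start a new record
            push @records, { a => $line, b => '', c => '' };
            $state = 'title';
        }
        elsif (defined $state && $state eq 'title' && $line =~ /^-{4,}\s*$/) {
            $state = 'b';                       # the dashes themselves are skipped
        }
        elsif (defined $state && $state eq 'b' && $line =~ /^\*{4,}\s*$/) {
            $state = 'c';                       # the stars switch from b to c
        }
        elsif (defined $state && ($state eq 'b' || $state eq 'c')) {
            $records[-1]{$state} .= "$line\n";  # ordinary text goes to the current block
        }
        # anything before the first title is ignored
    }
    return @records;
}

# Small made-up sample in the same shape as the file above
my $sample = join "\n",
    'first title',  '----------------', '', 'body one', '****************', 'tail one', '',
    'second title', '----------------', 'body two', '****************', 'tail two', '';

my @foo = parse_records($sample);
printf "%d records; first a = '%s'\n", scalar @foo, $foo[0]{a};
```

The one-line lookahead ($next) is what lets a title be recognized before the dashes arrive, so no text has to be moved out of the previous record's c block after the fact.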

--

when small people start casting long shadows, it is time to go to bed

In reply to breaking a text file into a data structure -- best way? by punkish
