Update0: My best buddy tells me this class of problems is called a "state machine." Googling for "Perl state machine" returns a bunch of hits that I am now in the process of digesting. In the meantime, I look to you for help.

Update1: Seems like http://www.perl.com/pub/a/2004/09/23/fsms.html might have the answer for me.

I have a longish text file like the one below. The gutter annotations are not part of the text file; they are only there to aid my question.

a> some random text
   ----------------
b>
b> a few random
b> lines
b>
b> of more
b> random
b>
b> text
   ****************
c> some more
c>
c> random
c> text
c>
a> some random text
   ----------------
b>
b> a few random
b> lines
b>
b> of more
b> random
b>
b> text
   ****************
c> some more
c>
c> random
c> text
c>

I want to split the file into an array of hashes like so

@foo = (
  {
    a => 'some random text',
    b => ' a few random lines of more random text ',
    c => 'some more random text ',
  },
  {
    a => 'some random text',
    b => ' a few random lines of more random text ',
    c => 'some more random text ',
  },
  .. and so on ..
);

In other words, each hash is made up of the snippet of text starting from the line that is followed by '--------------' up to, but not including, the next line that is followed by '--------------'.

I have two questions -- one, how do I do the above? I was hitting my head against a wall the entire day yesterday, so I come to you today. I have nothing to show you because everything I did was wrong. My approach was mostly to start from the beginning and go to the end, trying to keep flags on when one hash element began and when it ended, and so on. Which brings me to my second question.

What is the canonical design pattern for such a problem? I come across such problems all the time, and I always slow down in trying to solve them. A pattern that is visible to the eye becomes very difficult to program. Yesterday I had another such problem which I managed to solve, if I may say so myself, rather innovatively. The text file looked like so

bri red grn blu
0 0 0 0
1 0 0 0
2 0 0 0
..
99 0 0 0
100 0 255 255
101 0 250 255
102 0 246 255
..

The above had to be converted to

CLASS
 EXPRESSION ([pixel] >= 242 AND [pixel] <= 242
 STYLE
 COLOR 200 72 127
 END
END
CLASS
 EXPRESSION ([pixel] >= 175 AND [pixel] <= 175
 STYLE
 COLOR 191 236 0
 END
END
..

That is, group the brightness values by color triplets. After struggling with it for a while with the usual, line by line, flag as you go approach, I decided to turn the color triplets into keys of a hash. The problem was solved in a couple of lines, and elegantly. Here is the code for that

while (<INFILE>) {
    # remove newline from end and leading whitespace,
    # then split the row on whitespace (done as separate steps so a row
    # without leading whitespace is still split)
    chomp;
    s/^\s+//;
    my @r = split /\s+/;

    # skip the header row and anything else that is not four numbers
    next unless @r == 4 && !grep { /\D/ } @r;

    # create a key in lut hash using rgb vals
    push @{$lut{"$r[1].$r[2].$r[3]"}}, $r[0];
}

while (my ($k, $v) = each %lut) {
    $k =~ s/\./ /g;                  # replace . in hash key with space
    my @v = sort { $a <=> $b } @$v;  # sort the brightness values numerically
                                     # to get min/max values
    print OUTFILE "CLASS\n" .
        " EXPRESSION ([pixel] >= $v[0] AND [pixel] <= $v[$#v]\n" .
        " STYLE\n" .
        " COLOR $k\n" .
        " END\n" .
        "END\n";
}

I was able to solve the above because of the uniqueness requirement; otherwise it would have been the usual slog. So, is there a generic approach to this? And is there a way I can validate the output, that is, ensure that the output is what I really want, given very long input text files?
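For the first file, the state-machine idea from Update0 might look roughly like the sketch below. It is only a sketch under some assumptions: a separator is taken to be any line made up of three or more '-' or '*' characters, the sub name parse_records and the state names are made up for illustration, and each line is held back for one iteration, because a title line can only be recognised once the dashed line after it has been seen.

```perl
use strict;
use warnings;

# parse_records is a hypothetical helper name, not something from a module.
# It reads a filehandle and returns a reference to an array of hashes
# with keys a, b and c.
sub parse_records {
    my ($fh) = @_;
    my @records;
    my $state = 'start';   # 'start', 'b' or 'c': the section being filled
    my $held;              # previous line, not yet assigned to a section

    while (defined(my $line = <$fh>)) {
        chomp $line;

        # Dashed line (assumed: 3 or more '-'): the held line was a title,
        # so it starts a new record instead of ending up in section c.
        if ($line =~ /^-{3,}\s*$/) {
            push @records,
                { a => (defined $held ? $held : ''), b => '', c => '' };
            undef $held;
            $state = 'b';
            next;
        }

        # Starred line (assumed: 3 or more '*'): section b is complete.
        if ($state eq 'b' && $line =~ /^\*{3,}\s*$/) {
            $records[-1]{b} .= "$held\n" if defined $held;
            undef $held;
            $state = 'c';
            next;
        }

        # Ordinary line: the held line belongs to the current section
        # (lines before the first title are dropped); hold this one.
        $records[-1]{$state} .= "$held\n"
            if defined($held) && $state ne 'start';
        $held = $line;
    }

    # Flush whatever is still held at end of file.
    $records[-1]{$state} .= "$held\n"
        if defined($held) && $state ne 'start';

    return \@records;
}

# Small demonstration on an in-memory file.
my $sample = "title one\n-----\nbody\n*****\ntail\n";
open my $in, '<', \$sample or die "cannot open in-memory file: $!";
my $recs = parse_records($in);
printf "%d record(s); first title: %s\n", scalar @$recs, $recs->[0]{a};
```

As a rough sanity check on a very long input, the number of records returned could be compared with an independent count of the dashed lines in the file; the two should agree if the separators are what the sketch assumes.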

--

when small people start casting long shadows, it is time to go to bed

In reply to breaking a text file into a data structure -- best way? by punkish
