Re: breaking a text file into a data structure -- best way?

Let me try to help:

First, analyze the structure of your input and name its parts. This input consists of lines, and each line consists of three fields: a prefix, a separator and a text field. Consecutive lines with the same prefix form a block, and consecutive blocks form a prefix alphabet.

Second, answer this question: can the input be parsed line by line, or do you have to look around (at a certain point) to decide what is what? The former typically results in more efficient programs (but it is not possible for all types of input); the latter is generally easier to code, but requires holding more of your input in memory. (I chose the line-by-line approach, storing only the previous prefix in addition to the current line.)
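
To make the difference concrete, here is a minimal sketch that counts prefix-blocks both ways. The sample lines and the sub names (count_blocks_streaming, count_blocks_lookahead) are made up for illustration; they are not part of the original problem.

```perl
use strict;
use warnings;

# Hypothetical sample input: each line is "prefix> text".
my @lines = ( "a> one", "b> two", "b> three", "a> four" );

# Line-by-line: the only state kept between lines is the previous prefix.
sub count_blocks_streaming {
    my @input = @_;
    my ( $blocks, $prev ) = ( 0, undef );
    for my $line (@input) {
        my ($prefix) = split />/, $line, 2;
        $blocks++ if !defined $prev || $prefix ne $prev;    # a block starts here
        $prev = $prefix;
    }
    return $blocks;
}

# Look-around: hold the whole input in memory and peek at the next line.
sub count_blocks_lookahead {
    my @input  = @_;
    my $blocks = 0;
    for my $i ( 0 .. $#input ) {
        my ($prefix) = split />/, $input[$i], 2;
        my ($next) = $i < $#input ? split( />/, $input[ $i + 1 ], 2 ) : ();
        $blocks++ if !defined $next || $prefix ne $next;    # a block ends here
    }
    return $blocks;
}

print count_blocks_streaming(@lines), "\n";
print count_blocks_lookahead(@lines), "\n";
```

Both subs give the same count; the first could just as well read from a filehandle one line at a time, while the second needs the whole array up front.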

Then constrain yourself to go through your input line by line and ask yourself: which states (or state transitions) determine what I should do?

1. at the start of a new (consecutive) prefix-block
2. in the middle of a prefix-block
3. prefix alphabet starting over

How do these states map to relations between lines? By comparing the prefix of the current line with that of the previous line.

What is the tool to express these relations between the lines? String comparison. The mapping is (cf. the previous list, in the same order):

1. $prefix gt $prev_prefix
2. $prefix eq $prev_prefix
3. $prefix lt $prev_prefix
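
As a small aside, the three comparisons can also be folded into a single cmp, which returns 1, 0 or -1. The sketch below only names the transition; the helper name transition and its return strings are hypothetical, chosen just for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: name the transition between two consecutive prefixes.
sub transition {
    my ( $prev_prefix, $prefix ) = @_;
    my $order = $prefix cmp $prev_prefix;    # 1, 0 or -1
    return $order > 0  ? 'new block'
         : $order == 0 ? 'same block'
         :               'alphabet restarts';
}

print transition( '',  'a' ), "\n";    # first line: any prefix sorts after ''
print transition( 'a', 'b' ), "\n";
print transition( 'b', 'b' ), "\n";
print transition( 'c', 'a' ), "\n";
```

Keep in mind that gt, lt and cmp compare strings character by character, so without locale handling 'B' sorts before 'a'. This only works as intended if the prefixes really do sort in the order in which they appear.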

What should I do at each state transition?

1. add $prefix => $text to the current hash
2. append $text to the current $hash{$prefix}
3. push a new hash ref to your array: { $prefix => $text }

Now try to write it again and if you're stuck, come back and look at this:

use strict;
use warnings;
use Data::Dump qw( pp );

my $ref = [ {} ];
my $prev_prefix = '';

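while (<DATA>) {
  # split at the first ">" only, so a ">" inside the text is preserved
  my ( $prefix, $text ) = split /> ?/, $_, 2;
  if ( $prefix gt $prev_prefix ) {
    # a new prefix-block starts: add it to the current hash
    $ref->[-1]{$prefix} = $text;
  }
  elsif ( $prefix eq $prev_prefix ) {
    # still inside the same prefix-block: append the text
    $ref->[-1]{$prefix} .= $text;
  }
  else {
    # the prefix alphabet starts over: begin a new hash
    push @$ref, { $prefix => $text };
  }
  $prev_prefix = $prefix;
}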

pp $ref;

__DATA__
a> some random text
b> 
b> a few random
b> lines
b> 
b> of more 
b> random
b> 
b> text
c> some more 
c> 
c> random
c> text
c> 
a> some random text
b> 
b> a few random
b> lines
b> 
b> of more 
b> random
b> 
b> text
c> some more 
c> 
c> random
c> text
c>

Of course this is only one approach, but clarifying the concepts and thinking methodically about the mechanical way to solve a problem has always helped me.

And in general: practice, and practice more. Read books, read the code of others (don't just glance over it, but change it and understand it), read the problems of others and try to solve them without looking at the solutions others have posted.

Cheers


Re^2: breaking a text file into a data structure -- best way? by punkish (Priest) on Apr 10, 2010 at 00:35 UTC
Thanks for the response, but you misunderstood my task. The 'a>', 'b>', 'c>' are not really present in the text file. I included them as "line numbers" to illustrate where I wanted the text split up. In the specific case I presented, the text is split up at the line before the line that starts with '======'. In any case, I am curious about a general approach to such problems, and at first glance, it seems that a state machine approach would help me. However, I got stuck with that as well, especially since my splitting markers are not in the line where I want to split the text, but after the line on which I want to split.
-- when small people start casting long shadows, it is time to go to bed
Re^3: breaking a text file into a data structure -- best way? by rubasov (Friar) on Apr 10, 2010 at 03:42 UTC
"but you misunderstood my task"
Indeed. I have been doing really stupid things these last few days, sorry.