peterp has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I usually don't have any problems parsing complicated text data in Perl, but this one has me stumped. My data basically describes a multidimensional hash, where the number of tabs at the beginning of each line (actually 4 spaces equate to a tab in this case), together with the line order, determines its place. I realise I will likely need some kind of recursive routine, but every time I try something it quickly becomes apparent that I am going in completely the wrong direction, and I haven't come anywhere close to a working solution. I understand it's common courtesy to provide what I have tried so far, but in all honesty my attempts have been worse than useless and would be completely unhelpful to anyone. I'm more than happy with just some advice in response.

Regards,

Chris

Input:
one
    two
    three
        four
five
    six
Expected output (Dumper $hash):
$VAR1 = {
          'five' => {
                      'six' => undef
                    },
          'one' => {
                     'three' => {
                                  'four' => undef
                                },
                     'two' => undef
                   }
        };

Re: Parse data representing hash
by AppleFritter (Vicar) on Jun 28, 2014 at 22:22 UTC

    Some advice, as requested:

    • Keep track of the current "chain of keys" as you progress through the lines.
    • For each line, use the number of tabs to determine how many of the previously-saved keys you need to keep. Truncate your chain, use it to add a new entry to your hash at the right spot with the newly-read key, then add that key to your chain.

    So your script would work like this:

    1. Read line 1. @keychain is empty. Zero tabs, so you truncate it to zero entries (a no-op, coincidentally). Progress through $hash along your @keychain; since it's empty, you're still at the root. Add an entry for 'one' to the hash, add 'one' to @keychain.
    2. Read line 2. @keychain contains 'one'. One tab, so you truncate it to one entry (another no-op). Progress through $hash along your @keychain; you'll end up at $hash->{'one'}. Add an entry for 'two' to the hash, add 'two' to @keychain.
    3. Read line 3. @keychain contains 'one' and 'two'. One tab, so you truncate it to one entry (NOT a no-op this time). Progress through $hash along your @keychain; you'll end up at $hash->{'one'} again. Add an entry for 'three' to the hash, add 'three' to @keychain.
    4. Read line 4. @keychain contains 'one' and 'three'. And so on...

    You get the idea - this is how I'd approach this.
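The steps above can be sketched roughly as follows. This is only an illustration, assuming four spaces per level as in the question; the sub name parse_with_keychain and the handling of over-indented lines (they simply become children of the previous line) are my own assumptions, not part of the advice.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: build a nested hash from indented lines by
# tracking the current chain of keys. Assumes 4 spaces per level.
sub parse_with_keychain {
    my @lines = @_;
    my %hash;
    my @keychain;
    for my $line (@lines) {
        # Split leading spaces from the key; skip blank lines.
        my ($indent, $key) = $line =~ /^( *)(\S.*?)\s*$/ or next;
        my $depth = int(length($indent) / 4);

        # Keep only the ancestors of this line.
        splice @keychain, $depth if $depth < @keychain;

        # Walk from the root along the chain, turning undef leaves into hashes.
        my $node = \%hash;
        $node = $node->{$_} //= {} for @keychain;

        $node->{$key} = undef;    # every new entry starts out as a leaf
        push @keychain, $key;
    }
    return \%hash;
}

$Data::Dumper::Sortkeys = 1;
print Dumper(parse_with_keychain(
    'one', '    two', '    three', '        four', 'five', '    six',
));
```

Leaves stay undef until a deeper line shows up beneath them, which matches the expected output in the question.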

      Thank you for your advice, it correlates with what others have said by maintaining state in @keychain.

      When truncating, I suppose the best approach would be something along the lines of splice @keychain, $tabs_count; (update: never mind this question, the example provided by choroba covers this)

        Yes, splice is likely the best approach. From its documentation:

        Removes the elements designated by OFFSET and LENGTH from an array [...] If LENGTH is omitted, removes everything from OFFSET onward.

        So with zero-based array indexing, it really is as simple as splice @keychain, $tabs_count;, yes.

        You could also use an array slice, BTW, e.g. @keychain = @keychain[0 .. ($tabs_count - 1)];, but that's less elegant and idiomatic.
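A quick side-by-side of the two truncation idioms mentioned here, as a small sketch. The wrapper subs truncate_with_splice and truncate_with_slice are made-up names for illustration only.

```perl
use strict;
use warnings;

# Hypothetical wrappers around the two idioms; both keep the first
# $tabs_count elements of the key chain and return the result.
sub truncate_with_splice {
    my ($tabs_count, @keychain) = @_;
    splice @keychain, $tabs_count;    # remove everything from OFFSET onward
    return @keychain;
}

sub truncate_with_slice {
    my ($tabs_count, @keychain) = @_;
    @keychain = @keychain[0 .. $tabs_count - 1];    # keep indices 0 .. $tabs_count - 1
    return @keychain;
}

my @a = truncate_with_splice(1, qw(one two three));
my @b = truncate_with_slice(1, qw(one two three));
print "@a\n";    # one
print "@b\n";    # one
```

With $tabs_count equal to 0 the slice range is 0 .. -1, which is an empty list, so both versions empty the array.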

Re: Parse data representing hash
by choroba (Cardinal) on Jun 28, 2014 at 22:05 UTC

      Hi,

      Thank you very much for your suggestion. I'm unable to test right now since I'm on holiday using an online compiler, but it looks like exactly what I needed.

      My data is actually slightly more complicated than the example I provided, since each row has additional information. I was therefore planning on eventually using the following design: $hash->{one}->{children}->{two}..., $hash->{one}->{data}->{url} = 'example.com' etc. to parse the row "one|url=>example.com", and I will read the documentation carefully to see if this is possible.
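One way the children/data layout could look, as a rough sketch only. The thread shows a single example row ("one|url=>example.com"), so the assumption that several name=>value pairs may follow the key, each separated by "|", is mine, as is the sub name parse_rows. It uses a plain stack of nodes rather than Data::Diver.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical row format: "key|name=>value|name2=>value2",
# indented 4 spaces per level.
sub parse_rows {
    my @lines = @_;
    my $root  = { children => {} };
    my @stack = ($root);    # $stack[$d] is the parent node for a row at depth $d
    for my $line (@lines) {
        my ($indent, $rest) = $line =~ /^( *)(\S.*?)\s*$/ or next;
        my $depth = int(length($indent) / 4);
        $depth = $#stack if $depth > $#stack;    # assumption: clamp over-indented rows

        my ($key, @pairs) = split /\|/, $rest;
        my %data = map { my ($k, $v) = split /=>/, $_, 2; ($k => $v) }
                   grep { /=>/ } @pairs;

        my $node = { data => \%data, children => {} };
        $#stack = $depth;                        # drop everything below the parent
        $stack[$depth]{children}{$key} = $node;
        push @stack, $node;
    }
    return $root->{children};
}

$Data::Dumper::Sortkeys = 1;
print Dumper(parse_rows('one|url=>example.com', '    two', 'three'));
```

Here every node gets both a data and a children hash, even when empty, which avoids the undef-leaf special case at the cost of a slightly bulkier structure.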

      Thanks again

      To maintain hash of hash of .... in case the data has lines of numbers ...  DiveRef($hash, \(@path)  );
Re: Parse data representing hash
by LanX (Saint) on Jun 28, 2014 at 22:00 UTC
    No need for recursion.

    Your HoHoH... always has hashrefs or undef as values.

    You need one state var: a @path array holding the refs up to the last value so far.

    Whenever you parse the indentation you get the index of the next entry in @path, and the "parent" entry points to the hash you need to extend.

    If the index is smaller you need to shorten @path; if bigger you have to extend @path and transform the last undef into a hashref.

    I hope you get the idea...

    I'm mobile so no chance for tested code, but I'm sure the archive has many examples.

    Cheers Rolf

    (addicted to the Perl Programming Language)

      Thank you,

      This information is very useful. If I understand correctly your suggested design is much like what choroba has suggested below, which appears to build the state into an array and pass this to the core DiveRef function.

      Regards,

      Chris

        Similar but not identical from what I see.

        I'd rather keep the values° in @path, choroba keeps the keys.

        Like this I don't need any dive function, the ref of the hash to be extended is already available (or must be autovivified if undef)

        update
        The only complication is that you prefer undef instead of an empty hash for leaves of your HoH tree, which leads to a test condition in edge cases.

        Cheers Rolf

        (addicted to the Perl Programming Language)

        °) or even the refs of the values
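A possible reading of this idea in code, as a sketch under my own assumptions: @path holds references to the value slots, so an undef leaf can be promoted to a hashref in place and no dive function is needed. The sub name parse_with_path and the clamping of over-indented lines are not from the post.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: @path holds refs to the value slots, so the hash
# to extend is always at hand. Assumes 4 spaces per level.
sub parse_with_path {
    my @lines = @_;
    my $root;                 # becomes a hashref once the first key is added
    my @path = (\$root);      # $path[$d] refers to the slot holding the hash for depth $d
    for my $line (@lines) {
        my ($indent, $key) = $line =~ /^( *)(\S.*?)\s*$/ or next;
        my $depth = int(length($indent) / 4);
        $depth = $#path if $depth > $#path;    # assumption: clamp over-indented lines
        $#path = $depth;                       # shorten @path down to the parent slot

        my $slot = $path[$depth];
        my $hash = ($$slot //= {});            # the edge case: promote an undef leaf
        $hash->{$key} = undef;
        push @path, \$hash->{$key};            # remember the new leaf's slot
    }
    return $root;
}

$Data::Dumper::Sortkeys = 1;
print Dumper(parse_with_path(
    'one', '    two', '    three', '        four', 'five', '    six',
));
```

The `//=` line is the test condition mentioned in the update: it only matters because leaves are undef rather than empty hashes.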

Re: Parse data representing hash
by hdb (Monsignor) on Jun 29, 2014 at 17:23 UTC

    Here is my humble attempt to solve your interesting puzzle:

    use strict;
    use warnings;
    use Data::Dumper;

    my %hash;
    my @row;
    while (<DATA>) {
        my ($level, $word) = /^(\s*)(\w*)$/;
        $level = length($level) / 4;    # 4 spaces per level
        $row[$level] = $word;
        $#row = $level;                 # truncate @row to the current depth
        my $last = \%hash;
        for (0 .. $#row) {
            # create missing entries: undef for the leaf, {} for intermediate levels
            $last->{$row[$_]} = $_ == $#row ? undef : {} if not defined $last->{$row[$_]};
            $last = $last->{$row[$_]};
        }
    }
    print Dumper \%hash;
    __DATA__
    one
        two
        three
            four
    five
        six

      Thank you very much for providing your attempt. It's very similar to my alternative version to using Data::Diver, but has provided some insight into ways I can improve. Notably, the $#array = $index syntax is new to me and is a nice alternative to using splice or a slice. Also, I prefer your for loop over my recursive function to construct the resultant hash.

      Regards.

Re: Parse data representing hash
by ww (Archbishop) on Jun 28, 2014 at 21:47 UTC
    "I understand it's common courtesy to provide what I have tried so far...."

    Well, 'courtesy' is only part of it. The more important part grows out of the Monastery's reason for existence: to help you learn. It's hard to know where your issues are ... ie, what you may need to know/learn more about -- if you don't show us your attempt(s) (boiled down to a concise failure case) and the exact warning, error and/or (output from your code+explanation of how that's not what you intended).

    No downvote from me, in this case, but please heed this advice.


    Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:
    1. code
    2. verbatim error and/or warning messages
    3. a coherent explanation of what "doesn't work" actually means.

    check Ln42!

      Hi,

      This is my code as it stands, I know it doesn't work and I know why it doesn't work. It basically represents the foundations of my best attempt.

      Chris

      use strict;
      use warnings;
      use Data::Dumper;

      my $rows = [ ];
      while ( <DATA> ) {
          chomp;
          my $depth = $_ =~ s/\s{4}//g;
          $depth ||= 0;
          push @$rows, { depth => $depth, key => $_ };
      }

      my $ref = process( { }, $rows );
      print Dumper $ref;

      sub process {
          my ( $ref, $rows ) = @_;
          my $current_row = shift @$rows;
          my $next_row = $rows->[0] // return $ref;
          my $current_key = $current_row->{key};
          my $current_depth = $current_row->{depth};
          my $next_key = $next_row->{key};
          my $next_depth = $next_row->{depth};
          print "$current_key, $current_depth, $next_key, $next_depth\n";
          $ref = $ref->{$current_key} = { };
          if ( $current_depth > $next_depth ) {
              return $ref;
          }
          elsif ( $current_depth < $next_depth ) {
              $ref = process( $ref, $rows );
          }
          return $ref;
      }
      __DATA__
      one
          two
          three
              four
      five
          six
Re: Parse data representing hash
by remiah (Hermit) on Jun 30, 2014 at 00:39 UTC
    I was thinking of another attempt to solve this puzzle:
    making hash notation text and eval'ing it.
    And this is nothing better than hdb's one, maybe ...
    use strict;
    use warnings;
    use Data::Dumper;

    sub proc {
        my ($pre, $cnt_cur, $end_brackets) = @_;
        my $buff;
        $buff = $pre->{data};
        if ($cnt_cur > $pre->{cnt}) {            # next is children
            $buff .= ' => {';
            push(@$end_brackets, '}');
        } elsif ($cnt_cur == $pre->{cnt}) {      # next is brothers
            $buff .= ' => undef ,';
        } else {
            $buff .= ' => undef ,';
            for (1 .. ($pre->{cnt} - $cnt_cur)) {    # output right bracket till that depth
                $buff .= pop(@$end_brackets) . ",";
            }
        }
        return $buff;
    }

    my ($pre, $cnt, $ret, @end_brackets);
    $ret = '{';
    while (<DATA>) {
        $cnt = $_ =~ s/\s{4}//g;
        $cnt = $cnt || 0;
        if ($pre) {
            $ret .= proc($pre, $cnt, \@end_brackets);
        }
        $pre = {cnt => $cnt, data => $_};
    }
    $ret .= proc($pre, 0, \@end_brackets);
    $ret .= '}';
    print Dumper eval($ret);
    __DATA__
    one two three four three_another four_another five six seven eight nine ten 11 12 13 14 15 16 17 18 19 20 21 22 23 24

      Thanks for providing your solution to the problem. I found it very interesting particularly as you have taken a different approach to others by building a string then evaluating it into the data structure it represents.

      It works absolutely fine for the example data I provided, although my real world keys contain special characters such as { and } therefore I had to adjust to $buff = "'$pre->{data}'";. As a side effect of doing this, I also had to chomp each row.

      The only other difference to other solutions I noted when running with my real world data was, my real world data isn't entirely regular i.e. some rows unfortunately have extra spacing (stupid mistake in the code that generated the data) e.g.

      one two three

      Other solutions unwittingly accounted for this by creating an empty '' key on an intermediate level, although they emitted "Use of uninitialized value in hash element" warnings, which I suppressed by ensuring the undefined key defaulted to ''. I haven't yet understood your code well enough to explain why, but under this scenario your code breaks / generates an invalid data structure.

      In order to fix this issue for all solutions, I think I will have to adjust the way depth is calculated: perhaps compare the current row to the previous one and check whether the gap is bigger or smaller (the complexity is how much smaller), as opposed to assuming there will be a change of either +4, 0 or multiples of -4 spaces, or at least keep track of when the extra spacing occurs and account for it. My real world data isn't irregular enough to make it impossible to decipher which parent level to return to; the extra spacing consistently occurs at particular levels. Perhaps it will be best just to simply programmatically clean up the data before processing!
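The "compare with previous rows" idea could be sketched like this. It is a guess at one workable scheme, not a tested fix for the real data: each row's parent is taken to be the nearest earlier row with a strictly smaller indent width, so the exact number of spaces no longer matters. The sub name parse_by_relative_indent is made up.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: derive nesting from relative indentation only.
# A row's parent is the nearest earlier row with a smaller indent width.
sub parse_by_relative_indent {
    my @lines = @_;
    my $root;
    my @stack = ( [ -1, \$root ] );    # [indent width, ref to value slot]; -1 marks the root
    for my $line (@lines) {
        my ($indent, $key) = $line =~ /^( *)(\S.*?)\s*$/ or next;
        my $width = length $indent;

        # Back up to the nearest row that is indented less than this one.
        pop @stack while $stack[-1][0] >= $width;

        my $slot = $stack[-1][1];
        my $hash = ($$slot //= {});    # promote an undef leaf to a hash
        $hash->{$key} = undef;
        push @stack, [ $width, \$hash->{$key} ];
    }
    return $root;
}

$Data::Dumper::Sortkeys = 1;
# The third row has 9 spaces instead of 8; it still nests under "two".
print Dumper(parse_by_relative_indent(
    'one', '    two', '         three', '    four', 'five',
));
```

This only resolves the ambiguity when a row returns to an indent width that an ancestor actually used; if the data can dedent to a width never seen before, cleaning it up beforehand is probably the safer route.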

      Thanks again