comment on

anli_:

Considering that the bulk of your data appears to be the text representation of various paths through a taxonomy tree, I thought that you might be able to fit it all into memory (taking advantage of all the redundancy) if you built a pair of trees and connected them together at the leaves. For example, if your data looked like this:

1;2;8;root;xyzzy;cat
1;2;5;root;xyzzy;dog
1;9;root;bird
[download]

Then we could build an index tree (on top) and the taxonomy tree (below), tying them together (shown by vertical lines) like this:

        1
       / \
      /   \
     2     \
   --^-     \
  /    \     \
 8     5      9
 |     |      |
cat   dog   bird
 \     /     /
  xyzzy     /
      \    /
       root
[download]

If your tree is relatively shallow but broad, you should be able to save a considerable amount of space.

Here's some code that builds the two trees and looks up some of the numeric keys to display the data. Let me know if it does the trick for you, I'd be interested in knowing how much memory it takes to hold your database.

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dump qw(pp);

# the file containing the sample data you provided
open my $FH, '<', '1118135.dat2' or die $!;

# Index (multilevel numeric index)
my $tree_nums  = { };

# Hierarchy tree
my $tree_names = { };

# randomly-selected test data
my @tests;

###
# Read file and build index hash as well as taxonomy hash
###

while (my $t = <$FH>) {
    my @flds = split /;\s*/,$t;

    # split numbers into keys and hierarchy into flds
    my @keys;
    push @keys, shift @flds while $flds[0] =~ /^\d+$/;

    # select some of the keys for testing
    if (.1 > rand) {
        push @tests, [ @keys ];
    }

    # build taxonomy tree
    my $leaf = build_name_tree(@flds);

    # Build index tree
    my $p = $tree_nums;
    while (@keys) {
        my $k = shift @keys;
        $p->{$k} = {} if ! exists $p->{$k};
        $p = $p->{$k}
    }

    # Tie index tree to taxonomy tree
    $p->{ch} = $leaf;
}

# Debug junk
#dump_tree($tree_names), "\n";
#dump_tree($tree_nums), "\n";

# Run some test lookups
for my $t (@tests) {
    my $p = $tree_nums;

    # Given a list of keys, get the reference to the desired
    # leaf of the taxonomy tree
    my @keys = @$t;
    for my $k (@keys) {
        die "Eh? " if ! exists $p->{$k};
        $p = $p->{$k};
    }
    # Now traceback function gives you the path from root to leaf
    print join(":",@keys), " => ", join(';', traceback($p->{ch})), "\n
+";
}

sub dump_tree {
    my ($t,$lev) = @_;
    $lev = $lev//0;

    if ('HASH' eq ref $t) {
        for my $k (sort keys %$t) {
            if ($k eq 'par') {
                print "  " x $lev, "Parent is <$t->{$k}[0]>\n";
            }
            else {
                print "  " x $lev, $k, "\n";
                dump_tree($t->{$k}, $lev+1);
            }
        }
    }
    elsif ('ARRAY' eq ref $t) {
        die "T is an array of " . scalar(@$t) . " elements";
    }
    else {
        #print pp($t), "\n";
        die "T is " . ref($t);
    }
}


sub build_name_tree {
    my @path = @_;

    my $p = $tree_names;

    while (@path>1) {
        my $parent = shift @path;
        my $child  = $path[0];
        if (! exists $p->{$child}) {
            $p->{$child} = { par=>[$parent, $p] };
        }
        else {
            if ($p->{$child}{par}[0] ne $parent) {
                die "Unexpected $child:$parent ??? $p->{$child}{par}[0
+]" unless $p->{$child}{par}[0] eq $parent;
            }
        }
        $p = $p->{$child};
    }
    return $p;
}

sub traceback {
    my ($node) = @_;
    my @lineage;
    while (defined $node and exists $node->{par}) {
        push @lineage, $node->{par}[0];
        $node = $node->{par}[1];
    }
    return reverse @lineage;
}
[download]

The taxonomy tree has parent links in it to let you get from the leaf to the root of the tree, and the traceback function will walk the tree back to the root for you.

Update: I added a couple comments to the code to clarify it a little.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

In reply to Re^3: Using indexing for faster lookup in large file by roboticus
in thread Using indexing for faster lookup in large file by anli_

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.