in reply to Self-Populating Tree Data Structure

Building a hash is a good way to go. At 60MB, the data should all fit in memory. An alternative would be to run the system sort on the file and then delete the lines that occur more than once; that would be appropriate for GB-size files. For this app, a hash should work great.
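For what it's worth, here is a rough sketch of the sort-based alternative, assuming the input has already been sorted by some external tool so that identical lines are adjacent. The sub name print_singles and the in-memory sample data are made up for illustration. It reads one line at a time, so it never holds the whole file in memory.

```perl
use strict;
use warnings;

# Print every line that occurs exactly once in an already-sorted stream.
# Assumes identical lines are adjacent (i.e. the input was sorted first).
sub print_singles {
    my ($in, $out) = @_;
    my ($prev, $run);
    while (my $line = <$in>) {
        if (defined $prev && $line eq $prev) {
            $run++;                      # same line again: extend the run
            next;
        }
        # A new line starts; emit the previous one if it appeared only once.
        print {$out} $prev if defined $prev && $run == 1;
        ($prev, $run) = ($line, 1);
    }
    print {$out} $prev if defined $prev && $run == 1;   # flush the final run
}

# Hypothetical sample, sorted in advance; a real run would open the sorted file.
my $sorted = "a\nb\nb\nc\n";
open my $in, '<', \$sorted or die "cannot open in-memory handle: $!";
print_singles($in, \*STDOUT);
```

Only the current line and the previous one are kept, which is why this scales to files that would not fit in a hash.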

A hash key can be any kind of string. There is actually not even a need to remove the \n!

my %hash;
while (<IN>) {
    $hash{$_}++;
}
while ( my ($key, $value) = each %hash ) {
    print $key if $value == 1;
}
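The above prints the lines that occur exactly once (in no particular order, since each walks the hash in its own internal order). Note that there is no need to check "if exists" or "if defined": if a key doesn't exist, Perl creates it when the ++ is applied, treating the missing value as 0!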
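If the original line order matters, one possible variation is to count first and then filter the lines in the order they were read. This is only a sketch: the sub name unique_lines and the sample records are invented, and it keeps all lines in an array, which is fine at this size but not for huge files.

```perl
use strict;
use warnings;

# Return the lines that occur exactly once, in first-seen order.
sub unique_lines {
    my @lines = @_;
    my %count;
    $count{$_}++ for @lines;    # a missing key is created by ++, no exists check needed
    return grep { $count{$_} == 1 } @lines;
}

# Hypothetical sample data; the trailing \n is left on, it is just part of the key.
my @sample = ("a.c 10 R1\n", "b.c 20 R2\n", "a.c 10 R1\n", "c.c 30 R3\n");
print unique_lines(@sample);
```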

Now let's say there is some need to parse the line, with split or a regex, into 3 different things: $file, $line, $rule. There is no need to do a join to make the key; $hash{"$file$line$rule"}++; would be just fine.

Update: If necessary, you can put some token (multi-character, or a single character such as ";") between the items, like "$file;$line;$rule", so that a simple split gets the 3 things back without needing a HoL (hash of lists) in the value field. A separator that cannot occur in the data also keeps different combinations from running together into the same key. Think simple and make it more complex only if you need to.
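A small sketch of the separator idea, assuming ";" never appears in any of the three fields. The helper names make_key and split_key and the sample values are hypothetical. Note that without the separator both sample records below would collapse into the same key "a123R".

```perl
use strict;
use warnings;

# Build a composite key; assumes ';' never occurs inside a field.
sub make_key {
    my ($file, $line, $rule) = @_;
    return "$file;$line;$rule";
}

# Recover the three fields; the -1 limit keeps trailing empty fields.
sub split_key {
    my ($key) = @_;
    return split /;/, $key, -1;
}

my %hash;
$hash{ make_key('a1', '23',  'R') }++;
$hash{ make_key('a',  '123', 'R') }++;

for my $key (sort keys %hash) {
    my ($file, $line, $rule) = split_key($key);
    print "file=$file line=$line rule=$rule count=$hash{$key}\n";
}
```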

As far as "Perl limitations" with complex data structures go... there aren't any! Anything arbitrarily complex that you could build in C has a Perl equivalent. Having said that, the basic Perl structures are super powerful, and I think they are enough for the app you have described. As for execution time, I would think we are talking seconds, not minutes, since you can do everything in a single linear pass through the input file.

Re^2: Self-Populating Tree Data Structure
by Anonymous Monk on Apr 30, 2009 at 20:55 UTC
    Thank you, I finally got around to trying this out. Rather than use the data tree, I used a hash with keys of the form "$rule$file$line" and made the value be 1 (anything would do, as long as a value exists). I then checked if the value existed, and if it did I ignored the line. It took less than a minute to complete.
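The approach described in this reply might look roughly like the sketch below. The sub name first_occurrences, the record layout (array refs of rule, file, line) and the sample data are assumptions, and a ";" separator has been added to the key, which the reply did not use.

```perl
use strict;
use warnings;

# Keep only the first record for each rule/file/line combination.
sub first_occurrences {
    my @records = @_;
    my %seen;
    my @kept;
    for my $rec (@records) {
        my ($rule, $file, $line) = @$rec;
        my $key = "$rule;$file;$line";   # separator is an assumption, not from the reply
        next if exists $seen{$key};      # already seen: ignore this record
        $seen{$key} = 1;                 # any value would do
        push @kept, $rec;
    }
    return @kept;
}

# Hypothetical sample records.
my @records = ( ['R1', 'a.c', 10], ['R2', 'b.c', 20], ['R1', 'a.c', 10] );
print join(' ', @$_), "\n" for first_occurrences(@records);
```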