Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Like most other posters, I'm new to perl and have been learning it for the past week. Due to the nature of the assignment, I'm unfortunately unable to post any code, and the limitation I'm under requires that variables be declared at the top of the program, c-style. I've been assigned to a company project to create a tool that reads a text file, goes through every line and removes duplicates and non-essential lines, then performs certain actions on the unique lines. I managed to get a working, but inefficient script written. The problem I ran into was how to efficiently compare lines of text in a text file that is at least 60mb. I managed to cut the run time from 3 hours to 15 minutes by restricting how often the script checks for duplicates, but that's still too long.

The solution I came up with is rather than comparing each read line with a list of existing unique lines, break up the lines into 3 keywords, then search a data tree for matches. I'll get to the tree's structure later in the post.

To understand the structure of the tree, each line that needs to be compared uses the following format:
"..\..\folder\folder\file.ext", line #: junk: rule #:
I need to compare the file name, line #, and rule # to the corresponding information from previous lines. If all three keywords match, the line is a duplicate and can be tossed aside. If not, the line needs to be added to the data tree for the next loop iteration. Getting these values hasn't been a problem.

A picture would be best, but I don't have the means to upload one, so here's my best shot. The format of the data structure, using perl's limitations, might be that of an array with length 3, where the first entry contains a list of Rule #s (Rule #s are the fewest in number, so are searched first). The second entry is a 2D matrix containing a list of files under each rule. The third entry would be a 3D matrix containing a list of line #s for every file.

Which leads to my problem. I've read up on creating multi-dimensional arrays in perl. What I can't find are any examples of a data tree that isn't binary. My biggest question is how do I declare the array(s)? Is it something like:

my @tree = {@R[],@F[@F2[]],@L[@L1[@L2[]]]};

I'll need to use some nested loops to search and add to the tree, so how do I iterate through the correct index of the correct array?

Also, should I use hashes instead of arrays? Is there an efficient way to do string comparisons (the comparison itself), therefore skipping the whole data tree mess?

Replies are listed 'Best First'.
Re: Self-Populating Tree Data Structure
by BrowserUk (Patriarch) on Apr 17, 2009 at 21:09 UTC

    Just concatenate your three fields together and use it as a hash key. The de-dup becomes a one-pass, O(1) lookup process:

    my %lookup; open IN, '<', ... open OUT, '>', ... while( <IN> ) { my( $file, $line, $rule ) = m[...(...)...(...)...(...)] or warn("Bad data: '$_' line: $.\n") and next; if( exists $lookup{ join $;, $file, $line, $rule } ) { ## Duplicate, discard next; } else { ## New, record. $lookup{ join $;, $file, $line, $rule } = 1; print OUT; ## output $_ } } close IN; close OUT;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Self-Populating Tree Data Structure
by Marshall (Canon) on Apr 18, 2009 at 02:20 UTC
    Building a hash is a good way to go. With 60MB, it should all fit within memory. Of course an alternate idea would be to use system sort on that file and then deleting lines that occur more than once. This would be appropriate for GB size files. For this app, hash should work great.

    A hash key can be any kind of string. There is actually not even a need to remove the \n!

    my %hash; while (<IN>) { $hash{$_}++; } while ( my($key,$value) = each %hash) { print $key if ($value == 1); }
    above would print non-duplicate lines. Note that there is no need to check "if exists" or "if defined", if a key doesn't exist, Perl will create it before the ++ increment!

    Now let's say that there is some need to parse the line with split or a REGEX into 3 different things, $file,$line,$rule...There is no need to do a join to make the key.$hash{"$file$line$rule"}++; would be just fine.

    Update: If this is necessary, you can put some token (could be multi-character or single ";",etc) between the items, like "$file;$line;$rule" so that you can use simple split to get the 3 things back without needing a HoL (Hash of List) in the value field. Think simple and make it more complex if you need to.

    As far as "Perl Limitations" with complex data structures...there aren't any! A Perl equivalent to any kind of arbitrarily complex thing that you could make in C, can be made in Perl. Having said that, the Perl basic structures are super powerful! And I think enough for the app you have described. As far as execution time goes, I would think that we are talking seconds, not minutes as you can do everything with one single linear pass through the input file.

      Thank you, I finally got around to trying this out. Rather than use the data tree, I used a hash with keys of the form "$rule$file$line" and made the value be 1 (anything would do, as long as a value exists). I then checked if the value existed, and if it did I ignored the line. It took less than a minute to complete.
Re: Self-Populating Tree Data Structure
by QM (Parson) on Apr 17, 2009 at 21:14 UTC
    Sounds like you need a hash (untested):
    my %seen; my %tree; while (<>) { chomp; # change this to match, as you didn't give a good example my ($file,$line,$rule) = m/^"(.*?)"\s*(\d+):\s*[^:]*?:\s*(\d+):/; unless ( exists( $seen{"$file,$line,$rule"} ) { push @{$seen{"$file,$line,$rule"}}, $file,$line,$rule; $tree{$rule}{$file}{$line} = 1; } }

    Update: changed %hash name to %seen to match my declaration.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of