LittleGreyCat has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks,

Let me first try to describe the problem I am trying to solve:

I have a large file with many complex entries, each relating to a file in a Unix filestore tree.

I wish to extract a subset of these entries to match part of the current filestore tree; I can produce a list of the current filestore tree using the 'find' command.

So I have two files:

The true filestore list

/fred/myfile
/fred/myfile2
/bert/myfile

The large complex file

user ALLFILES /fred/myfile=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
user ALLFILES /fred/myfile2=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
user ALLFILES /fred/myfile3=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
user ALLFILES /bert/myfile=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
user ALLFILES /bert/myfile2=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine

You will note that the third field (up to the '=') in the complex file is the filename in the real filestore tree.

My tentative plan is to set up the first file as a hash, indexed by the whole contents of each line, and then read serially through the second file, splitting out the file name component and matching it with the Hash.

If I get a hit, I then overwrite the matching entry in the Hash with the current line in my complex input file.

At the end I should have copied all the matching entries out of the complex file, and these should now be in the other file.

Any lines without a match will be unchanged.
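The plan above can be sketched in a few lines of Perl. This is only a minimal in-memory illustration, not a tested solution: it uses the sample data from the question through in-memory filehandles, and it assumes that the first two fields contain no whitespace and that file names contain no '='. The variable names (%entry_for, @order) are invented for the example.

```perl
use strict;
use warnings;

# Sample data copied from the question; real code would open the two files instead.
my $filestore_list = <<'END';
/fred/myfile
/fred/myfile2
/bert/myfile
END

my $complex_file = <<'END';
user ALLFILES /fred/myfile=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
user ALLFILES /fred/myfile2=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
user ALLFILES /fred/myfile3=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
user ALLFILES /bert/myfile=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
user ALLFILES /bert/myfile2=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
END

# Pass 1: key the hash on each whole line of the filestore list.
# The value starts out as the line itself, so unmatched lines stay unchanged.
open my $list_fh, '<', \$filestore_list or die "Cannot open list: $!";
my (%entry_for, @order);
while (my $path = <$list_fh>) {
    chomp $path;
    next if $path eq '';
    $entry_for{$path} = $path;
    push @order, $path;          # remember the original order for output
}
close $list_fh;

# Pass 2: read the complex file serially, pull out the third field up to
# the '=', and overwrite the hash entry on a hit.
open my $complex_fh, '<', \$complex_file or die "Cannot open complex data: $!";
while (my $line = <$complex_fh>) {
    chomp $line;
    my ($path) = $line =~ /^\S+\s+\S+\s+([^=]+)=/ or next;
    $entry_for{$path} = $line if exists $entry_for{$path};
}
close $complex_fh;

print "$entry_for{$_}\n" for @order;
```

With the sample data, all three paths in the list have a matching entry, so three complex lines are printed and the entries for /fred/myfile3 and /bert/myfile2 are dropped.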

So, the question:

Can I use 'Tie::File' to generate the Hash (which makes this scalable to work with large files and small memory), should I work in memory, or is there some other Perl feature which will make this so easy that I will be embarrassed that I asked the question?

TIA

LGC

Nothing succeeds like a budgie with no teeth.

Re: Tie::File to create a Hash?
by Fletch (Bishop) on May 31, 2007 at 13:54 UTC

    You don't want Tie::File; you want to use DB_File or BerkeleyDB to create the hash-on-disk of the first file. You'd then read the second file line by line, extract the path, and use exists $tied_hash_on_disk{ $path } to check whether you should print the line or not.
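A rough sketch of that suggestion, assuming DB_File is installed (it needs the Berkeley DB library) and that file names contain no '='. The database file name paths.db and the variable names are made up for the example, and the data is the sample from the question rather than real files.

```perl
use strict;
use warnings;
use DB_File;                      # exports $DB_HASH
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);

# Keep the on-disk hash in a throwaway directory for this demonstration.
my $dir     = tempdir(CLEANUP => 1);
my $db_path = "$dir/paths.db";    # hypothetical file name

# The hash lives on disk, so memory use stays small however long the list is.
tie my %on_disk, 'DB_File', $db_path, O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie $db_path: $!";

# In real use these keys would be read line by line from the 'find' output.
$on_disk{$_} = 1 for qw(/fred/myfile /fred/myfile2 /bert/myfile);

my @complex_lines = split /\n/, <<'END';
user ALLFILES /fred/myfile=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
user ALLFILES /fred/myfile2=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
user ALLFILES /fred/myfile3=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
user ALLFILES /bert/myfile=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
user ALLFILES /bert/myfile2=/archive/dingbat/fred/myfile 3 6 9 thegoosedrankwine
END

# Keep only the lines whose path (third field, up to '=') is a key on disk.
my @kept;
for my $line (@complex_lines) {
    my ($path) = $line =~ /^\S+\s+\S+\s+([^=]+)=/ or next;
    push @kept, $line if exists $on_disk{$path};
}

print "$_\n" for @kept;
untie %on_disk;
```

Unlike the overwrite-the-hash plan in the question, this version only tests for membership and prints matching lines as it goes, so nothing from the complex file has to be stored.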

Re: Tie::File to create a Hash?
by citromatik (Curate) on May 31, 2007 at 14:20 UTC

    A simple solution to your problem doesn't involve Perl at all. Using the shell you can achieve this in a single line:

    $ sed 's/=/ /' complex_file | join -a 1 -1 1 -2 3 simple_file -

    This works only if both files are sorted on the join field, e.g.:

    $ mv simple_file simple_file.bk; sort simple_file.bk > simple_file
    $ mv complex_file complex_file.bk; sed 's/=/\t/' complex_file.bk | sort + -k 3,3r | sed 's/\t/=/' > complex_file

    Hope this helps!

    citromatik

      Interesting approach, and I can see where you are going.

      'join' seems to do more or less exactly what I want.

      Having 'sed' problems, though. The '/\t/' substitution seems to work on the character 't', not creating/removing a tab character as I expected.

      The resulting changes to the 't' in '/opt' give unexpected results.

      I will experiment further.

      Thanks

      Nothing succeeds like a budgie with no teeth.
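On the tab problem: not every sed implementation treats \t as a tab, which would explain the damage to the 't' in '/opt'. In Perl the \t escape in a substitution is a tab, so the same round trip can be sketched as below. The /opt path is an invented sample line in the format shown in the question, not data from the original files.

```perl
use strict;
use warnings;

# Hypothetical entry in the same format as the complex file.
my $line = 'user ALLFILES /opt/myfile=/archive/dingbat/opt/myfile 3 6 9 thegoosedrankwine';

# Replace the first '=' with a real tab character ...
(my $tabbed = $line) =~ s/=/\t/;

# ... and turn the first tab back into '=' afterwards.
(my $restored = $tabbed) =~ s/\t/=/;

print "round trip ok\n" if $restored eq $line;
```

The same substitutions can be run as filters from the shell with perl -pe in place of each sed call.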
Re: Tie::File to create a Hash?
by blazar (Canon) on May 31, 2007 at 13:57 UTC
    Can I use 'Tie::File' to generate the Hash (which makes this scalable to work with large files and small memory), should I work in memory, or is there some other Perl feature which will make this so easy that I will be embarrassed that I asked the question?

    Nope. From perldoc Tie::File:

    NAME
        Tie::File - Access the lines of a disk file via a Perl array

    Work in memory if you have enough of it. Otherwise, if memory usage would be too large, go the tie route, but with a suitable module rather than the one you mentioned. How 'bout DB_File, for example?
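To illustrate the point from the Tie::File documentation: the module presents a file as an array of lines, so entries are reached by line number, and finding a particular path means scanning the array. A small sketch, using a temporary file filled with the sample paths from the question (the variable names are invented for the example):

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Write the sample filestore list to a temporary file.
my ($fh, $list_path) = tempfile(UNLINK => 1);
print {$fh} "/fred/myfile\n/fred/myfile2\n/bert/myfile\n";
close $fh or die "Cannot close $list_path: $!";

# Tie::File maps the file onto an array: one element per line.
tie my @lines, 'Tie::File', $list_path
    or die "Cannot tie $list_path: $!";

my $count  = @lines;       # number of lines in the file
my $second = $lines[1];    # access is by index, not by key

# Looking up a path by content needs a linear scan, unlike a hash lookup.
my $found = grep { $_ eq '/bert/myfile' } @lines;

print "lines: $count, second: $second, found: $found\n";
untie @lines;
```

That linear scan is why a tied on-disk hash such as DB_File fits this problem better than Tie::File.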