in reply to Reg: Performance

Sounds like an ideal job for Tie::File::AsHash?

Anyway, you really want to avoid reading DUMP_B millions of times, once for every line in DUMP_A. It should be possible to read both files just once, build a hash for each keyed on the ID field, then iterate over the hash for DUMP_A and look up the corresponding entry in the hash for DUMP_B. That's why I suggested the module above (I haven't actually tried it, shame on me): it should let you treat each dump file like a hash and split on whatever character you want.
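Untested, but here is a minimal sketch of that read-once approach with plain hashes (assuming colon-separated records with the ID in the first field; adjust the split pattern and filenames to the real dump layout):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Slurp a dump file into a hash keyed on its first field.
    sub load_dump {
        my ($file) = @_;
        open my $fh, '<', $file or die "Can't open $file: $!";
        my %by_id;
        while ( my $line = <$fh> ) {
            chomp $line;
            my ( $id, @rest ) = split /:/, $line;
            $by_id{$id} = \@rest;
        }
        close $fh;
        return \%by_id;
    }

    my $dump_a = load_dump('DUMP_A');
    my $dump_b = load_dump('DUMP_B');

    # Single pass over DUMP_A's keys, constant-time lookups into DUMP_B.
    for my $id ( keys %$dump_a ) {
        if ( exists $dump_b->{$id} ) {
            # compare @{ $dump_a->{$id} } with @{ $dump_b->{$id} } here
        }
        else {
            print "$id only in DUMP_A\n";
        }
    }

That turns the repeated rescanning into two linear reads plus hash lookups.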

Re^2: Reg: Performance
by use perl::always (Initiate) on Oct 28, 2010 at 09:02 UTC

    Greetings,

    I realize that it is your intention to use perl for this task, and while I haven't seen either dump_a || dump_b

    I can't help but wonder if

    cat | sed | sort | uniq

    might not be of great help here.

    I run a _huge_ RBL with _millions_ of IP addresses.

    I constantly need to parse logs, and add/remove results from the block lists.

    While I began my strategy using perl scripts, I ultimately found that

    cat | sed | sort | uniq

    would accomplish the task in seconds as opposed to minutes/hours.
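
    For example, something along these lines (the log path and sed pattern here are made up, just to show the shape of the pipeline):

        cat /var/log/maillog | sed -n 's/.*\[\([0-9.]*\)\].*/\1/p' | sort | uniq -c | sort -rn

    That pulls the bracketed IPs out of each line, then sort | uniq -c | sort -rn counts the duplicates and ranks them by frequency.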

    Perhaps it's my perl skills. But I just thought it was worth mentioning.

    HTH

    --Chris

    Shameless self promotion follows
    PerlWatch