in reply to Efficiently parsing a large file

Using a hash as mentioned earlier is probably easiest, though it can get slow with huge numbers of elements -- even if you have to resort to tying a hash to a file. One low-tech way, if you are using some unix variant and have extra disk, may be to use the unix sort utility before you pass through the file with perl. Then all of your entries should appear in order, or at least close together (though you may have to play some games with sort's options to get the exact order you want).

I've used sort on files in the GB range with millions of lines of text. Note: one good sanity check, in addition to checking for errors from sort, is to make sure the sorted file is the same size as the input file. Update: Took out references to "GNU" since I'm not sure I've specifically used it on files that large (though it might work fine).
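
To make the idea concrete, here's a rough, untested sketch. It assumes the serial number is the first whitespace-separated field, uses made-up filenames, and folds in the sanity checks above; depending on your sort you may also want a "stable" option (-s on many implementations) if the original line order within a serial number matters.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $in     = 'big.log';       # made-up filenames
    my $sorted = 'big.sorted';

    # Group records by serial number; adjust -k to match your data,
    # and consider a stable sort if within-serial order matters.
    system('sort', '-k1,1', '-o', $sorted, $in) == 0
        or die "sort failed: $?";

    # Sanity check: no bytes should have gone missing.
    die "size mismatch: ", -s $in, " vs ", -s $sorted, "\n"
        unless -s $in == -s $sorted;

    # One pass over the sorted file; same-serial lines are now adjacent.
    open my $fh, '<', $sorted or die "can't open $sorted: $!";
    my ($current, @group);
    while (my $line = <$fh>) {
        my ($serial) = split ' ', $line;
        if (defined $current && $serial ne $current) {
            process_group($current, @group);   # all lines for one serial
            @group = ();
        }
        $current = $serial;
        push @group, $line;
    }
    process_group($current, @group) if @group;
    close $fh;

    sub process_group {
        my ($serial, @lines) = @_;
        # placeholder: decide here whether this set of lines is complete
        print "$serial: ", scalar @lines, " lines\n";
    }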

Re: Re: Efficiently parsing a large file
by jfroebe (Parson) on Apr 08, 2004 at 21:23 UTC

    Don't forget you could just read the output of sort directly through a pipe; that way you don't actually need to create another file to read in.
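
    An untested sketch of that, with the same assumptions about the filename and sort key as above (adjust -k to wherever the serial number lives):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Read sort's output straight from a pipe instead of writing a
        # sorted copy to disk first.
        open my $sorted, '-|', 'sort', '-k1,1', 'big.log'
            or die "can't start sort: $!";

        while (my $line = <$sorted>) {
            # same-serial lines now arrive together; process them here
        }

        close $sorted or die "sort failed: exit status $?";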

    Jason L. Froebe

    No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

      sort will create separate temp files anyway, which it then merges together and pipes to you. You really aren't gaining much since you need almost as much disk space. Also, pipes can slow down I/O for huge files. They are usually limited to a few KB of buffer, so you get a lot of back-and-forth context switching (i.e. sort feeds some bytes to you and then goes to sleep, you wake up and process them and then go to sleep, then sort feeds some more, etc.).
Re: Re: Efficiently parsing a large file
by pelagic (Priest) on Apr 08, 2004 at 21:31 UTC
    Sorting with good tools can be very efficient, but in this case you have to scan through the whole (sorted or unsorted) file once anyway to find the (in-)complete sets. As the file is nearly sorted anyway, it's probably most efficient not to sort it explicitly.

    pelagic
      My impression from the original post was that lines were sorted within a serial number, but there could be many lines of other serial numbers interspersed in between. If this is wrong, then you are correct.
        I said nearly sorted. Sorted / unsorted is not binary. This is not a randomised file, it's a logfile. All entries created by one specific transaction are sorted within themselves.
        That means for us that the number of not yet completed series is relatively low.

        pelagic
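
        A rough, untested sketch of that single-pass idea, with a made-up notion of what ends a set (the real test depends on the log format): keep only the not-yet-completed serial numbers in a hash, which stays small precisely because the file is nearly sorted.

            #!/usr/bin/perl
            use strict;
            use warnings;

            my %open;    # serial number => lines seen so far for that serial

            open my $fh, '<', 'big.log' or die "can't open log: $!";
            while (my $line = <$fh>) {
                my ($serial) = split ' ', $line;
                push @{ $open{$serial} }, $line;

                if ( is_complete($open{$serial}) ) {
                    handle_set($serial, $open{$serial});
                    delete $open{$serial};   # forget completed sets immediately
                }
            }
            close $fh;

            # whatever is left never completed
            handle_incomplete($_, $open{$_}) for keys %open;

            sub is_complete {
                my ($lines) = @_;
                # placeholder test: e.g. the last line marks the end of the set
                return $lines->[-1] =~ /END/;
            }
            sub handle_set        { my ($serial, $lines) = @_; print "complete: $serial\n"; }
            sub handle_incomplete { my ($serial, $lines) = @_; warn  "incomplete: $serial\n"; }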