downer has asked for the wisdom of the Perl Monks concerning the following question:

I am doing some calculations based on the graph generated by hyperlinks on the web. I have one huge file (~500GB), in which the nth line corresponds to all nodes in the graph adjacent to the nth node; the file is a list of adjacency lists.

For my calculations I need to access lines in this file at random, since this data is obviously way too much for any more sophisticated in-memory data structure. The question is: what is the fastest way to access lines in such a huge file at random?

The way I see it, I have a few options:

  1. linear search
  2. try my luck with Tie::File
  3. find the longest line, pad each line so all lines are the same length, and use byte offsets to access lines with seek (or something)
  4. find number of bytes on each line, maintain auxiliary data structure (BerkeleyDB?) storing byte to line offset

Ordinarily I would try to do this in C++, but since discovering Perl I have a high level of faith in its capabilities. This is my first post; I welcome your advice!


Replies are listed 'Best First'.
Re: Help performing "random" access on very very large file
by almut (Canon) on Jul 16, 2007 at 14:38 UTC

    I would go through the file once in order to create an auxiliary data structure holding the byte offsets to the individual lines. As that data structure will probably still be too large to fit in memory (depending on the average line length), you'll want to store that in a file — but this time with fixed record size, so you don't have the same problem as with the original file...

    You can then access a random line like this: multiply the line number by the record size, seek to that position in the auxiliary file, read the byte offset found there, and finally seek to that position in the original file.
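
    For illustration, a minimal sketch of that lookup, assuming the auxiliary file has already been written with one 8-byte offset per line (pack 'Q', so a Perl built with 64-bit support) and that the filenames are just placeholders:

        use strict;
        use warnings;

        # hypothetical filenames, for illustration only
        open my $idx,  '<', 'graph.idx' or die "graph.idx: $!";
        open my $data, '<', 'graph.txt' or die "graph.txt: $!";
        binmode $idx;

        sub line_at {
            my ($lineno) = @_;                    # 0-based line number
            seek $idx, $lineno * 8, 0 or die $!;  # fixed 8-byte records
            read $idx, my $packed, 8 or die "short read";
            my $offset = unpack 'Q', $packed;     # byte offset of that line
            seek $data, $offset, 0 or die $!;
            return scalar readline $data;         # the adjacency list itself
        }

        print line_at(12_345);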

      I believe I could possibly do both: split my one big file into, say, 1000 smaller files, and then find the byte offsets and create an auxiliary data structure for each. That way I could also spread the files across several disks, allowing the random accesses to be shared. When I write some code, I'll post it up! One problem is just getting an exact line count for the big file; wc -l is taking forever! For now I've estimated the line count from the total file size and the byte counts of a small sample of lines.
        I'm not sure you really need to know an exact line count to do this... Just put the first 1024 lines into the first file, the next 1024 into the second, and so on, ending up with however many files that gives you, distributed round-robin across the disks as evenly as possible. (The number of lines per file is arbitrary, of course, but I'm guessing 1024 would give a manageable number of files, and powers of 2 have the nice property that $line_number >> 10 determines the file to use without requiring the CPU to do an actual division.)
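
        A minimal sketch of that arithmetic (the chunk filenames and the number of disks are made up for illustration):

            use strict;
            use warnings;

            my @disks = map { "/disk$_" } 0 .. 3;   # assumed mount points

            sub locate {
                my ($lineno) = @_;                           # 0-based line number
                my $file_no   = $lineno >> 10;               # int($lineno / 1024)
                my $line_in_f = $lineno & 0x3FF;             # $lineno % 1024
                my $disk      = $disks[ $file_no % @disks ]; # round-robin
                return ( "$disk/chunk$file_no.txt", $line_in_f );
            }

            my ($file, $line_within_file) = locate(5_000_000);
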
Re: Help performing "random" access on very very large file
by ikegami (Patriarch) on Jul 16, 2007 at 14:57 UTC

    An option you didn't list is: Create an index. ( Oops, I guess that's an implementation of option 4. ) Basically, that's what Tie::File does, but Tie::File builds the index in memory whereas you'll be building it in a file.

    There are two advantages: Tie::File would easily require >1GB of memory in your situation, and if the index resides on disk, you build it once and use it multiple times, as long as the data file doesn't change.

    Read through the huge file writing the starting locations of every line into another file (in fixed-width binary, such as pack('N2', $high, $low) or pack('Q', $addr)).

    Then, for each random line number, you can seek to the corresponding (8-byte-aligned) position in the index file to learn where to seek in the data file.

    Note: Perl IO is supposedly quite slow — was this fixed? — so you could write a tiny C program to create the index file since that's pure IO and just as simple to write in C as in Perl.

    Note: This would require your Perl, its seek and its tell to handle 64-bit numbers.
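
    A sketch of building such an index in Perl, assuming a Perl whose tell and pack 'Q' handle 64-bit values (the filenames are placeholders); the tiny C program mentioned above would do the same thing:

        use strict;
        use warnings;

        open my $in,  '<', 'graph.txt' or die $!;
        open my $out, '>', 'graph.idx' or die $!;
        binmode $out;

        while (1) {
            my $pos = tell $in;                # start offset of the next line
            defined( my $line = <$in> ) or last;
            print $out pack 'Q', $pos;         # one fixed 8-byte record per line
        }
        close $out or die $!;

    A lookup then seeks to $line_number * 8 in the index file, unpacks the offset, and seeks there in the data file.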

Re: Help performing "random" access on very very large file
by rpanman (Scribe) on Jul 16, 2007 at 14:28 UTC
    How about splitting the file down into a number of smaller files, each with a set number of lines in it? You could end up with 1000 files of 500 lines each (for example). Then you just need a mechanism for breaking a line number (in the 500GB file) into a filename and a line number within that file; quicker access all round.

    The best way to eat an elephant is one bit at a time...
Re: Help performing "random" access on very very large file
by ctilmes (Vicar) on Jul 16, 2007 at 16:26 UTC
      They seem to have some good advice for indexing. I could probably just index every ~50 lines instead of every line, and then load that block into memory.

        One very effective optimisation for reducing the size of the index is to store just lengths, rather than full absolute offsets, for most of the lines.

        Say, for example, that you know all the lines in your 500GB file are no more than 255 characters long and, for calculation purposes, average 100 characters. If you build a full, absolute index using the pack 'd' template (required for offsets beyond 4GB), your index itself is going to be 500GB / 100 * 8 ~= 40GB!

        However, if you store the absolute index of every (say) 1000th line, and 999 x 1-byte lengths for the intervening lines, the index size becomes: 500 GB / (100*1000) * 1007 ~= 5GB.

        Each record within your index is then a fixed 1007 bytes, and the process of locating any individual line becomes:

          my( $recNo, $offset ) = ( int( $lineNo / 1000 ), ( $lineNo % 1000 ) - 1 );
          seek INDEX, $recNo * 1007, 0;           # each index record is 1007 bytes
          read INDEX, $idxBuf, 1007;
          my $absOffset = unpack 'd', $idxBuf;    # absolute offset of the block's first line
          my( $addOffset, $length ) = unpack "x[d] %32C$offset C", $idxBuf;
          seek BIGFILE, $absOffset + $addOffset, 0;
          read BIGFILE, $line, $length;

        By using the 'checksum' unpack template (%n-bits) to perform the summing of the intermediate lengths, the calculation of the composite offset is really very fast. This puts any line of your huge file just two reads away.

        Note: There is nothing special about the number 1000 I've used in this example except that it is convenient for calculations. As we are calculating a 32-bit offset (%32C), and representing lengths with 8-bit unsigned chars, the theoretical maximum absolute-offset spacing is 2**32 / 2**8 == 2**24 == 16M lines.

        A quick test suggests that it takes Perl ~ 60 milliseconds to calculate the sum of 16 MB chars. That's far less time than having to read even 1 extra data record.

        However, there is little extra saving to be had by further decreasing the frequency of the absolute offsets.

        • Using an absolute offset every 1000 lines gives an index size of: 5.035 GB

          The checksum time for 999 bytes: 112 microseconds.

        • Using an absolute offset every 64k lines gives an index size of: 5.00053405761719 GB

          The checksum time for 64k bytes: 320 microseconds.

        • Using an absolute offset every 16M lines gives an index size of: 5.00000208616257 GB

          The checksum time for 16M bytes: 60 milliseconds.

        So, you see there is little extra index space saving by decreasing the frequency of absolute offsets, but there is a considerable calculation penalty in doing so. You also have to factor in the extra time taken to read the larger index records from the index file.
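
        For completeness, a hedged sketch of how the 1007-byte index records described above might be built: one record per 1000-line block, holding the 8-byte 'd' offset of the block's first line followed by the 1-byte lengths (newline included) of its first 999 lines. The filenames are made up, lines are assumed to fit in 255 bytes, and line numbers are taken as 1-based to match the lookup arithmetic above (which, as written, doesn't cover the last line of each block):

            use strict;
            use warnings;

            open my $big, '<', 'big.txt' or die $!;
            open my $idx, '>', 'big.idx' or die $!;
            binmode $idx;

            my @len;            # lengths (newline included) of lines in the current block
            my $blockStart = 0; # byte offset of the current block's first line

            while (1) {
                my $pos  = tell $big;
                my $line = <$big>;
                if ( !defined $line or @len == 1000 ) {
                    if (@len) {
                        push @len, (0) x ( 1000 - @len );   # pad a short final block
                        print $idx pack 'd C999', $blockStart, @len[ 0 .. 998 ];
                    }
                    last unless defined $line;
                    @len        = ();
                    $blockStart = $pos;
                }
                push @len, length $line;
            }
            close $idx or die $!;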


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Help performing "random" access on very very large file
by dave_the_m (Monsignor) on Jul 16, 2007 at 14:23 UTC
    Well, some obvious questions are: how many times will you use the file (i.e. will you reuse the same file in later runs)? For each use, what percentage of the lines will you need to randomly seek to? How many times will you access each line?

    Dave.

      I will be making a lot of accesses into the file (~5 million?). I know that some lines will be accessed more often than others, but as far as I can tell now the accesses will be, at least initially, uniform. Lines will be looked up more than once, so some smart caching would be beneficial, but until I start making the calculations it's hard to tell which lines will be accessed more. I do have a few big disks, so I can make a few copies of the file and parallelize the task.
Re: Help performing "random" access on very very large file
by dsheroh (Monsignor) on Jul 16, 2007 at 14:50 UTC
    Of the options you listed, padding all lines to a fixed length or making an index of offsets would almost certainly be the fastest. Since you said you have plenty of disk to burn, padding would probably be the faster of the two: you'd only do one seek per line, versus a seek to find the offset in your index file/db and then a second seek in the actual data file. It comes at the cost of increased preprocessing time, though, since you'd need two passes over the data file (one to find the longest line, one to pad and rewrite every line), and the second pass rewrites the whole thing. It sounds like you'll be using the file enough for the gains in lookup time to outweigh the extra preprocessing time.
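
    A minimal sketch of the padded-file lookup, assuming every line has already been space-padded to a common record length (the record length and filename are invented for illustration):

        use strict;
        use warnings;

        my $RECLEN = 1024;                      # longest line + newline, by assumption
        open my $fh, '<', 'padded.txt' or die $!;

        sub line_at {
            my ($lineno) = @_;                  # 0-based line number
            seek $fh, $lineno * $RECLEN, 0 or die $!;
            read $fh, my $rec, $RECLEN or die "short read";
            $rec =~ s/ *\n?\z//;                # strip the padding and newline
            return $rec;
        }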

    And then there's the unlisted option 5: Stuff it into an actual database. They're designed to look up arbitrary records quickly, although that again gets into the question of whether you'll be reusing the same dataset often enough to be worth the time taken to insert and index all the data.

Re: Help performing "random" access on very very large file
by sgt (Deacon) on Jul 16, 2007 at 15:55 UTC

    An interesting question. Some slightly different ideas from the excellent ones already given.

    Depending on the total number of accesses relative to the total number of lines (Dave's question, which you didn't really answer), I would build either a full index (ikegami's idea) or a shallow index.

    With 500GB and an average line of, say, 1000 bytes, you still get a huge number of lines: 500M (5*10^8). A full index at 8 bytes per offset (you need more than 32 bits to address positions in a 500GB file) would be a 500M * 8 bytes = 4GB file. A shallow index with, say, one entry every 50 lines would occupy only 4GB / 50 = 80MB, a reasonable size. To go to line n you would seek to position int(n / 50) * 8 in the index, read the offset, seek there in the big file, and then skip forward past (n mod 50) EOL markers (which is essentially the Tie::File logic). A shallow index is interesting when the number of accesses is much smaller than the total number of lines in the main file.
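
    A sketch of that shallow lookup, assuming one 8-byte offset per 50 lines and a 64-bit Perl (the filenames are invented):

        use strict;
        use warnings;

        open my $idx,  '<', 'graph.sidx' or die $!;
        open my $data, '<', 'graph.txt'  or die $!;
        binmode $idx;

        sub line_at {
            my ($n) = @_;                           # 0-based line number
            my ( $entry, $skip ) = ( int( $n / 50 ), $n % 50 );
            seek $idx, $entry * 8, 0 or die $!;
            read $idx, my $buf, 8 or die "short read";
            seek $data, unpack( 'Q', $buf ), 0 or die $!;
            readline $data for 1 .. $skip;          # walk past the intervening EOLs
            return scalar readline $data;
        }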

    One other idea is to have a couple of processes (or more). One would be, say, a daemon listening on a given port, in charge of resolving actual offsets from the shallow index file, handling the randomness, and handing back a few lines. A simple protocol could be: send a line number and receive the lines, or send a count of lines and receive them. If you can arrange to have the same big data file on different partitions with different disk controllers, you could afford one such process per disk. The second (and main) process would be in charge of the analysis only. You could also implement recording of a session this way, and round-robin caching could be a nice optimization.

    cheers --stephan
Re: Help performing "random" access on very very large file
by shmem (Chancellor) on Jul 16, 2007 at 14:43 UTC
    I guess you don't have enough memory for using Tie::File on a 500GB file, since it slurps the entire file... uhm well, I have to acknowledge being misinformed here... thanks ikegami and blue_cowdawg.

    I'd indeed go with 4): BerkeleyDB (DB_BTREE), with the line number as key and the byte offset as value. That would allow you to do delta seeks back and forth. Splitting the one big file into a reasonable number of smaller chunks (depending on the number of file descriptors available) might be helpful, as rpanman already noted.
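
    A hedged sketch of that idea using DB_File's B-tree interface to Berkeley DB (the filenames are made up; the build is a one-off pass over the big file):

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;

        # line number => byte offset, kept in a B-tree on disk
        tie my %offset, 'DB_File', 'offsets.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
            or die "Cannot tie offsets.db: $!";

        # build once...
        open my $in, '<', 'graph.txt' or die $!;
        my $n = 0;
        while (1) {
            my $pos = tell $in;
            defined( my $line = <$in> ) or last;
            $offset{ $n++ } = $pos;
        }

        # ...then seek straight to any line later
        open my $data, '<', 'graph.txt' or die $!;
        seek $data, $offset{12_345}, 0 or die $!;
        my $adjacency = <$data>;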

    update: obvious correction

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
          I guess you don't have enough memory for using Tie::File on a 500GB file, since it slurps the entire file.

      Uhhhhmmmmm... no it don't! From the Tie::File man page:

          The file is not loaded into memory, so this will work even for gigantic files.
      I even remember looking at the source code for this module once to validate that statement. I used to think that the module loaded the whole file into memory but it doesn't.

      However, there is a lot of overhead associated with Tie::File for certain operations. For instance, I had some code once that looked something like:

      # considering @ry was the tie'd array:
      my $last_line = $#ry + 1;
      and found that if I visited that line multiple times it slowed my code way down. Just FWIW...


      Peter L. Berghold -- Unix Professional
      Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg

      I guess you don't have enough memory for using Tie::File on a 500GB file, since it slurps the entire file.

      If you're only reading from the file (as seems to be the case here), Tie::File doesn't slurp the file. (I don't know how it handles writes.) It does keep a cache of lines in memory, but the size of the cache is configurable.

      What it does do is keep in memory the offset of every line up to the last one accessed. scalar(@tied) and $#tied count as accessing the last line, so if you do something like $tied[ rand @tied ], Tie::File will read through the entire file to build an (in-memory) index of every line in the file. That index could easily take many GBs.
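
      For reference, the cache size is set through Tie::File's memory option when tying (the numbers here are arbitrary):

          use strict;
          use warnings;
          use Fcntl 'O_RDONLY';
          use Tie::File;

          tie my @lines, 'Tie::File', 'graph.txt',
              mode   => O_RDONLY,
              memory => 20_000_000      # cache at most ~20MB of lines
              or die "Cannot tie graph.txt: $!";

          my $adj = $lines[42];   # accessing index 42 builds the offset table
                                  # for lines 0..42 only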

Re: Help performing "random" access on very very large file
by Moron (Curate) on Jul 16, 2007 at 15:48 UTC
    Where the information exceeds what can be stored in memory, the most usual solution is some kind of DBMS. Relational databases are only semi-feasible for this kind of data structure, in the sense that you could store, say, 100,000 lines in a BLOB and index those; an approach I have heard talked about but never seen tried.

    The most common custom DBMS structure is indeed flat-file-based as has been mentioned.

    But the most useful Perl implementations of non-relational database architecture are probably going to be the DBM family of modules. In particular, DBM::Deep (which has an underlying flat-file storage) looks perfect for your requirement.
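
    A minimal sketch of that, using line numbers as keys (the filenames are placeholders, and the load is a one-off pass over the big file):

        use strict;
        use warnings;
        use DBM::Deep;

        my $db = DBM::Deep->new( 'adjacency.db' );   # flat-file backed hash on disk

        # one-off load: key = line number, value = that node's adjacency list
        open my $in, '<', 'graph.txt' or die $!;
        my $n = 0;
        while ( my $line = <$in> ) {
            chomp $line;
            $db->{ $n++ } = $line;
        }

        # random access later
        my $neighbours = $db->{ 12_345_678 };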

    __________________________________________________________________________________

    ^M Free your mind!
