http://qs1969.pair.com?node_id=440845


in reply to Re^3: Muy Large File
in thread Muy Large File

Okay. I attempted the suggestions above:

1) Tried increasing the buffer with $BUFSIZE ||= 2**31; but got this error: Negative length at ./test.pl line 9. Tried subtracting 1, as in $BUFSIZE ||= 2**31-1, and got this error: Out of memory during "large" request for 2147487744 bytes, total sbrk() is 108672 bytes at ./test.pl line 9. I then ran ulimit, which came back as 'unlimited'. I'm betting the SAs will not change the server config or recompile the kernel on my behalf, so is this a dead-end?

2) BrowserUk, I would like to test threads. If your system was not Solaris 8, please let me know how I can /msg you to get your test code. Or post it with a caveat.

The last question I had was about leveraging additional CPUs. Can we coax Perl into getting the OS to throw some additional CPUs onto the fire? Would this make any difference? Based on the time output above, is it fair to say that this process is completely IO-bound, meaning that adding CPUs would only increase IO WAIT?

Upon searching PerlMonks for other large-file challenges, I've seen references to Sys::Mmap and Parallel::ForkManager. If anyone has used either of these (or others) and feels strongly about one, please let me know.

Re^5: Muy Large File
by BrowserUk (Patriarch) on Mar 19, 2005 at 07:03 UTC
    Tried increasing $BUFSIZE ||= 2**31; but got this err Negative length at...

    As you probably worked out, the parameter is being used as a signed 32-bit value, and 2**31 rolls over to a negative value.
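    You can see the rollover for yourself by reinterpreting the bit pattern of 2**31 as a signed 32-bit integer (a quick illustration, not part of the original script):

        perl -e 'print unpack "l", pack "L", 2**31'    # prints -2147483648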

    Tried subtracting 1 in this manner $BUFSIZE ||= 2**31-1 and got this err Out of memory during "large" request for 2147487744 bytes, total sbrk() is 108672 bytes at ./test.pl line 9. I then ran ulimit which came back as 'unlimited'. I'm betting SA's will not change server config or recompile kernel on my behalf so is this a dead-end?

    I have no knowledge of Solaris at all, but I think that whilst your server has 16GB of RAM, it is probable that each process is limited to 2GB. This is a very common upper limit with 32-bit OSes. The theoretical upper limit is 4GB, but often the other 2GB of each process's virtual address space is reserved by the OS for its own purposes.

    For example, under NT, MS provide a set of APIs collectively known as "Address Windowing Extensions" that allow individual processes to access memory beyond the usual 2GB OS limit by allocating physical RAM and mapping parts of it into the 32-bit/4GB address space. But the application needs to be written to use this facility, and it comes with considerable restrictions.

    The point is that settling for 2**30 is probably the best you would be able to do without getting a lot more familiar with the internals of your OS.

    That said, I would try 2**21, 2**22 and 2**23 first, and carefully log the timings to see if using larger buffers actually results in better throughput. It is quite possible that the extra work required by the OS in marshalling that volume of contiguous address space will actually reduce your throughput. Indeed, you may find that you get just as good a throughput using 2**16 as you do with 2**20. It may even vary from run to run depending on the loading of the server and a whole host of other factors.

    Using ever larger buffers does not imply ever increasing throughput. It's fairly easy to see that if the standard read size is (say) 4k and you're processing a 50 GB file, then you're going to do 13 million read-search&modify-write cycles, and therefore incur 13 million times any overhead involved in that cycle.

    If you increase your buffer size to 2**20, then you reduce that repetition count to around 50 thousand and thereby reduce the effects of any overhead to about 0.4% of the original. And your OS will have no problems at all in allocating a 1 MB buffer, and reading the 1 MB from disk will easily happen in one timeslice, so there is little to negate the gain.

    If you increase your read size to 2**22, then your overheads reduce to about 0.1% of the original: only 25% of the 2**20 figure. Worth having, but diminishing returns. Allocating 4 MB will again be no problem, but will the read still complete in a single timeslot? Probably, but you may be pushing the boundary. I.e., it is possible that you will introduce an extra delay through missing a possible timeslot whilst waiting for IO completion.

    By the time you get to 2**30, your gain over the 1 MB case is very small, but you are now forcing the OS to marshal 1 GB of contiguous RAM for your process, which itself may cost many missed timeslots. And you are then asking the disk subsystem to read/write 1 GB at a time, which again will definitely introduce several missed timeslots in IOWAIT states. Overall, the gains versus losses will probably result in a net loss of throughput.

    There is no simple calculation that will allow you to determine the breakpoints, nor even estimate them, unless the machine is dedicated to this one task. The best you can do is time several runs at different buffer sizes and look for the trends. In this, looking to maximise your process's CPU load is probably your best indication of which direction you are heading. My best guess is that you will see little improvement above around 4 MB reads & writes, but the breakpoint may come much earlier, depending upon the disk subsystem as much as anything else.
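    If you want to automate that survey, a harness along these lines would do it. This is a sketch only: 'in.dat'/'out.dat' are placeholder filenames and the tr table is a stand-in for your real translation:

        #!/usr/bin/perl
        ## Sketch: time the same read/translate/write pass at several buffer sizes.
        use strict;
        use warnings;
        use Time::HiRes qw( time );

        for my $exp ( 16, 18, 20, 22, 24 ) {
            my $bufsize = 2**$exp;
            open my $in,  '<:raw', 'in.dat'  or die "in.dat: $!";
            open my $out, '>:raw', 'out.dat' or die "out.dat: $!";

            my $start = time;
            my( $buf, $cycles ) = ( '', 0 );
            while( sysread $in, $buf, $bufsize ) {
                $buf =~ tr[a-z][A-Z];       ## stand-in translation
                syswrite $out, $buf;
                ++$cycles;
            }
            printf "2**%2d: %8d cycles, %.2f secs\n", $exp, $cycles, time - $start;
        }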


    Now we come to multitasking. In all cases, the problem will come down to whether your OS+Perl can correctly manage sharing access to a single file from multiple concurrent threads-of-execution (threads or processes). I'm not familiar with the operation of SysV memory mapping, though I think it may be similar to Win32 File Mapping objects. These would certainly allow processes or threads to process different chunks of a large file concurrently in an efficient and coordinated manner, but the APIs are non-portable and require a pretty deep understanding of the OS in question to use. I don't have that for Solaris so cannot advise, but there is the Sys::Mmap module, and I noticed that PerlIO has a ':mmap' layer, though it doesn't work on my OS so I am unfamiliar with it.
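    For what it's worth, the Sys::Mmap route might look something like the sketch below. Untested here, and it assumes the module builds on your platform; 'in.dat' and the tr table are placeholders. tr/// modifies the mapped region in place without reallocating the scalar, which is what makes it safe to apply to a mapped buffer:

        use strict;
        use Sys::Mmap;

        open FH, '+<', 'in.dat' or die $!;
        my $map;
        mmap( $map, 0, PROT_READ|PROT_WRITE, MAP_SHARED, *FH )
            or die "mmap: $!";
        $map =~ tr[a-z][A-Z];     ## stand-in translation, written through to the file
        munmap( $map ) or die "munmap: $!";
        close FH;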


    Now to my threaded code. I have tried two different approaches to this.

    My first attempt tried to overlap the IO and processing by reading into buffers on one thread and doing the translation and writing on a second thread. The idea was that if the Perl/C runtime could handle this, I could then try using more buffers and balancing the number of read threads against the number of write threads to get the best throughput. On my OS, something is being cached somewhere such that the file gets corrupted.

    The code I tried was pretty unsophisticated, but was enough to convince me that it wouldn't work.

    My second attempt--which works (for me)--uses one thread to do all the reading, the main thread to do the transformation, and a third thread to do the writing. Again, the idea is to overlap the IO with the transformation, allowing the process to make best use of the timeslots it is allocated by utilising the time when the read/write threads are blocked in IO wait states to do other stuff. The data is passed between the threads via a couple of queues.

    The problem with this is that the iThreads model requires such shared data to be duplicated, and it also has to be synchronised. Whilst this works on my system, as I do not have multiple CPUs I cannot tell you whether it will result in greater throughput or not, nor whether it will continue to work correctly on a multi-CPU machine.

    So, I provide it only as an example--testing in your environment is down to you:).
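    In outline, the shape of it is as follows. This is a simplified sketch, not the full listing: placeholder filenames, a stand-in tr table, and a guessed buffer size:

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        my $BUFSIZE = 2**20;
        my $Qwork = Thread::Queue->new;    ## reader -> transform
        my $Qout  = Thread::Queue->new;    ## transform -> writer

        my $reader = threads->create( sub {
            open my $in, '<:raw', 'in.dat' or die $!;
            my $buf;
            while( sysread $in, $buf, $BUFSIZE ) {
                $Qwork->enqueue( $buf );
            }
            $Qwork->enqueue( undef );      ## end-of-data marker
        } );

        my $writer = threads->create( sub {
            open my $out, '>:raw', 'out.dat' or die $!;
            while( defined( my $buf = $Qout->dequeue ) ) {
                syswrite $out, $buf;
            }
        } );

        ## The transform runs in the main thread, overlapping the IO waits.
        while( defined( my $buf = $Qwork->dequeue ) ) {
            $buf =~ tr[a-z][A-Z];          ## stand-in translation
            $Qout->enqueue( $buf );
        }
        $Qout->enqueue( undef );

        $_->join for $reader, $writer;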

    Because of the way iThreads work, and the way this is coded, I would suggest sticking with fairly small buffers: 2**20 or 2**22 (and maybe 2**16 would be worth trying also).

    I'd appreciate any feedback you can give me if you do try this.

      Wow UK, you are da Monk! I hope you are a well-paid professor or architect somewhere, because you are obviously knowledgeable and helpful. It must have taken you quite some time to create your last response. Many, many thanks. Your tome has already helped in the following manner.

      Bigger is certainly not always better!
      I cut the original test file in half, to 4G, for the purposes of this test. I then changed the buffer size from the original 2**30 to the test sizes below. As you can see, going from 2**19 to 2**18 is pretty dramatic. In the range of 2**18 down to 2**15, performance seems to be best.
      Using 2**18, I went back and tested with the original 8G file, which now runs in an amazing 2m33s as opposed to the original 10m. Obviously 2**18 seems to be a good number for the least amount of work on this particular server. I am starting to understand better all of the data buckets between the HD controller, IO bus, OS, RAM and the code. Very interesting indeed. Seeing the smaller buffer size work faster shatters myths that I have held for many years. A sincere thanks to you and the others on this. For my part, I will evangelize this when the opportunity arises.

      Regarding your thread code, I will be playing with it over the next few days and will post my findings when complete. To be honest, this will take some time for me to dissect and understand, so my apologies if it seems delayed, as I am sometimes a smacktard.

      One question with the regular code. One of the requirements I have is to create a log that indicates which record the TR actually modified. Any ideas on how to do this whilst retaining the performance? It would seem that looping through BUFSIZE would make sense, except the fixed-width records will not perfectly align with the buffer size in most cases.

      Timings for time /apps/p_dm200/ndm_ip_pull/tmp/test.pl at each buffer size (the #N exponent means BUFSIZE = 2**N; 4G test file unless noted):

      BUFSIZE  file  real      user      sys
      2**24    4G    4m8.87s   0m53.57s  0m8.68s
      2**20    4G    4m25.99s  0m53.58s  0m7.56s
      2**19    4G    3m46.35s  0m53.61s  0m7.97s
      2**18    4G    1m16.36s  0m41.76s  0m32.58s
      2**18    4G    1m16.45s  0m41.64s  0m32.61s
      2**18    8G    2m33.92s  1m22.58s  1m6.50s
      2**17    4G    1m17.21s  0m41.64s  0m32.98s
      2**16    4G    1m18.92s  0m40.60s  0m35.95s
      2**16    4G    1m19.06s  0m41.74s  0m34.87s
      2**15    4G    1m20.50s  0m41.34s  0m36.93s
      2**14    4G    1m25.35s  0m41.45s  0m41.11s
      2**13    4G    1m33.98s  0m42.82s  0m48.49s
      2**12    4G    1m56.25s  0m47.20s  1m6.11s
      2**11    4G    2m25.52s  0m54.13s  1m28.47s
      2**10    4G    3m24.98s  1m4.65s   2m16.84s
      2**8     4G    9m1.04s   2m9.46s   6m44.87s

        One of the requirements I have is to create a log that indicates which record the TR actually modified. Any ideas on how to do this whilst retaining the performance?

        I would make the buffer size a multiple of the fixed record size. The non-power-of-two-ness may have a slight impact on the performance, but it will probably be negligible. I would then perform the translation on record-sized chunks of the buffer, using substr as an lvalue; something like:

        my $recno = 0;
        while( sysread $FH, $buffer, $RECSIZE * $MULTIPLE ) {
            my $readPos = sysseek $FH, 0, 1;    ## simulate "systell()".
            for( 0 .. $MULTIPLE - 1 ) {
                if( my $changed = substr( $buffer, $_ * $RECSIZE, $RECSIZE )
                        =~ tr[...][...] ) {
                    print LOG "Changed $changed chars in record: ", $recno + $_;

                    ## Calculate position of the modified record.
                    my $writePos = ( $recno + $_ ) * $RECSIZE;  ## Check this calc! Untested!

                    sysseek  $FH, $writePos, 0;
                    syswrite $FH, substr( $buffer, $_ * $RECSIZE, $RECSIZE );
                    sysseek  $FH, $readPos, 0;  ## Restore read position since we moved it.
                }
            }
            $recno += $MULTIPLE;
        }

        There are a few things to note here:

        • The read is a multiple of the fixed record size.
        • The records are translated in-place, but 1 at a time by using substr as an lvalue to step through the buffer.
        • tr/// returns a count of the modifications it makes thereby avoiding the need to make two passes.
        • I've shown only the modified records being re-written--and individually.

          Whether this is a good strategy will depend upon the frequency of modification.

          • If the frequency is low, re-writing small, sparse modifications should give a net gain over re-writing everything.
          • If the frequency is high, then rewriting the whole buffer in a single pass will be quicker.

            Even then, if some buffers do not require any modification, avoiding re-writing those will pay a double benefit: avoiding the need to back up the read position as well as avoiding the actual write.

            You could make this decision dynamically. Build an array of the modified record numbers as you do the translation and defer the re-writing until you have processed a complete buffer. If the proportion of the $MULTIPLE records that were modified is greater than some cutoff, re-write the entire buffer; otherwise, write just the modified records individually.

            Implementing this, and deciding the breakpoints is left as an exercise for the reader :)
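            Though as a starting point, the shape might be something like this. Untested, reusing the names from the snippet above; the cutoff is a guess you would tune by measurement:

                my $CUTOFF = 0.25;    ## guess: re-write whole buffer above 25% modified
                my $recno  = 0;
                while( sysread $FH, $buffer, $RECSIZE * $MULTIPLE ) {
                    my $readPos = sysseek $FH, 0, 1;
                    my $recs    = length( $buffer ) / $RECSIZE;  ## < $MULTIPLE on last read
                    my @modified;
                    for( 0 .. $recs - 1 ) {
                        push @modified, $_
                            if substr( $buffer, $_ * $RECSIZE, $RECSIZE ) =~ tr[...][...];
                    }
                    if( @modified ) {
                        print LOG "Modified record: ", $recno + $_, "\n" for @modified;

                        if( @modified / $recs > $CUTOFF ) {
                            ## Dense: back up and re-write the whole buffer in one go.
                            sysseek  $FH, $readPos - length( $buffer ), 0;
                            syswrite $FH, $buffer;
                        }
                        else {
                            ## Sparse: re-write just the modified records.
                            for my $rec ( @modified ) {
                                sysseek  $FH, ( $recno + $rec ) * $RECSIZE, 0;
                                syswrite $FH, substr( $buffer, $rec * $RECSIZE, $RECSIZE );
                            }
                            sysseek $FH, $readPos, 0;  ## restore the read position
                        }
                    }
                    $recno += $recs;
                }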


Re^5: Muy Large File
by tlm (Prior) on Mar 20, 2005 at 14:42 UTC
    please let me know how I /msg you to get your test code.

    See here and here.
