PerlMonks  

Re^6: Muy Large File

by BuddhaLovesPerl (Sexton)
on Mar 20, 2005 at 23:45 UTC ( #441078=note )


in reply to Re^5: Muy Large File
in thread Muy Large File

Wow UK, you are da Monk! I hope you are a well paid professor or architect somewhere because you are obviously knowledgeable and helpful. It must have taken you quite some time to create your last response. Many many thanks. Your tome has already helped in the following manner.

Bigger is certainly not always better!
I cut the original test file in half to 4G for the purposes of this test. I then changed the buffer size from the original 2**30 to the below test sizes. As you can see below, going from 2**19 to 2**18 is pretty dramatic. In the range of 2**18 to 2**15, performance seems to be best.
Using 2**18, I went back and tested with the original 8G, which now runs in an amazing 2m33s as opposed to the original 10m. So 2**18 seems to be the buffer size that makes this particular server do the least work. I am starting to understand better all of the data buckets between the HD controller, IO bus, OS, RAM and the code. Very interesting indeed. Seeing the smaller buffer size work faster shatters the myths that I have held for many years. A sincere thanks to you and the others on this. For my part, I will evangelize this when the opportunity arises.
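For anyone wanting to repeat this kind of comparison on their own hardware, here is a minimal sketch of a read loop timed at several buffer sizes. It is not the test.pl timed in this post: it only reads (no tr, no write-back), the 4 MB scratch file is a stand-in for the real data, and the name read_with_buffer and the list of exponents are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Time::HiRes qw(time);

# Read the whole file with sysread using a buffer of 2**$exp bytes
# and return the total number of bytes read.
sub read_with_buffer {
    my( $path, $exp ) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my( $buffer, $total ) = ( '', 0 );
    while( my $got = sysread $fh, $buffer, 2**$exp ) {
        $total += $got;
    }
    close $fh;
    return $total;
}

# Hypothetical stand-in for the real data: a 4 MB scratch file.
my( $tmp, $path ) = tempfile( UNLINK => 1 );
binmode $tmp;
print {$tmp} 'x' x ( 4 * 2**20 );
close $tmp;

# Time one full pass per buffer size; the numbers depend on the machine.
for my $exp ( 20, 18, 16, 12 ) {
    my $start = time;
    my $bytes = read_with_buffer( $path, $exp );
    printf "2**%-2d  %8d bytes  %.4fs\n", $exp, $bytes, time - $start;
}
```

On a file this small everything will come out of the OS cache, so treat it as a harness to point at a realistically large file rather than as a source of meaningful numbers.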

Regarding your thread code, I will be playing with it over the next few days and will post my findings when complete. To be honest, this will take some time for me to dissect and understand so my apologies if it seems delayed as I am sometimes a smacktard.

One question with the regular code. One of the requirements I have is to create a log that indicates which record the TR actually modified. Any ideas on how to do this whilst retaining the performance? It would seem that looping through BUFSIZE would make sense, except the fixed width records will not perfectly align with the buffer size in most cases.

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #24

real 4m8.87s
user 0m53.57s
sys 0m8.68s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #20

real 4m25.99s
user 0m53.58s
sys 0m7.56s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #19

real 3m46.35s
user 0m53.61s
sys 0m7.97s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #18

real 1m16.36s
user 0m41.76s
sys 0m32.58s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #18

real 1m16.45s
user 0m41.64s
sys 0m32.61s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #18 (8G)

real 2m33.92s
user 1m22.58s
sys 1m6.50s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #17

real 1m17.21s
user 0m41.64s
sys 0m32.98s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #16

real 1m18.92s
user 0m40.60s
sys 0m35.95s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #16

real 1m19.06s
user 0m41.74s
sys 0m34.87s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #15

real 1m20.50s
user 0m41.34s
sys 0m36.93s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #14

real 1m25.35s
user 0m41.45s
sys 0m41.11s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #13

real 1m33.98s
user 0m42.82s
sys 0m48.49s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #12

real 1m56.25s
user 0m47.20s
sys 1m6.11s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #11

real 2m25.52s
user 0m54.13s
sys 1m28.47s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #10

real 3m24.98s
user 1m4.65s
sys 2m16.84s

time /apps/p_dm200/ndm_ip_pull/tmp/test.pl #8

real 9m1.04s
user 2m9.46s
sys 6m44.87s

Replies are listed 'Best First'.
Re^7: Muy Large File
by BrowserUk (Patriarch) on Mar 21, 2005 at 00:48 UTC
    One of the requirements I have is to create a log that indicates which record the TR actually modified. Any ideas on how to do this whilst retaining the performance?

    I would make the buffer size a multiple of the fixed record size. The non-power-of-two-ness may have a slight impact on the performance, but it will probably be negligible. I would then perform the translation on record-sized chunks of the buffer, by using substr as an lvalue; something like:

    my $recno = 0;
    while( sysread $FH, $buffer, $RECSIZE * $MULTIPLE ) {
        my $readPos = sysseek $FH, 0, 1; ## simulate "systell()".
        ## The final read may be short, so count the records actually in the buffer.
        for( 0 .. int( length( $buffer ) / $RECSIZE ) - 1 ) {
            if( my $changed = substr( $buffer, $_ * $RECSIZE, $RECSIZE ) =~ tr[...][...] ) {
                print LOG "Changed $changed chars in record: ", $recno + $_, "\n";
                # Calculate positions of modified record.
                my $writePos = ( $recno + $_ ) * $RECSIZE; ## Check this calc! Untested!
                sysseek $FH, $writePos, 0;
                syswrite $FH, substr( $buffer, $_ * $RECSIZE, $RECSIZE );
                sysseek $FH, $readPos, 0; ## Restore read position if we moved it.
            }
        }
        $recno += $MULTIPLE;
    }

    There are a few things to note here:

    • The read is a multiple of the fixed record size.
    • The records are translated in-place, but 1 at a time by using substr as an lvalue to step through the buffer.
    • tr/// returns a count of the modifications it makes thereby avoiding the need to make two passes.
    • I've shown only the modified records being re-written--and individually.

      Whether this is a good strategy will depend upon the frequency of modification.

      • If the frequency is low, re-writing small, sparse modifications should give a net gain over re-writing everything.
      • If the frequency is high, then rewriting the whole buffer in a single pass will be quicker.

        Even then, if some buffers do not require any modification, avoiding re-writing those pays a double benefit: it avoids the need to back up the read head as well as the actual write.

        You could make this decision dynamically. Build an array of the modified record numbers as you do the translation and defer the re-writing until you have processed a complete buffer. If the proportion of modified records out of $MULTIPLE is greater than some cutoff, re-write the entire buffer, else do just the modified records individually.

        Implementing this, and deciding the breakpoints is left as an exercise for the reader :)
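One way the dynamic decision described above might be sketched is shown below. This is an illustration under stated assumptions, not tested against the original poster's data: $RECSIZE, $MULTIPLE, the $CUTOFF of 0.5, the sub name translate_file and the example translation tr[a-c][A-C] are all hypothetical, and a small scratch file stands in for the real one. The right cutoff would have to be found by measurement.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

my $RECSIZE  = 8;     # hypothetical fixed record length
my $MULTIPLE = 4;     # records per buffer
my $CUTOFF   = 0.5;   # hypothetical breakpoint; tune by measurement

# Translate a file in place; return the list of modified record numbers.
sub translate_file {
    my( $fh ) = @_;
    my( $buffer, $recno, @log ) = ( '', 0 );
    while( my $got = sysread $fh, $buffer, $RECSIZE * $MULTIPLE ) {
        my $bufPos = $recno * $RECSIZE;        # file offset where this buffer starts
        my $nrecs  = int( $got / $RECSIZE );   # the last read may be short
        my @dirty;
        for my $i ( 0 .. $nrecs - 1 ) {
            # Example translation only; tr returns the count of changed chars.
            push @dirty, $i
                if substr( $buffer, $i * $RECSIZE, $RECSIZE ) =~ tr[a-c][A-C];
        }
        if( @dirty ) {
            if( @dirty / $nrecs > $CUTOFF ) {
                # Many changes: one seek and one write of the whole buffer.
                sysseek $fh, $bufPos, SEEK_SET;
                syswrite $fh, $buffer;
            }
            else {
                # Few changes: write only the modified records.
                for my $i ( @dirty ) {
                    sysseek $fh, $bufPos + $i * $RECSIZE, SEEK_SET;
                    syswrite $fh, substr( $buffer, $i * $RECSIZE, $RECSIZE );
                }
            }
            # Put the read position back at the end of this buffer.
            sysseek $fh, $bufPos + $got, SEEK_SET;
        }
        push @log, map { $recno + $_ } @dirty;
        $recno += $nrecs;
    }
    return @log;
}

# Scratch file of six 8-byte records; records 1, 4 and 5 will be modified.
my( $fh, $path ) = tempfile( UNLINK => 1 );
binmode $fh;
my $data = join '', 'xxxxxxxx', 'xxaxxxxx', 'xxxxxxxx',
                    'xxxxxxxx', 'bbbbbbbb', 'cxxxxxxx';
syswrite $fh, $data;
sysseek $fh, 0, SEEK_SET;

my @modified = translate_file( $fh );
print "Modified records: @modified\n";   # prints: Modified records: 1 4 5
```

In this run the first buffer has one modified record out of four, so only that record is written back; the second buffer has two out of two, so it is rewritten whole. Untouched buffers cost no seeks at all.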


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?
