mikeraz has asked for the wisdom of the Perl Monks concerning the following question:

Update:
The culprit is the filesystem, the OS, or the C library on the host platform. Perl is able to handle everything the host is able to throw at it, but the host system is spraining itself on the dataset. I've re-created the problem environment on other systems (different host OS) and have had no problems working with files up to (at this time) 18G.

I've just returned from perusing the archives here without finding an appropriate response to my problem. I have the debug output from a process and it's some 11G in size. This is on a Solaris system with largefile support:

ettest:/opt/qipgen_log $cat /etc/mnttab
...
/dev/dsk/c0t9d0s0  /a  ufs  rw,intr,largefiles,xattr,onerror=panic,suid,dev=800040  1123102217
ettest:/opt/qipgen_log $perl -V | less
Summary of my perl5 (revision 5.0 version 6 subversion 1) configuration:
  Platform:
    osname=solaris, osvers=2.9, archname=sun4-solaris-64int
    uname='sunos localhost 5.9 sun4u sparc sunw,ultra-1'
    ...
    useperlio=undef d_sfio=undef uselargefiles=define usesocks=undef
    ...
  Compiler:
    cc='cc', ccflags ='-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
split failed after some 44,000,000 lines; Perl is getting to ~43,835,940 lines in a

    perl -ne '$prn++ if /18:04:54.631:/; print if $prn' FILENAME

construct.

Because I have the first 44,000,000 lines extracted through split, the objective has been to pull off the last X lines (tail -1000000 FILENAME also fails).
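For what it's worth, on a box where the offsets actually work past 4GB, pulling the tail off would just be a matter of seeking backwards from the end rather than reading 44,000,000 lines forward. A minimal sketch, with the filename and the 64MB window being placeholders:

#!/usr/bin/perl
# Sketch only: grab roughly the last chunk of a huge file by seeking
# backwards from the end, then print whole lines. This presumes a perl
# and C library whose offsets really do work past 4GB, which is exactly
# what is in question here.
use strict;
use warnings;
use Fcntl ':seek';

my $file   = 'FILENAME';         # placeholder
my $window = 64 * 1024 * 1024;   # how far back from the end to start

open my $fh, '<', $file or die "cannot open $file: $!";
seek $fh, -$window, SEEK_END or die "seek failed: $!";
my $partial = <$fh>;             # discard the (probably partial) first line
print while <$fh>;               # print the remaining whole lines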

Suggestions?

Michael 'yes, next time I'll use the option for the source program to make lots of smaller files' R

Be Appropriate && Follow Your Curiosity

Replies are listed 'Best First'.
Re: Largefile failure
by sk (Curate) on Aug 04, 2005 at 22:39 UTC
    I work with such large files too, but I use data processing software (such as SAS and other proprietary ones).

    Anyways here are some wacky thoughts!

    0. Autoflush using some perl IO modules? or  $|++?

    1. Compress the file and zcat the file to perl? (See the sketch after this list.)

    2. You are able to read a certain number of lines in perl. Keep track of the bytes read and seek past them on the next run after a failure? Or split the job into many pieces, each handling its own seek range?

    3. Load it into MySQL and then access?

    (Never tried a file that large in MySQL, though.)
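
    A minimal sketch of point 1, assuming the log has been run through compress(1) first (the QAPI.0.log.Z name is just a placeholder); perl then only reads a sequential pipe and never has to handle an on-disk offset past 4GB itself:

    #!/usr/bin/perl
    # Sketch for point 1: stream the compressed log through zcat so perl
    # reads from a pipe instead of seeking around an 11GB file on disk.
    # Filename is a placeholder; the trigger pattern is the one from your
    # perl -ne one-liner.
    use strict;
    use warnings;

    open my $zcat, "zcat QAPI.0.log.Z |" or die "cannot start zcat: $!";

    my $printing = 0;
    while (<$zcat>) {
        $printing = 1 if /18:04:54\.631:/;   # start printing at the trigger line
        print if $printing;
    }
    close $zcat;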

    Good luck, and let us know if you find any nifty workarounds!!

    -SK

    PS: In unix, can you run ulimit and make sure you don't have a filesize limit on your session?

Re: Largefile failure
by BrowserUk (Patriarch) on Aug 05, 2005 at 03:13 UTC

    The fact that both split and perl are failing at approximately the same place suggests that your C-lib has a problem with files over 4 GB.

    44,000,000 lines comes to approximately 4GB if the lines average about 97 characters each.
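
    (As a quick check of the arithmetic: 2^32 bytes = 4,294,967,296, and 4,294,967,296 / 44,000,000 lines is roughly 98 bytes per line, i.e. about 97 characters plus the newline, so a 4GB offset limit fits the observed failure point nicely.)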

    Can you make the file visible, via the network, to a system known to handle files > 4GB? It would be slow, but perhaps easier than upgrading the original system.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Largefile failure
by borisz (Canon) on Aug 04, 2005 at 22:05 UTC
    I had a similar problem. My solution was to compile a recent perl on the machine in question. I do not know whether the fix was the recompile itself or the newer 5.8 perl.
    Boris

      If only I had access to a compiler on this system . . .

      Be Appropriate && Follow Your Curiosity
Re: Largefile failure
by mikeraz (Friar) on Aug 05, 2005 at 15:07 UTC

    sk suggested checking with ulimit

    ettest:/usr/local/tmp $ulimit
    unlimited
    Lying scum. As BrowserUK pointed out, the failures are happening around the 4GB mark.

    I wrote a bit of code to test seeking and reading:

    #!/usr/bin/perl
    # try seek to get around my problems with 11G file
    # $have value derived from the total size of the splits I've extracted
    $have = 4912152576;
    open BIG, "<QAPI.0.log" or die "cannot open QAPI ... $!";   # $! holds the open error, not $@
    $ret = sysseek BIG, $have, 0;
    print STDERR "sysseeked to $ret \n";
    $ret = sysread BIG, $data, 1024;
    print STDERR "sysread $ret\n";
    print $data;
    And that fails.
    Folks, thank you for the suggestions. I'm going to try to compress(1) the file. If that fails, I'm going to scrub it, redo the process, and regenerate the log as a series of smaller files. I can't take up any more of my employer's time on this one-off problem.
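
    For next time, a minimal sketch of the smaller-files idea, assuming the source program can write its debug output to stdout rather than a single file; the chunk size and the qapi_log prefix are arbitrary stand-ins:

    #!/usr/bin/perl
    # Sketch: pipe the regenerated debug output through this filter to
    # roll over to a new output file every $chunk lines, so no single
    # piece ever gets near 4GB. Chunk size and name prefix are arbitrary.
    use strict;
    use warnings;

    my $chunk = 5_000_000;   # lines per output file
    my $lines = 0;           # lines written so far
    my $part  = 0;           # current file number
    my $out;

    while (<STDIN>) {
        if ($lines % $chunk == 0) {
            close $out if $out;
            my $name = sprintf "qapi_log.%03d", $part++;
            open $out, '>', $name or die "cannot open $name: $!";
        }
        print $out $_;
        $lines++;
    }
    close $out if $out;

    Something along the lines of  source_program | perl this_filter.pl  (both names being stand-ins) would keep every piece a comfortable distance under the 4GB line.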

    Now to recreate it at home...

    Be Appropriate && Follow Your Curiosity