ezra has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I'm running into a bizarre rewinding file handle problem on Solaris 8, running a 32-bit Perl 5.8.4 with PerlIO and large files. I'm reading records from a large input file through an IO::File handle, periodically pausing the load, forking off some work, and then continuing the read.

Totally run-of-the-mill stuff, nothing special at all. I started noticing that I was processing more records than wc -l reports are in the file. None of my test cases can reproduce this, and it didn't happen while walking through the debugger. To make matters worse, the number of records the file "grows" by varies, but tends to be in the range of 500 per 100k. Notes:

If I call tell after each pause and before each continuation, I get:

20040928 13:51:27 startProcessInputFile(PHILLY): pausing; current file offset is 11530000
20040928 13:51:40 continueInputLoad(PHILLY): continuing; current file offset is 11477968
20040928 13:52:04 continueInputLoad(PHILLY): pausing; current file offset is 23064612
20040928 13:52:19 continueInputLoad(PHILLY): continuing; current file offset is 22977360
20040928 13:52:53 continueInputLoad(PHILLY): pausing; current file offset is 34602683
20040928 13:53:07 continueInputLoad(PHILLY): continuing; current file offset is 34593093
20040928 13:53:33 continueInputLoad(PHILLY): pausing; current file offset is 46134989
20040928 13:53:47 continueInputLoad(PHILLY): continuing; current file offset is 46040223
20040928 13:54:11 continueInputLoad(PHILLY): pausing; current file offset is 57668448
20040928 13:54:24 continueInputLoad(PHILLY): continuing; current file offset is 57664020
20040928 13:54:50 continueInputLoad(PHILLY): pausing; current file offset is 69205366
20040928 13:55:06 continueInputLoad(PHILLY): continuing; current file offset is 69145548
20040928 13:55:38 continueInputLoad(PHILLY): pausing; current file offset is 80739978
20040928 13:55:52 continueInputLoad(PHILLY): continuing; current file offset is 80667412
20040928 13:56:18 continueInputLoad(PHILLY): pausing; current file offset is 92271131
20040928 13:56:35 continueInputLoad(PHILLY): continuing; current file offset is 92205201
20040928 13:57:00 continueInputLoad(PHILLY): pausing; current file offset is 103808049
20040928 13:57:16 continueInputLoad(PHILLY): continuing; current file offset is 103807699
20040928 13:57:42 continueInputLoad(PHILLY): pausing; current file offset is 115343814
20040928 13:57:55 continueInputLoad(PHILLY): continuing; current file offset is 115293110
20040928 13:58:30 continueInputLoad(PHILLY): pausing; current file offset is 126879579
20040928 13:58:45 continueInputLoad(PHILLY): continuing; current file offset is 126807910
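
(For reference, the pause/continue instrumentation boils down to something like the sketch below; the real code is the client's, so the input file and surrounding details are hypothetical -- only the tell() calls and the log format mirror the output above.)

    use strict;
    use warnings;
    use IO::File;
    use POSIX qw(strftime);

    my $fh = IO::File->new('< input.dat')    # hypothetical input file
        or die "open: $!";

    sub logmsg {
        my ($func, $msg) = @_;
        printf "%s %s(PHILLY): %s\n",
            strftime('%Y%m%d %H:%M:%S', localtime), $func, $msg;
    }

    # ... read a batch of records via $fh->getline ...
    logmsg('continueInputLoad', 'pausing; current file offset is ' . $fh->tell);
    # ... fork off the workers and wait for them ...
    logmsg('continueInputLoad', 'continuing; current file offset is ' . $fh->tell);
    # The two offsets should match; on this box the second comes back lower.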

The obvious workaround is to stash the tell offset and just seek to it when I'm done, but the client is a large bank, and I'm not up for kludging this issue. Any ideas or insight would be greatly appreciated. Thanks.
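
(For the record, that kludge would amount to something like the sketch below -- fork details elided, untested against the real code:)

    my $pos = $fh->tell;      # stash the offset before pausing/forking
    # ... fork the children, do the other work ...
    $fh->seek($pos, 0)        # 0 == SEEK_SET: force the handle back
        or die "seek: $!";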

UPDATE
Fixed by exporting PERLIO=perlio, which gets Perl past the 255-open-file-descriptor limit of Solaris's 32-bit stdio. A tangential question spawned out of this: I need to do some more forensics to determine what sort of errors to expect when a forked process can't grab STDERR to complain, and why this affected my FH rather than just failing downstream. For now, It Works, so I'm relatively happy. Thanks for the feedback.
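
(For the curious: the environment variable makes Perl use its own perlio layer instead of Solaris's 32-bit stdio, whose FILE structure can't address file descriptors above 255. An untested per-handle equivalent would be to request the layer explicitly at open time:)

    # Ask for the pure-Perl I/O layer on this one handle (perl 5.8+);
    # $file stands in for whatever path the app is loading.
    open my $fh, '<:perlio', $file
        or die "open $file: $!";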

Ezra

Re: drifting IO::File offset
by tachyon (Chancellor) on Sep 29, 2004 at 04:25 UTC

    Given that it is not repeatable on any other OS, or even in a test case, the two obvious fixes are to kludge it by stashing the value of tell() or to drop the use of IO::File. I have never been able to understand why people use a module that seems to me a totally useless use of OO, providing nothing but syntactic sugar. For the interested, the offset drift is unusual in that it varies wildly and is nothing like common buffer sizes +/- a line.

    11530000 -> 11477968 = -52032
    23064612 -> 22977360 = -87252
    34602683 -> 34593093 = -9590
    46134989 -> 46040223 = -94766
    57668448 -> 57664020 = -4428
    69205366 -> 69145548 = -59818
    80739978 -> 80667412 = -72566
    92271131 -> 92205201 = -65930
    103808049 -> 103807699 = -350
    115343814 -> 115293110 = -50704
    126879579 -> 126807910 = -71669

    I think graff's idea that you may simultaneously be mixing calls that use and bypass stdio could be at the heart of the problem.

    cheers

    tachyon

      Thanks for posting the diffs. I calculated them, but just ended up rubbing my eyes, cursing, and getting another cup of coffee instead of doing anything useful with the info.

      Re: using a wrapper module: this code was written and in production under 5.00503 for a while, and back then I was using FileHandle to stash FHs as object members. It's been low on my refactoring list, up until this particular deployment.

Re: drifting IO::File offset
by graff (Chancellor) on Sep 29, 2004 at 04:10 UTC
    I'm afraid I don't have any insight about the problem -- seems like it shouldn't be happening (the most hateful sort of bug).

    So, the only thing happening between the "pausing" tell call and the "continuing" tell call is a few fork calls? I remember folks telling me recently that forked processes will share memory with the parent, but what you're seeing still should not happen.

    If you conclude that forking is somehow triggering this behavior (and you can't convince the bankers to switch to Linux ;), then IMHO it would not be viewed as "kludgey" to comment in your code on the apparent instability of IO::File offset pointers when combined with forking, call tell() before the forking is done (and even close the file), then reopen and seek afterward.
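
    Something like this, say (a sketch; $path and the fork details are placeholders):

        use Fcntl qw(SEEK_SET);

        my $pos = $fh->tell;              # remember where we are
        $fh->close;                       # don't let the children inherit it

        # ... fork and reap the children here ...

        $fh = IO::File->new("< $path")    # reopen the input file...
            or die "reopen $path: $!";
        $fh->seek($pos, SEEK_SET)         # ...and pick up where we left off
            or die "seek: $!";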

    I can understand why you don't post a sample of your code in this case, but something to consider is to create a test-case script that you think might isolate the problem -- remove all "irrelevant" detail, and limit it to open file; while (whatever) {read 10K records; tell; fork...; tell }
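
    Fleshed out, that skeleton might look like this (hypothetical -- adjust the file name, record count, and fork behavior to match the real app):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use IO::File;

        my $fh = IO::File->new('< big_input.dat')    # hypothetical test file
            or die "open: $!";

        while (1) {
            my $n = 0;
            while ($n < 10_000 and defined $fh->getline) { $n++ }
            last unless $n;                          # hit EOF

            my $before = $fh->tell;
            if (my $pid = fork) { waitpid($pid, 0) } # parent waits
            elsif (defined $pid) { exit 0 }          # child just exits
            else { die "fork: $!" }
            my $after = $fh->tell;

            warn "offset drifted by @{[ $after - $before ]}\n"
                if $after != $before;
        }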

    If the most minimal script does not reproduce the problem, start adding in details from the target app. At some point, you'll find the thing in your code that you thought wasn't there or wasn't relevant, etc. (At least, one can hope...)

    (update: the only other issue I could imagine being relevant is to make sure you aren't doing anything that involves improper mixing of i/o styles -- e.g. if you're using getline and tell, you should not also be using any i/o function that starts with "sys". Of course, if you were, then I'd expect it to break under Linux as well.)
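
    To illustrate the kind of mixing meant above (a contrived example, not from the poster's code):

        # DON'T do this: getline() reads through the buffered I/O layer,
        # while sysread() bypasses it, so the two maintain different ideas
        # of the current position and tell() stops being trustworthy.
        my $line = $fh->getline;       # buffered: may slurp a whole block
        sysread($fh, my $buf, 512);    # unbuffered: reads at the kernel's offset
        print $fh->tell, "\n";         # buffered offset -- now misleading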

      So, the only thing happening between the "pausing" tell call and the "continuing" tell call is a few fork calls? I remember folks telling me recently that forked processes will share memory with the parent, but what you're seeing still should not happen.

      Actually, there's a ton of code between the pause and resume calls, but nothing that should be relevant to this filehandle. Trivial test scripts run fine, and this doesn't happen on identical or comparable hardware in other shops. The only changing variable in this scenario seems to be the client-specific configuration. That would influence things like the number of open file descriptors, number of open database handles, etc.

      I do use sysread/syswrite in the same code suite for some unrelated socket interaction, but no other process or piece of code has any awareness of this object's filehandle. The object itself does get shared across a bunch of processes during the forks. Still, assuming that the forked processes get CoW'd shared memory and each child gets a dupe of the open file table entry for that FD, and nobody moves the offset pointer, this shouldn't be happening. Aargh.

      Anyway, I'll check for mixing I/O access methods just for sanity's sake. That's definitely a Good Thing to know about even if it doesn't solve this particular issue. Thanks for your time!

      Cheers,
      Ezra

Re: drifting IO::File offset
by dave_the_m (Monsignor) on Sep 29, 2004 at 09:10 UTC
    On Solaris, the child process, when exiting, messes up the tell() position of the file handle it inherited from the parent. One not-very-satisfactory workaround is to make the child exit using POSIX::_exit() rather than the usual exit().
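
    That is (a sketch, with the child's real work elided):

        use POSIX ();

        my $pid = fork;
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {
            # child: do its work, then leave WITHOUT the usual cleanup,
            # so stdio never gets a chance to rewind the inherited handle
            POSIX::_exit(0);
        }
        waitpid($pid, 0);    # parent carries on reading where it left off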

    Dave.

      Hi Dave,

      Why would the POSIX _exit() be unsatisfactory?

      thanks :-)

      Jason L. Froebe

      No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

        Why would the POSIX _exit() be unsatisfactory?
        Because if, for any reason whatsoever, a child process somehow accidentally exits any other way, the parent is screwed. So it's a bit of a fragile mechanism.

        Dave.

        Why would the POSIX _exit() be unsatisfactory?

        One word. Kludge.

        Fixing the underlying OS bug would be satisfactory; everything else is a workaround and therefore not very satisfactory. That said, something that works is by definition better than something that does not.

        cheers

        tachyon

      Hey Dave, just out of curiosity, do you have any documentation on this Solaris 'feature'? Also, do you know offhand if calling POSIX::_exit() will honor DESTROY hints like DBH's InactiveDestroy? I'll test all this later today when I get to work, but any heads-up on specifics would be great. Thanks...

      Ezra

        do you have any documentation on this Solaris 'feature'?
        Well, I know about it because of a Perl bug report from a few weeks ago. I'm not sure whether we got as far as resolving whether it's Perl's or Solaris's fault, and I haven't had time yet to go back and look at it further. Solaris is trying to back out of a stdio buffered read before exiting.
        do you know offhand if calling POSIX::_exit() will honor DESTROY
        _exit() does an immediate process exit without doing any cleanup of any kind, either at the Perl or C level.
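
        A quick self-contained way to see that (my own demo, not from the bug report): run the snippet below and nothing prints; swap POSIX::_exit(0) for exit(0) and both messages appear.

            use POSIX ();

            package Witness;
            sub new     { bless {}, shift }
            sub DESTROY { print "DESTROY ran\n" }

            package main;
            END { print "END ran\n" }

            my $w = Witness->new;
            POSIX::_exit(0);    # skips END blocks and DESTROY entirely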

        Dave.

      Very interesting, and close to what I was expecting. I need to do further testing with this. Thanks...