ezra has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I'm running into a bizarre rewinding file handle problem on Solaris 8, running a 32-bit Perl 5.8.4 with PerlIO and large files. I'm reading records from a large input file through an IO::File handle, periodically pausing the load, forking off some work, and then continuing the read.

Totally run-of-the-mill stuff, nothing special at all. I started noticing that I was processing more records than wc -l reports are in the file. None of my test cases can reproduce this, and it didn't happen while walking through the debugger. To make matters worse, the number of records the file "grows" by varies, but tends to be in the range of 500 per 100k. Notes:

If I call tell after each pause and before each continuation, I get:

20040928 13:51:27 startProcessInputFile(PHILLY): pausing; current file offset is 11530000
20040928 13:51:40 continueInputLoad(PHILLY): continuing; current file offset is 11477968
20040928 13:52:04 continueInputLoad(PHILLY): pausing; current file offset is 23064612
20040928 13:52:19 continueInputLoad(PHILLY): continuing; current file offset is 22977360
20040928 13:52:53 continueInputLoad(PHILLY): pausing; current file offset is 34602683
20040928 13:53:07 continueInputLoad(PHILLY): continuing; current file offset is 34593093
20040928 13:53:33 continueInputLoad(PHILLY): pausing; current file offset is 46134989
20040928 13:53:47 continueInputLoad(PHILLY): continuing; current file offset is 46040223
20040928 13:54:11 continueInputLoad(PHILLY): pausing; current file offset is 57668448
20040928 13:54:24 continueInputLoad(PHILLY): continuing; current file offset is 57664020
20040928 13:54:50 continueInputLoad(PHILLY): pausing; current file offset is 69205366
20040928 13:55:06 continueInputLoad(PHILLY): continuing; current file offset is 69145548
20040928 13:55:38 continueInputLoad(PHILLY): pausing; current file offset is 80739978
20040928 13:55:52 continueInputLoad(PHILLY): continuing; current file offset is 80667412
20040928 13:56:18 continueInputLoad(PHILLY): pausing; current file offset is 92271131
20040928 13:56:35 continueInputLoad(PHILLY): continuing; current file offset is 92205201
20040928 13:57:00 continueInputLoad(PHILLY): pausing; current file offset is 103808049
20040928 13:57:16 continueInputLoad(PHILLY): continuing; current file offset is 103807699
20040928 13:57:42 continueInputLoad(PHILLY): pausing; current file offset is 115343814
20040928 13:57:55 continueInputLoad(PHILLY): continuing; current file offset is 115293110
20040928 13:58:30 continueInputLoad(PHILLY): pausing; current file offset is 126879579
20040928 13:58:45 continueInputLoad(PHILLY): continuing; current file offset is 126807910
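
(For reference, the pause/continue instrumentation boils down to something like the sketch below; the real code is the client's, so the input file and surrounding details are hypothetical -- only the tell() calls and the log format mirror the output above.)

    use strict;
    use warnings;
    use IO::File;
    use POSIX qw(strftime);

    my $fh = IO::File->new('< input.dat')    # hypothetical input file
        or die "open: $!";

    sub logmsg {
        my ($func, $msg) = @_;
        printf "%s %s(PHILLY): %s\n",
            strftime('%Y%m%d %H:%M:%S', localtime), $func, $msg;
    }

    # ... read a batch of records via $fh->getline ...
    logmsg('continueInputLoad', 'pausing; current file offset is ' . $fh->tell);
    # ... fork off the workers and wait for them ...
    logmsg('continueInputLoad', 'continuing; current file offset is ' . $fh->tell);
    # The two offsets should match; on this box the second comes back lower.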

The obvious workaround is to stash the tell offset and just seek to it when I'm done, but the client is a large bank, and I'm not up for kludging this issue. Any ideas or insight would be greatly appreciated. Thanks.
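
(For the record, that kludge would amount to something like the sketch below -- fork details elided, untested against the real code:)

    my $pos = $fh->tell;      # stash the offset before pausing/forking
    # ... fork the children, do the other work ...
    $fh->seek($pos, 0)        # 0 == SEEK_SET: force the handle back
        or die "seek: $!";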

UPDATE
Fixed by exporting PERLIO=perlio, which gets Perl past the 255-open-file-descriptor limit of Solaris's 32-bit stdio. A tangential question spawned out of this: I need to do some more forensics to determine what sort of errors to expect when a forked process can't grab STDERR to complain, and why this affected my FH rather than just failing downstream. For now, It Works, so I'm relatively happy. Thanks for the feedback.
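
(For the curious: the environment variable makes Perl use its own perlio layer instead of Solaris's 32-bit stdio, whose FILE structure can't address file descriptors above 255. An untested per-handle equivalent would be to request the layer explicitly at open time:)

    # Ask for the pure-Perl I/O layer on this one handle (perl 5.8+);
    # $file stands in for whatever path the app is loading.
    open my $fh, '<:perlio', $file
        or die "open $file: $!";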

Ezra

Re: drifting IO::File offset
by tachyon (Chancellor) on Sep 29, 2004 at 04:25 UTC

    Given that it is not repeatable on any other OS, or even in a test case, the two obvious fixes are to kludge it by stashing the value of tell() or to drop the use of IO::File. I have never been able to understand why people use a module that seems to me a totally useless use of OO, providing nothing but syntactic sugar. For the interested, the offset drift is unusual in that it varies wildly and is nothing like common buffer sizes +/- a line.

    11530000 -> 11477968 = -52032
    23064612 -> 22977360 = -87252
    34602683 -> 34593093 = -9590
    46134989 -> 46040223 = -94766
    57668448 -> 57664020 = -4428
    69205366 -> 69145548 = -59818
    80739978 -> 80667412 = -72566
    92271131 -> 92205201 = -65930
    103808049 -> 103807699 = -350
    115343814 -> 115293110 = -50704
    126879579 -> 126807910 = -71669

    I think graff's idea that you may simultaneously be mixing calls that use and bypass stdio could be at the heart of the problem.

    cheers

    tachyon

      Thanks for posting the diffs. I calculated them, but just ended up rubbing my eyes, cursing, and getting another cup of coffee instead of doing anything useful with the info.

      Re: using a wrapper module: this code was written and in production under 5.00503 for a while, and back then I was using FileHandle to stash FHs as object members. It's been low on my refactoring list, up until this particular deployment.

Re: drifting IO::File offset
by graff (Chancellor) on Sep 29, 2004 at 04:10 UTC
    I'm afraid I don't have any insight about the problem -- seems like it shouldn't be happening (the most hateful sort of bug).

    So, the only thing happening between the "pausing" tell call and the "continuing" tell call is a few fork calls? I remember folks telling me recently that forked processes will share memory with the parent, but what you're seeing still should not happen.

    If you conclude that forking is somehow triggering this behavior (and you can't convince the bankers to switch to Linux ;), then IMHO it would not be viewed as "kludgey" to comment in your code on the apparent instability of IO::File offset pointers when combined with forking, call tell() before the forking is done (and even close the file), then reopen and seek afterward.
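
    Something like this, say (a sketch; $path and the fork details are placeholders):

        use Fcntl qw(SEEK_SET);

        my $pos = $fh->tell;              # remember where we are
        $fh->close;                       # don't let the children inherit it

        # ... fork and reap the children here ...

        $fh = IO::File->new("< $path")    # reopen the input file...
            or die "reopen $path: $!";
        $fh->seek($pos, SEEK_SET)         # ...and pick up where we left off
            or die "seek: $!";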

    I can understand why you don't post a sample of your code in this case, but something to consider is to create a test-case script that you think might isolate the problem -- remove all "irrelevant" detail, and limit it to open file; while (whatever) {read 10K records; tell; fork...; tell }
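
    Fleshed out, that skeleton might look like this (hypothetical -- adjust the file name, record count, and fork behavior to match the real app):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use IO::File;

        my $fh = IO::File->new('< big_input.dat')    # hypothetical test file
            or die "open: $!";

        while (1) {
            my $n = 0;
            while ($n < 10_000 and defined $fh->getline) { $n++ }
            last unless $n;                          # hit EOF

            my $before = $fh->tell;
            if (my $pid = fork) { waitpid($pid, 0) } # parent waits
            elsif (defined $pid) { exit 0 }          # child just exits
            else { die "fork: $!" }
            my $after = $fh->tell;

            warn "offset drifted by @{[ $after - $before ]}\n"
                if $after != $before;
        }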

    If the most minimal script does not reproduce the problem, start adding in details from the target app. At some point, you'll find the thing in your code that you thought wasn't there or wasn't relevant, etc. (At least, one can hope...)

    (update: the only other issue I could imagine being relevant is to make sure you aren't doing anything that involves improper mixing of i/o styles -- e.g. if you're using getline and tell, you should not also be using any i/o function that starts with "sys". Of course, if you were, then I'd expect it to break under Linux as well.)
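
    To illustrate the kind of mixing meant above (a contrived example, not from the poster's code):

        # DON'T do this: getline() reads through the buffered I/O layer,
        # while sysread() bypasses it, so the two maintain different ideas
        # of the current position and tell() stops being trustworthy.
        my $line = $fh->getline;       # buffered: may slurp a whole block
        sysread($fh, my $buf, 512);    # unbuffered: reads at the kernel's offset
        print $fh->tell, "\n";         # buffered offset -- now misleading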

      So, the only thing happening between the "pausing" tell call and the "continuing" tell call is a few fork calls? I remember folks telling me recently that forked processes will share memory with the parent, but what you're seeing still should not happen.

      Actually, there's a ton of code between the pause and resume calls, but nothing that should be relevant to this filehandle. Trivial test scripts run fine, and this doesn't happen on identical or comparable hardware in other shops. The only changing variable in this scenario seems to be the client-specific configuration. That would influence things like the number of open file descriptors, number of open database handles, etc.

      I do use sysread/syswrite in the same code suite for some unrelated socket interaction, but no other process or piece of code has any awareness of this object's filehandle. The object itself does get shared across a bunch of processes during the forks. Still, assuming that the forked processes get CoW'd shared memory and each child gets a dupe of the open file table entry for that FD, and nobody moves the offset pointer, this shouldn't be happening. Aargh.

      Anyway, I'll check for mixing I/O access methods just for sanity's sake. That's definitely a Good Thing to know about even if it doesn't solve this particular issue. Thanks for your time!

      Cheers,
      Ezra

Re: drifting IO::File offset
by dave_the_m (Monsignor) on Sep 29, 2004 at 09:10 UTC
    On Solaris, the child process, when exiting, messes up the tell() position of the file handle it inherited from the parent. One not-very-satisfactory workaround is to make the child exit using POSIX::_exit() rather than the usual exit().
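
    That is (a sketch, with the child's real work elided):

        use POSIX ();

        my $pid = fork;
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {
            # child: do its work, then leave WITHOUT the usual cleanup,
            # so stdio never gets a chance to rewind the inherited handle
            POSIX::_exit(0);
        }
        waitpid($pid, 0);    # parent carries on reading where it left off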

    Dave.

      Hi Dave,

      Why would the POSIX _exit() be unsatisfactory?

      thanks :-)

      Jason L. Froebe

      No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

        Why would the POSIX _exit() be unsatisfactory?
        Because if, for any reason whatsoever, a child process somehow accidentally exits any other way, the parent is screwed. So it's a bit of a fragile mechanism.

        Dave.

        Why would the POSIX _exit() be unsatisfactory?

        One word. Kludge.

        Fixing the underlying OS bug would be satisfactory; everything else is a workaround and therefore not very satisfactory. That said, something that works is by definition better than something that does not.

        cheers

        tachyon

      Hey Dave, just out of curiosity, do you have any documentation on this Solaris 'feature'? Also, do you know offhand if calling POSIX::_exit() will honor DESTROY hints like DBH's InactiveDestroy? I'll test all this later today when I get to work, but any heads-up on specifics would be great. Thanks...

      Ezra

        do you have any documentation on this Solaris 'feature'?
        Well, I know about it because of a Perl bug report from a few weeks ago. I'm not sure whether we got as far as resolving whether it's Perl's or Solaris's fault, and I haven't had time yet to go back and look at it further. Solaris is trying to back out of a stdio buffered read before exiting.
        do you know offhand if calling POSIX::_exit() will honor DESTROY
        _exit() does an immediate process exit without doing any cleanup of any kind, either at the Perl or C level.
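
        A quick self-contained way to see that (my own demo, not from the bug report): run the snippet below and nothing prints; swap POSIX::_exit(0) for exit(0) and both messages appear.

            use POSIX ();

            package Witness;
            sub new     { bless {}, shift }
            sub DESTROY { print "DESTROY ran\n" }

            package main;
            END { print "END ran\n" }

            my $w = Witness->new;
            POSIX::_exit(0);    # skips END blocks and DESTROY entirely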

        Dave.

      Very interesting, and close to what I was expecting. I need to do further testing with this. Thanks...