Re: Buffered IO and un-intended slurping
by Marshall (Canon) on Dec 31, 2009 at 23:02 UTC
I don't know that I can be of much help. I'll give some generic "checklist" advice:
I've implemented fixed-size record code before with standard buffered read/write/seek. There are a couple of ways to go wrong:
1) Make sure that your "byte math" is right. You didn't say whether you are on Windows or not, but Windows uses CR,LF instead of just LF (as on Unix) for \n.
2) On most file systems you need to do an intervening seek or tell operation when switching from read to write (this can flush buffers among other things). The normal way is "seek (HANDLE,0,SEEK_CUR);". That seeks 0 bytes from the current position and is therefore a logical "do nothing", but some things happen under the covers. It could be that this usually happens by accident, but occasionally it doesn't.
3) If you are going to use the unbuffered read/writes, use sysseek() instead of seek(). Of course don't mix buffered and unbuffered operations on the same file.
4) These fixed-record-length files normally should have every byte written in them unless you know that you are creating a "sparse file". It could possibly be that you have accidentally created a sparse file.
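A minimal sketch of that read-then-write pattern with buffered I/O, assuming a 41-byte fixed record (an assumption based on the rest of the thread); the file name, record number and replacement record are made up for illustration:

use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_CUR);

my $RECLEN = 41;                              # hypothetical: 40-byte SHA1 + "\n"
my $file   = 'records.dat';                   # hypothetical file name
my $n      = 0;                               # hypothetical record number

open my $fh, '+<', $file or die "open $file: $!";
binmode $fh;                                  # keep the byte math honest (no CRLF translation)

# read record number $n
seek($fh, $n * $RECLEN, SEEK_SET) or die "seek: $!";
my $got = read($fh, my $rec, $RECLEN);
die "short read" unless defined $got && $got == $RECLEN;

# we are now positioned at record $n+1; switching from read to write
# requires an intervening seek, even a 0-byte one:
seek($fh, 0, SEEK_CUR) or die "seek: $!";
my $newrec = ('0' x 40) . "\n";               # hypothetical replacement record, exactly $RECLEN bytes
print {$fh} $newrec;                          # overwrites record $n+1 in place
close $fh or die "close: $!";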
Other things: seek() does return a status of worked or didn't work. Normally seek() doesn't fail, but it might be interesting to see if that happens or not. An attempted seek to before the beginning of the file could cause some trouble. Also, it is possible to seek past the EOF. That is completely legal even on Win XP's NTFS. This is used to create "sparse" files, files that have gaps in them. This kind of file will report its size in ls or dir as the "theoretical max", but "du" in Unix will tell a different story. Maybe sometimes you are seeking out there into "no man's land" and trying to read. That might produce some really strange results.
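For instance, a bounds check before each positioned read would catch a seek into that territory (this assumes the $fh, $RECLEN and record number $n from the sketch above):

my $size   = -s $fh;                          # size the OS reports for the file
my $offset = $n * $RECLEN;
die "negative offset $offset" if $offset < 0;
die "record $n starts at byte $offset, past EOF ($size bytes)"
    if $offset + $RECLEN > $size;
seek($fh, $offset, SEEK_SET) or die "seek to $offset failed: $!";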
Of course it would be helpful if you could say which OS and Perl version you are using. You can't post some 15MB data file here, but if you could come up with some short gizmo that generates some dummy data and is able to demonstrate the problem, that would be helpful (a possible starting point is sketched below). What you are trying to do should work with buffered I/O and normal seek().
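For the "short gizmo", something along these lines would generate a dummy file of fixed-size records to experiment with (the 41-byte layout and the file name dummy.idx are assumptions, not from the original post):

#!/usr/bin/perl
use strict;
use warnings;

my $n = shift || 100_000;                     # number of records to write
open my $out, '>', 'dummy.idx' or die "open dummy.idx: $!";
binmode $out;
printf {$out} "%040x\n", $_ for 1 .. $n;      # 40 hex digits + "\n" = 41 bytes per record
close $out or die "close: $!";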
|
The missing information:
1) Linux system, \R = \n
2) Files are opened O_RDONLY, no writing
3) All IO in this module is either buffered or all 'sys'; no mixing of the two
4) I have added code to do a sysseek to 0, SEEK_CUR and compared the result with the computed offset (see the sketch after this list). That check has not yet fired to indicate a failed seek. But I have since removed all the buffered-mode code.
5) The files are continuous. They were created from text databases with shell utilities.
6) Possibly relevant is that this module is running in multiple processes, potentially at the same time. But simultaneous reads should not be a problem....
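For reference, the cross-check mentioned in item 4 looks roughly like this in the 'sys' flavour (with $fh being the open handle and $expected whatever offset the module computed):

use Fcntl qw(SEEK_CUR);

# a 0-byte sysseek moves nothing but returns the OS's idea of the current position
my $actual = sysseek($fh, 0, SEEK_CUR);
defined $actual or die "sysseek: $!";
warn "position drift: expected $expected, OS says $actual"
    if $actual != $expected;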
It is always better to have seen your target for yourself, rather than depend upon someone else's description.
|
6) Possibly relevant is that this module is running in multiple processes, potentially at the same time. But simultaneous reads should not be a problem....
Oh, to further clarify: checking the seek() return should be done on all the seeks; the "dummy seek to the same position" shouldn't fail. Of course "shouldn't" doesn't mean that it couldn't! Normally tell() should do the same "under the covers" thing and yield the current byte position, which you can check against the "legal byte positions". You can put an or die "xxxx" after every seek() and tell(). I was mainly concerned that sometimes you perhaps seek past EOF or before BOF, and all sorts of "bad" things could happen, some of them non-obvious.
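In other words, something like this around every positioning call (a sketch only; $offset stands for whatever byte position you computed):

die "offset $offset is before BOF"   if $offset < 0;
die "offset $offset is past EOF"     if $offset > -s $fh;
seek($fh, $offset, 0)                or die "seek to $offset failed: $!";
my $pos = tell $fh;
die "tell failed after seek: $!"     if $pos < 0;
die "tell says $pos, wanted $offset" if $pos != $offset;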
With a failure in 1 of 1,000 operations, this happens seldom enough that there isn't some super-obvious thing wrong, but often enough that you should be able to re-create the problem in some "reasonable time frame", I presume some hours or even minutes.
One question I have is: are you able to recreate the problem with only one process running? An overnight run of many thousands of queries without errors from a single process might be a clue.
I have certainly opened the same file from multiple processes and done sequential reads many times... a normal sort of thing to do. I am wondering if somehow on your system the seek()s are causing problems. I suppose it is possible that your code is fine and the OS and its file system occasionally goofs.
I have done what you are trying to do, but my thing that did the seeking, reading, writing was just a single process.
Re: Buffered IO and un-intended slurping
by BrowserUk (Patriarch) on Jan 01, 2010 at 00:11 UTC
I don't have a better answer for you than the one you already thought of--$/ getting undef'd--but it strikes me that with < 600MB, you'd get far better performance by slurping the whole file into memory.
If you loaded it into a scalar and then opened that scalar as a ram-file, you needn't change your existing code, but it would run perhaps two orders of magnitude faster, or more.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
|
My first thought was "how do I seek() on a scalar?" But you mention "opened that scalar as a ram-file", which seems to imply there is a way to do just that.
How about a pointer to the method?
It is always better to have seen your target for yourself, rather than depend upon someone else's description.
|
open my $fhFile, '<', 'theFile' or die $!;
my $slurpedFile;
{
    local $/;
    $slurpedFile = <$fhFile>;
}
close $fhFile;
## Supplying a reference to a scalar as the filename
## initiates the opening of a ram file.
open my $fhRam, '<', \$slurpedFile or die $!;
## Now file operations on $fhRam will read from the slurped scalar
my $firstLine = <$fhRam>;
...
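seek() works on the in-memory handle just like on a disk file, so the existing record arithmetic carries over; for instance, with 41-byte records and a made-up record number:

my $recno = 12_345;                           ## hypothetical record number
seek($fhRam, $recno * 41, 0) or die $!;
my $got = read($fhRam, my $record, 41);
die "short read" unless defined $got && $got == 41;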
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Buffered IO and un-intended slurping
by dk (Chaplain) on Jan 01, 2010 at 11:35 UTC
$record=<INDX> is an alias for readline, which is buffered (AFAIK, at least when $/ is a newline). As others rightly noted, mixing buffered and non-buffered I/O is not a good idea. It is also not clear whether your $/ is a newline or 41; in the latter case it could simply be a perl bug, and all the best practices for locating and reporting it apply.
Also, you said that you've converted the code to use sysread, but I can't think of how sysread($f, $buf, 41) would slurp anything beyond the 41st byte. I'd use buffered I/O here, however, but the same question goes for read($f, $buf, 41) as well.
|
There was no mixing of buffered and unbuffered on a file. It all started out buffered; after the lost positioning I converted all the IO calls to unbuffered for that file.
41 is the record size... 40 bytes for the SHA1, 1 byte for the newline... 40+1=41. I never even touch the $/ value in the program.
And keep in mind that this happened just 1 in 1,000 times this module was called.
It is always better to have seen your target for yourself, rather than depend upon someone else's description.
|
If, as you say, you never mix buffered and unbuffered I/O, and never alter $/, then there's something fishy and unexpected. I sincerely doubt that seek() calls, even if done on a buffered stream and even with mixed DOS and Unix newlines, would produce such an effect. I'd investigate further to find what causes the slurping, at least to make it clear whether that's a perl bug or not.
OTOH, if you're only interested in a practical solution, just switch to read($f, $buf, 41) instead of readline, which depends on $/.
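A minimal sketch of that switch, assuming the 41-byte records and a computed record number $recno:

seek($f, $recno * 41, 0) or die "seek: $!";
my $got = read($f, my $buf, 41);              # length-driven, so $/ never enters into it
die "read failed: $!"         unless defined $got;
die "short read ($got bytes)" unless $got == 41;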