Wiggins has asked for the wisdom of the Perl Monks concerning the following question:

This is more a contemplation than a search for a solution.

I have created a classic old-style ISAM structure of SHA1 values: a data file of 15+ million ordered SHA1s in fixed-length records (SHA1 + "\n" = 41 bytes each), plus an index file of fixed-length records containing ordered SHA1s with corresponding record positions, pointing to every 1000th record of the data file.

I do a binary search (seek & read) of the index, locate the record pointing to the data file just below the desired SHA1; then seek the data file to the indicated record, and do a sequential search of up to 1001 records until I match or pass the target value. Classic data structure stuff.
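
For reference, a minimal sketch of that lookup in Perl. The 41-byte data record comes from the description above; the index record layout (40-char SHA1 + 10-digit record number + newline = 51 bytes) and the assumption that the index stores data-file record numbers are illustrative only.

    use strict;
    use warnings;
    use Fcntl qw(SEEK_SET);

    my $DATA_REC = 41;   # 40 hex chars + "\n" (from the description above)
    my $IDX_REC  = 51;   # assumed: 40-char SHA1 + 10-digit record number + "\n"

    sub find_sha1 {
        my ($idx_fh, $data_fh, $target) = @_;

        # Binary search (seek & read) of the fixed-length index records,
        # looking for the last index entry whose SHA1 is <= the target.
        my ($lo, $hi) = (0, (-s $idx_fh) / $IDX_REC - 1);
        my $data_rec = 0;                       # data-file record number to start from
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            seek($idx_fh, $mid * $IDX_REC, SEEK_SET) or die "index seek: $!";
            read($idx_fh, my $idx_rec, $IDX_REC) == $IDX_REC or die "short index read";
            my ($sha, $pos) = unpack 'A40 A10', $idx_rec;
            if ($sha le $target) { $data_rec = $pos; $lo = $mid + 1 }
            else                 { $hi = $mid - 1 }
        }

        # Sequential scan of up to 1001 data records from the indexed position.
        seek($data_fh, $data_rec * $DATA_REC, SEEK_SET) or die "data seek: $!";
        for (1 .. 1001) {
            last unless read($data_fh, my $rec, $DATA_REC) == $DATA_REC;
            my $sha = substr $rec, 0, 40;
            return 1 if $sha eq $target;        # exact match
            last    if $sha gt $target;         # passed the target; not present
        }
        return 0;
    }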

During this processing, on rare occasions (about 1 in 1000 searches), I read a record with "$record=<INDX>;" syntax and end up with all the remainder of the file in the scalar! As if the code had forgotten (or undef'ed) the input record separator and 'slurped' the remainder of the file. This became apparent when I found 100s of pages of SHA1 values (1 per line) in the debug log where I had simply printed the $record variable I had read.

--Solution, so far--
This was coded using Perl's buffered IO (open, seek, read). That was not the best choice for doing a random-access binary search of fixed-length records (and there was/is something that breaks occasionally). So I switched the index search to unbuffered "sys[open|seek|read]" and haven't had this slurping problem with the index. The data file is read sequentially, so buffered access would be useful there. But eventually the same slurping happened with the data file, so I modified all the file IO code to use only the 'sys...' calls.
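
For what it's worth, a minimal sketch of the unbuffered version using the sys* calls; the filename and record number are hypothetical, and the 41-byte record size is from above.

    use strict;
    use warnings;
    use Fcntl qw(O_RDONLY SEEK_SET);

    my $REC = 41;                                       # 40 hex chars + "\n"
    sysopen(my $fh, 'sha1.dat', O_RDONLY) or die "open: $!";

    # Unbuffered: sysseek/sysread go straight to the OS and never consult $/.
    my $rec_no = 12_345;                                # hypothetical record to fetch
    defined sysseek($fh, $rec_no * $REC, SEEK_SET) or die "sysseek: $!";
    my $got = sysread($fh, my $record, $REC);
    die "sysread failed or short read" unless defined $got && $got == $REC;

    chomp $record;                                      # $record now holds one SHA1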

Any thoughts on the unexpected IO slurping problem?

It is always better to have seen your target for yourself, rather than depend upon someone else's description.

Replies are listed 'Best First'.
Re: Buffered IO and un-intended slurping
by Marshall (Canon) on Dec 31, 2009 at 23:02 UTC
    I don't know that I can be of much help. I'll give some generic "checklist" advice:

    I've implemented fixed size record code before with standard buffered read/write/seek. There are a couple of ways to go wrong.
    1) Make sure that your "byte math" is right. You didn't say whether you are on Windows or not, but Windows uses CR,LF for \n instead of just LF as on Unix.
    2) On most file systems you need to do an intervening seek or tell operation when switching from read to write (this can flush buffers amongst other things). The normal way is "seek (HANDLE,0,SEEK_CUR);". That seeks 0 bytes from the current position and therefore is a logical "do nothing", but some things happen under the covers (a minimal sketch follows this list). It could be that in your code this intervening seek usually happens by accident, but occasionally it doesn't.
    3) If you are going to use the unbuffered read/writes, use sysseek() instead of seek(). Of course don't mix buffered and unbuffered operations on the same file.
    4) These fixed-record-length files should normally have every byte written in them unless you know that you are creating a "sparse file". It could possibly be that you have accidentally created a sparse file.
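
    For point 2), a minimal sketch of the read-then-write pattern on one fixed-length record; the filename, record numbers, and replacement value are purely illustrative:

        use Fcntl qw(SEEK_SET SEEK_CUR);

        my $REC = 41;
        open(my $fh, '+<', 'records.dat') or die "open: $!";   # hypothetical file
        binmode $fh;                                           # keep the byte math honest on Windows

        seek($fh, 7 * $REC, SEEK_SET) or die "seek: $!";       # go to record 7
        read($fh, my $old, $REC) == $REC or die "short read";  # now positioned at record 8

        # Intervening seek when switching from read to write: seeks 0 bytes
        # from the current position, a logical "do nothing" that lets the
        # buffering layer flush and resynchronise.
        seek($fh, 0, SEEK_CUR) or die "seek: $!";

        # Overwrite record 8 (the record now under the file pointer).
        my $new = sprintf "%-40s\n", 'replacement-value';      # keep it exactly 41 bytes
        print {$fh} $new or die "write: $!";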

    Other things: seek() does return a success/failure status. Normally seek() doesn't fail, but it might be interesting to see if that happens or not. An attempted seek to before the beginning of the file could cause some trouble. Also, it is possible to seek past the EOF. That is completely legal, even on Win XP's NTFS. This is used to create "sparse" files, files that have gaps in them. This kind of file will report its size in ls or dir as the "theoretical max", but "du" in Unix will tell a different story. Maybe sometimes you are seeking out there into "no man's land" and trying to read. That might produce some really strange results.
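
    Along those lines, a small hedge against seeking outside the file: check the computed offset against the actual file size, and check both the seek() status and tell(). (The sub name and record size are illustrative.)

        use Fcntl qw(SEEK_SET);

        my $REC = 41;

        # Seek to a record only after verifying the computed offset lies
        # inside the file, then confirm both the seek() status and tell().
        sub seek_to_record {
            my ($fh, $rec_no) = @_;
            my $size   = -s $fh;
            my $offset = $rec_no * $REC;
            die "offset $offset outside file (size $size)"
                if $offset < 0 or $offset + $REC > $size;
            seek($fh, $offset, SEEK_SET) or die "seek to $offset failed: $!";
            tell($fh) == $offset or die "tell() is ", tell($fh), ", expected $offset";
            return $offset;
        }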

    Of course it would be helpful if you could say what OS and Perl version you are using. You can't post a 15-million-record data file here, but if you could come up with some short gizmo that generates some dummy data and is able to demonstrate the problem, that would be helpful. What you are trying to do should work with buffered I/O and normal seek().

      The missing information:
      1) Linux system, \R = \n
      2) Files are opened O_RDONLY, no writing
      3) All IO in this module is either buffered or all 'sys'; no mixing of the two
      4) I have added code to do a sysseek of (0, SEEK_CUR) and compared the result with the computed offset (see the sketch after this list). It has not yet fired to indicate a mispositioned seek. But I have removed all the buffered-mode code.
      5) The files are continuous; they were created from text databases with shell utilities.
      6) Possibly relevant is that this module is running in multiple processes, potentially at the same time. But simultaneous reads should not be a problem....
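
      The position check from point 4 looks roughly like this ($fh and the $expected offset are whatever the surrounding code computed; this is only a sketch):

          use Fcntl qw(SEEK_CUR);

          # A zero-byte sysseek returns the current position ("0 but true" at
          # offset 0), which can be compared with the computed offset.
          my $actual = sysseek($fh, 0, SEEK_CUR);
          warn "position drift: at $actual, expected $expected\n"
              unless defined $actual && $actual == $expected;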

      It is always better to have seen your target for yourself, rather than depend upon someone else's description.

        6) Possibly relevant is that this module is running in multiple processes, potentially at the same time. But simultaneous reads should not be a problem....

        Oh, to further clarify: checking the seek() return should be done on all the seeks; the "dummy seek to the same position" shouldn't fail. Of course "shouldn't" doesn't mean that it couldn't! Normally tell() should do the same "under the covers" thing and yield the current byte position to check against "legal byte positions". You can put an "or die "xxxx"" on all the seek() and tell() calls. I was mainly concerned that sometimes perhaps you seek past EOF or before BOF, and all sorts of "bad" things could happen, some of them non-obvious.

        With a failure rate of 1 in 1,000 operations, it is seldom enough that there isn't some super-obvious cause, but often enough to be able to re-create the problem in some "reasonable time frame". I presume some hours or even minutes.

        One question I have is: are you able to recreate the problem with only one process running? An overnight run of many thousands of queries from a single process, without errors, might be a clue.

        I have certainly opened the same file from multiple processes and read it sequentially many times... a normal sort of thing to do. I am wondering if somehow on your system the seek()s are causing problems. I suppose it is possible that your code is fine and the OS and its file system occasionally goof.

        I have done what you are trying to do, but the thing that did the seeking, reading, and writing was just a single process.

Re: Buffered IO and un-intended slurping
by BrowserUk (Patriarch) on Jan 01, 2010 at 00:11 UTC

    I don't have a better answer for you than the one you already thought of--$/ getting undef'd--but it strikes me that with < 600MB, you'd get far better performance by slurping the whole file into memory.

    If you loaded it into a scalar and then opened that scalar as a ram-file, you needn't change your existing code, but it would run perhaps two orders of magnitude faster, or more.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      My first thought was "how do I seek() on a scalar?" But you mention "opened that scalar as a ram-file", which seems to imply there is a way to do just that.
      How about a pointer to the method?

      It is always better to have seen your target for yourself, rather than depend upon someone else's description.

        See open for details:

        open my $fhFile, '<', 'theFile' or die $!;
        my $slurpedFile;
        {
            local $/;
            $slurpedFile = <$fhFile>;
        }
        close $fhFile;

        ## Supplying a reference to a scalar as the filename
        ## initiates the opening of a ram file.
        open my $fhRam, '<', \$slurpedFile or die $!;

        ## Now file operations on $fhRam will read from the slurped scalar
        my $firstLine = <$fhRam>;
        ...
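
        And since the in-memory handle behaves like any other file handle, seek() and fixed-length read() keep working unchanged; for example (record size from the original post, record number hypothetical):

            use Fcntl qw(SEEK_SET);

            my $REC = 41;                                 # 40-char SHA1 + "\n"
            # Jump straight to a (hypothetical) record inside the ram file
            seek($fhRam, 1_000_000 * $REC, SEEK_SET) or die "seek: $!";
            read($fhRam, my $record, $REC) == $REC or die "short read";
            print $record;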

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Buffered IO and un-intended slurping
by dk (Chaplain) on Jan 01, 2010 at 11:35 UTC
    $record=<INDX> is an alias for readline, which is buffered (AFAIK at least when $/ is a newline). As others rightly noted, mixing buffered and non-buffered I/O is not a good idea. It is also not clear whether your $/ is a newline or 41; in the latter case it could simply be a perl bug, and all the best practices for locating and reporting it do apply.

    Also, you said that you've converted the code to use sysread, but I can't see how sysread($f, $buf, 41) would slurp anything beyond the 41st byte. I'd use buffered I/O here, however; the same question applies to read($f, $buf, 41) as well.

      There was no mixing of buffered and unbuffered on a file. It all started out buffered; after the lost positioning I converted all the IO calls to unbuffered for that file.

      41 is the record size... 40 bytes for the SHA1 plus 1 byte for the newline: 40+1=41. I never even touch the $/ value in the program.

      And keep in mind that this happened just 1 in 1000 times this module was called.

      It is always better to have seen your target for yourself, rather than depend upon someone else's description.

        If, as you say, you never mix buffered and unbuffered I/O, and never alter $/, then there's something fishy and unexpected. I sincerely doubt that seek() calls, even if done on a buffered stream and even with mixed DOS and Unix newlines, would produce such an effect. I'd investigate further to find what causes the slurping, at least to make clear whether that's a perl bug or not.

        otoh, if you're only interested in a practical solution, just switch to read($f, $buf, 41) instead of readline, which depends on $/.
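
        For example (filename hypothetical, record size from the original post):

            my $REC = 41;
            open(my $f, '<', 'sha1.dat') or die "open: $!";

            # read() takes an explicit length, so it never consults $/ at all.
            while (read($f, my $buf, $REC) == $REC) {
                my $sha1 = substr $buf, 0, 40;
                # ... compare $sha1 against the target here ...
            }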