This is more a contemplation than a search for a solution.

I have created an old classic ISAM structure of SHA1 values. A data file of 15+ Million ordered SHA1s in a file of fixed length records (SHA1\n = 41 bytes each). Then an index file containing ordered SHA1s with corresponding record position (fixed length) pointing to every 1000th record of the data file.

I do a binary search (seek & read) of the index to locate the last index entry at or just below the desired SHA1, then seek the data file to the record that entry points to and do a sequential search of up to 1001 records until I match or pass the target value. Classic data structure stuff.
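For anyone following along, the lookup described above can be sketched roughly as follows. This is not the original code: the index record layout (40 hex digits, a space, a 10-digit zero-padded record number, and a newline) is an assumption, the stride is 10 rather than 1000 to keep the demo small, and `read_record` and `lookup` are hypothetical names. Records are fetched by byte count with `read`, so the sketch does not rely on the input record separator.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);
use File::Temp qw(tempfile);

use constant DATA_LEN  => 41;  # 40 hex digits + "\n", as in the post
use constant INDEX_LEN => 52;  # assumed layout: 40 hex + " " + 10 digits + "\n"
use constant STRIDE    => 10;  # the post indexes every 1000th record

# Build a small sorted data file and a sparse index over it.
my @sha = sort map { sha1_hex("item $_") } 1 .. 95;
my ($dfh, $dname) = tempfile(UNLINK => 1);
my ($ifh, $iname) = tempfile(UNLINK => 1);
binmode $_ for $dfh, $ifh;
for my $i (0 .. $#sha) {
    print  {$dfh} "$sha[$i]\n";
    printf {$ifh} "%s %010d\n", $sha[$i], $i unless $i % STRIDE;
}
close($_) or die "close: $!" for $dfh, $ifh;

# Fetch one fixed-length record by record number (buffered seek/read).
sub read_record {
    my ($fh, $recno, $len) = @_;
    seek($fh, $recno * $len, 0) or die "seek: $!";
    my $got = read($fh, my $buf, $len);
    die "read: $!" unless defined $got;
    return $got == $len ? $buf : undef;   # undef at or past end of file
}

# Binary search the index, then scan up to STRIDE + 1 data records.
sub lookup {
    my ($ifh, $dfh, $target) = @_;
    my ($lo, $hi) = (0, (-s $ifh) / INDEX_LEN - 1);
    my $start = 0;
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($key, $recno) = split ' ', read_record($ifh, $mid, INDEX_LEN);
        if ($key le $target) { $start = $recno + 0; $lo = $mid + 1 }
        else                 { $hi = $mid - 1 }
    }
    for my $recno ($start .. $start + STRIDE) {
        my $rec = read_record($dfh, $recno, DATA_LEN);
        last unless defined $rec;
        my $sha = substr $rec, 0, 40;     # strip "\n" without using chomp
        return $recno if $sha eq $target;
        last          if $sha gt $target; # passed the target: not present
    }
    return;
}

open my $idx, '<:raw', $iname or die "open index: $!";
open my $dat, '<:raw', $dname or die "open data: $!";
my $hit  = lookup($idx, $dat, $sha[37]);
my $miss = lookup($idx, $dat, sha1_hex('not in the file'));
print defined $hit  ? "found at record $hit\n" : "not found\n";
print defined $miss ? "unexpected hit\n"       : "absent value not found\n";
```

Using `substr` instead of `chomp` is deliberate: `chomp` also consults `$/`, so it would misbehave in the same situations where a line read does.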

During this processing, on rare occasions (about 1 in 1000 searches), I read a record with the "$record=<INDX>;" syntax and end up with the entire remainder of the file in the scalar! It is as if the code had forgotten (or undef'ed) the input record separator and 'slurped' the rest of the file. This became apparent when I found 100s of pages of SHA1 values (1 per line) in the debug log, where I had just been printing the $record variable I had read.
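The symptom matches what a line read does when `$/` is undefined, so one hypothesis (not confirmed by anything in the post) is that some code path changes `$/` globally instead of with `local`. A minimal sketch of that effect, using an in-memory file and hypothetical helper names:

```perl
use strict;
use warnings;

# Five fake 41-byte records (40 hex digits + "\n") in an in-memory "file".
my $text = join '', map { sprintf "%040x\n", $_ } 1 .. 5;

# Length of whatever the first <$fh> returns under the current $/.
sub first_read_length {
    open my $fh, '<', \$text or die "open: $!";
    my $record = <$fh>;
    return length $record;
}

my $normal = first_read_length();      # one record

my $slurped;
{
    local $/;                          # undef only inside this block
    $slurped = first_read_length();    # the whole file
}
my $after = first_read_length();       # back to one record

# Hypothetical careless helper: changes $/ without local, so the
# change leaks into every later line read in the program.
sub careless_slurp {
    $/ = undef;
    open my $fh, '<', \$text or die "open: $!";
    return scalar <$fh>;
}
careless_slurp();
my $broken = first_read_length();      # whole file again

$/ = "\n";                             # restore the default
print "normal=$normal slurped=$slurped after=$after broken=$broken\n";
```

If something like `careless_slurp` runs only on a rare path (an error handler, a config reload, a module call), the slurping would also look rare. It would also be consistent with the sys* calls curing it, since `sysread` never looks at `$/`.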

--Solution, so far--
This was coded using Perl's buffered IO (open, seek, read). That was not the best choice for a random-access binary search of fixed-length records (and there was/is something that breaks occasionally). So I switched the index search to unbuffered "sys[open|seek|read]" and haven't had the slurping problem with the index since. The data file is read sequentially, so buffered access would be useful there. But eventually the same slurping happened with the data file, so I modified all the file IO code to use only the 'sys...' calls.
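A rough sketch of what a fixed-length read with the sys* calls can look like, for comparison. The helper name `sys_read_record` and the generated test file are illustrative only, not the code from the post. The loop is there because `sysread` is allowed to return fewer bytes than requested.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY SEEK_SET);
use File::Temp qw(tempfile);

use constant RECLEN => 41;   # 40 hex digits + "\n"

# Demo file: 100 fake fixed-length records.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} sprintf("%040x\n", $_) for 0 .. 99;
close $tmp or die "close: $!";

# Read record number $recno with unbuffered calls; $/ is never consulted.
sub sys_read_record {
    my ($fh, $recno) = @_;
    defined sysseek($fh, $recno * RECLEN, SEEK_SET) or die "sysseek: $!";
    my ($buf, $have) = ('', 0);
    while ($have < RECLEN) {
        # Append to $buf at offset $have until the record is complete.
        my $n = sysread($fh, $buf, RECLEN - $have, $have);
        die "sysread: $!" unless defined $n;
        last if $n == 0;     # end of file
        $have += $n;
    }
    return $have == RECLEN ? $buf : undef;
}

sysopen(my $fh, $path, O_RDONLY) or die "sysopen: $!";
binmode $fh;
my $rec  = sys_read_record($fh, 42);
my $past = sys_read_record($fh, 100);  # beyond the last record
print "record 42: $rec";
print "record 100: ", (defined $past ? $past : "none\n");
```

One caution from the Perl documentation: on any single filehandle, sysseek/sysread should not be mixed with the buffered seek/read/readline calls, because the buffered layer keeps its own position and read-ahead.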

Any thoughts on the unexpected IO slurping problem?

It is always better to have seen your target for yourself, rather than depend upon someone else's description.


In reply to Buffered IO and un-intended slurping by Wiggins
