I have large text files (5 GB to 30 GB) that look like this:
@HWUSI-EAS1734_0032_FC620F7AAXX:5:1:18184:1176#CGATGT/1
GGATTTCTCGTGGANACCATTTGTTGGTCAANNNNNNNNNNGTGTTNGNCTTCANNGNNATTGAAAATGNTCATTCGTGGCTATTTTCGCNNNNNATNNNN
+HWUSI-EAS1734_0032_FC620F7AAXX:5:1:18184:1176#CGATGT/1
gggfggggfgeeecB```^]gffgegadcgBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS1734_0032_FC620F7AAXX:5:1:1934:1185#CGATGT/1
GTCATCCTTAATTANCGTATGTGCTCTTCCTNCNNNNNNNNGCTGCTANTTATTTCTNNGCAGCTTTGCTCTTATTAGTTACGAACATGCCNNNNTANNNN
+HWUSI-EAS1734_0032_FC620F7AAXX:5:1:1934:1185#CGATGT/1
acdad`^ddd^aa^B_\VZZfcfccaffBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
..........
Every 4 lines form a block, and all the blocks have the same layout. All the even-numbered lines are the same length. I need to extract random lines from these files, so I build an index file for each large text file with a script like this:
if (-e "$ARGV[0].idx") {
    open (INDEXFQ1, "$ARGV[0].idx") or die $!;
}
else {
    open (INDEXFQ1, "+>$ARGV[1].idx") or die $!;
    build_index(*FQ1, *INDEXFQ1);
}
The problem is that whenever I print lines with large line numbers, the output is defective. The printing code looks like this:
print OQ10_1 line_with_index(*FQ1, *INDEXFQ1, $line);
No error message is printed, but the lines at large line numbers come out truncated, like this:
741:20058#ATCACG/1
GTTCGTGAGAGCTCTAGGTTGTCGTCTCCCAGTCAACTATGGTCGCTGTAACGCGCTGACTT
41:20058#ATCACG/1
dgggg_ddadbaggedbXdd]^[UVYX]XR_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Can anybody help? Thank you! Sorry, I forgot to include the two subs, build_index and line_with_index. Here they are:
sub build_index {
    my $data_file  = shift;
    my $index_file = shift;
    my $offset     = 0;

    while (<$data_file>) {
        print $index_file pack("N", $offset);
        $offset = tell($data_file);
    }
}

sub line_with_index {
    my $data_file   = shift;
    my $index_file  = shift;
    my $line_number = shift;

    my $size;
    my $i_offset;
    my $entry;
    my $d_offset;

    $size = length(pack("N", 0));
    $i_offset = $size * ($line_number-1);
    seek($index_file, $i_offset, 0) or return;
    read($index_file, $entry, $size);
    $d_offset = unpack("N", $entry);
    seek($data_file, $d_offset, 0);
    return scalar(<$data_file>);
}
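
One likely cause, offered as a sketch rather than a confirmed diagnosis: the "N" template in pack stores an unsigned 32-bit integer, so it cannot hold any byte offset of 4 GiB or more. For files of 5 GB to 30 GB, offsets past that point would be stored incorrectly, and the later seek would land in the middle of some earlier line, which would match the truncated headers shown above. The version below stores 64-bit offsets with the "Q>" template instead. The names build_index64, line_with_index64, $ENTRY_FMT and $ENTRY_SIZE are made up for this illustration, and the code assumes perl 5.10 or later built with 64-bit integer support.

```perl
use strict;
use warnings;

# "Q>" is an unsigned 64-bit big-endian integer.
# Assumption: perl 5.10+ built with 64-bit integer support.
my $ENTRY_FMT  = 'Q>';
my $ENTRY_SIZE = length pack($ENTRY_FMT, 0);    # 8 bytes per index entry

# Write the starting byte offset of every line of $data_fh to $index_fh.
sub build_index64 {
    my ($data_fh, $index_fh) = @_;
    binmode $index_fh;
    seek($data_fh, 0, 0);
    my $offset = 0;
    while (<$data_fh>) {
        print {$index_fh} pack($ENTRY_FMT, $offset);
        $offset = tell($data_fh);
    }
    return;
}

# Return line $line_number (1-based) of $data_fh, or nothing if out of range.
sub line_with_index64 {
    my ($data_fh, $index_fh, $line_number) = @_;
    return if $line_number < 1;
    seek($index_fh, $ENTRY_SIZE * ($line_number - 1), 0) or return;
    my $got = read($index_fh, my $entry, $ENTRY_SIZE);
    return unless defined $got && $got == $ENTRY_SIZE;   # past end of index
    seek($data_fh, unpack($ENTRY_FMT, $entry), 0) or return;
    return scalar <$data_fh>;
}

# Demonstration: a 5 GiB offset does not survive a round trip through "N",
# but it does survive a round trip through "Q>".
my $big = 5 * 1024**3;
printf "N  round trip: %d\n", unpack('N',  pack('N',  $big));
printf "Q> round trip: %d\n", unpack('Q>', pack('Q>', $big));
```

If this is indeed the problem, any .idx file already written with "N" would have to be deleted and rebuilt, because the entry size changes from 4 to 8 bytes. The calling code would stay the same apart from the sub names, for example print OQ10_1 line_with_index64(*FQ1, *INDEXFQ1, $line);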

In reply to index for large text file by cafeblue
