anonymoushydrogen has asked for the wisdom of the Perl Monks concerning the following question:

I have a huge text file of DNA characters (i.e. A, T, C, G) with 21197 lines; each line has 70 characters. My problem is that I have to extract, e.g., all the characters between positions 500 and 1500. For that I made a new file with all the \n removed, then used the substr function to extract the characters between 500 and 1500. But when I try to extract the characters between 1490117..1492312, the program is not able to extract them. Please help me out. My text file consists of 1504987 characters and I have to extract all the characters between 1490117..1492312.

Problem solved with the help of clinton. Many, many thanks for his help.

Replies are listed 'Best First'.
Re: text file problem
by hipowls (Curate) on May 10, 2008 at 21:02 UTC

    I think your problem is

    • you have a large file
    • you have created a new file with line endings removed
    • now you want to read in characters between two positions in the new file
    To do that you can use sysseek and sysread.
    use Fcntl qw(SEEK_SET);

    my $input  = '...';
    my $start  = 1_490_117;
    my $end    = 1_492_312;
    my $length = $end - $start + 1;

    open my $fh, '<', $input or die "Can't open $input: $!";
    sysseek($fh, $start, SEEK_SET) or die "Can't seek to $start in $input: $!";

    my $sequence;
    my $read = sysread $fh, $sequence, $length;
    die "Failed to read $length bytes from $input, got $read" if $length != $read;
    Now $sequence will contain the DNA sequence from 1,490,117 to 1,492,312 including both end points.

    Note: sysread and sysseek use unbuffered IO; don't mix them on the same filehandle with buffered functions such as read, <> or eof.
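
    If the rest of a program already uses buffered IO on the filehandle, the same idea can be sketched with the buffered counterparts seek and read. This is only a minimal sketch: the helper name read_range is hypothetical, and it assumes 0-based, inclusive byte offsets into a file that has already had its newlines removed.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: return the bytes from $start to $end (0-based,
# inclusive) of $path, using buffered seek/read instead of sysseek/sysread.
sub read_range {
    my ($path, $start, $end) = @_;
    my $length = $end - $start + 1;

    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh;    # treat the file as raw bytes
    seek($fh, $start, SEEK_SET) or die "Can't seek to $start in $path: $!";

    my $sequence = '';
    my $read = read($fh, $sequence, $length);
    die "Can't read from $path: $!" unless defined $read;
    die "Wanted $length bytes from $path, got $read" if $read != $length;

    close $fh;
    return $sequence;
}
```

    Because read is buffered, this version can share a filehandle with <> or eof, which the sysread version should not.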

Re: text file problem
by syphilis (Archbishop) on May 10, 2008 at 13:59 UTC
    Hi anonymoushydrogen,

    By my reckoning 21197 x 70 == 1483790. So characters 1490117 .. 1492312 do not exist.

    Cheers,
    Rob
      I assume he's slurping the file, including the \ns. (He is, after all, talking about byte offsets into a file!) Therefore, 21197 * 71 = 1504987, so characters 1490117 .. 1492312 could exist!
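
      The arithmetic in this sub-thread can be sketched as a small position-to-offset mapping. The helper name file_offset is hypothetical, positions are taken to be 0-based, and the sketch assumes every line holds exactly 70 bases followed by a single "\n" (a file with "\r\n" endings would need $EOL_BYTES set to 2).

```perl
use strict;
use warnings;

my $LINE_WIDTH = 70;   # bases per line, as stated in the question
my $EOL_BYTES  = 1;    # assumption: each line ends in a single "\n"

# Hypothetical helper: map a 0-based position in the newline-free sequence
# to a 0-based byte offset in the original file that still has newlines.
sub file_offset {
    my ($pos) = @_;
    return $pos + int($pos / $LINE_WIDTH) * $EOL_BYTES;
}

printf "bytes without newlines: %d\n", 21197 * $LINE_WIDTH;                 # 1483790
printf "bytes with newlines:    %d\n", 21197 * ($LINE_WIDTH + $EOL_BYTES);  # 1504987
printf "base 69 is at byte %d\n", file_offset(69);                          # 69
printf "base 70 is at byte %d\n", file_offset(70);                          # 71
```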
Re: text file problem
by apl (Monsignor) on May 10, 2008 at 15:12 UTC
    Could you provide a sample of the input and expected output, and the code you've written? Then we could make more intelligent suggestions.
Re: text file problem
by Gangabass (Vicar) on May 10, 2008 at 14:16 UTC

    I think either syphilis is right, or you don't need to remove the newline characters.
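
    One way to read the second option is to extract the range straight from the original file and drop the newlines afterwards. The following is a rough sketch, not the poster's code: extract_bases is a hypothetical name, positions are 0-based and inclusive, and it assumes fixed-width lines that each end in a single "\n".

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: return bases $start..$end (0-based, inclusive) from a
# file of fixed-width lines, without building a newline-free copy first.
# Assumes every line except possibly the last holds exactly $width bases
# followed by a single "\n".
sub extract_bases {
    my ($path, $start, $end, $width) = @_;

    # Each complete line before a base adds one newline byte to its offset.
    my $first = $start + int($start / $width);
    my $last  = $end   + int($end   / $width);
    my $want  = $last - $first + 1;

    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh;    # raw bytes, so offsets are not shifted by IO layers
    seek($fh, $first, SEEK_SET) or die "Can't seek to $first in $path: $!";

    my $chunk = '';
    my $read = read($fh, $chunk, $want);
    die "Can't read from $path: $!" unless defined $read;
    die "Wanted $want bytes from $path, got $read" if $read != $want;
    close $fh;

    $chunk =~ tr/\n//d;    # remove the newlines that fell inside the range
    return $chunk;
}
```

    For the file in the question the call would look like extract_bases($file, $start, $end, 70), where $file is whatever path holds the original 70-column data.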