gio001 has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I am encountering some issues trying to do the following in ksh88, so I want to explore a smart Perl approach, but my knowledge is limited. Can you help?
I have 2 files. The first (indexFile1) contains lines with the start offset and length for each record inside the second file, so just 2 numbers separated by a space. The second file can be very large; each record's start offset and length is defined by the corresponding entry in indexFile1. Since there are no record separators, wc -l returns 0 for the second file, no matter how large it actually is.
I want to gather the records from the large file one at a time and write them out to a new file individually. What is the best way to approach this processing?
I suspect I will have trouble in ksh88 reading the whole large file into a variable (using awk) and then using a cut command on the variable contents to collect each record, in the form:
FileContent=$(awk '{print $0}' largeFile2)   # this is where I think I have a problem

# LINE contains start and offset identifying each record in largeFile2
while read LINE; do
    pass=1
    for results in $LINE; do
        if [[ $pass -eq 1 ]]; then
            val1=$results
            pass=2
        else
            val2=$results
        fi
    done
    (( from = $val1 + 1 ))
    (( to = $val1 + $val2 ))
    newOut=$(echo $FileContent | cut -c $from-$to)
    echo $newOut >> newfile
done < indexFile1
I have it working OK for small sizes of largeFile2, but I can see a problem when the size of file2 gets large. I hope you can give me some suggestions on how to do this better. Thanks!

Replies are listed 'Best First'.
Re: Read offset into other files
by GrandFather (Saint) on Oct 18, 2008 at 05:07 UTC

    Yup, your problem starts right at the first line as you suspect. What you have is not Perl even though you state you "want to explore a smart perl approach". Here is a sample Perl program that demonstrates the techniques you need to implement your task:

    use warnings;
    use strict;
    use Data::Dump::Streamer;

    my $str = <<SNIPS;
    Hello all, I am encountering some issue in trying to do the following in ksh88,
    therefore I want to explore a smart perl approach, but my knowledge is limited, can you help?
    I have 2 files, the first (indexFile1) contains lines with start
    offset and length for each record inside the second file, so just 2 numbers
    separated by a space.
    ...
    SNIPS

    open my $recData, '<', \$str;

    while (<DATA>) {
        chomp;
        my ($start, $len) = split;
        my $segment;
        seek $recData, $start, 0;
        read $recData, $segment, $len;
        print "$segment\n";
    }

    __DATA__
    16 12
    80 9
    315 10

    Prints:

    encountering
    therefore
    separated

    Note, to avoid needing external files I've used a string to provide one input "file" and provided the other as data following the body of the script. Note too that I tested this under Windows so you may get different output if your OS uses different line end characters than Windows does.
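    The same seek/read technique can be applied to real files on disk. Below is a minimal, self-contained sketch: the demo file names and contents are made up so the script runs as-is; with real data you would drop the setup section and open indexFile1/largeFile2 directly.

```perl
use strict;
use warnings;

# Setup: create tiny demo files (hypothetical names and contents)
# so the sketch is runnable; skip this with your real files.
open my $fh, '>', 'demo_big.dat' or die "write demo_big.dat: $!";
print {$fh} 'HelloWorldPerlMonks';    # 19 bytes, no record separators
close $fh;
open $fh, '>', 'demo_index.dat' or die "write demo_index.dat: $!";
print {$fh} "0 5\n5 5\n10 9\n";       # "start length" pairs, one per record
close $fh;

open my $idx, '<', 'demo_index.dat' or die "open index: $!";
open my $big, '<', 'demo_big.dat'   or die "open big: $!";
open my $out, '>', 'demo_out.dat'   or die "open out: $!";
binmode $big;                         # offsets are bytes, not characters

while (<$idx>) {
    my ($start, $len) = split;        # parse the two numbers
    seek $big, $start, 0;             # jump to the record's start offset
    read $big, my $record, $len;      # pull exactly $len bytes
    print {$out} "$record\n";         # one record per output line
}
close $out;
```

    Since each record's offset is taken from the index file, this works even if the index entries are out of order or overlap; it never needs the whole of largeFile2 in memory.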


    Perl reduces RSI - it saves typing
Re: Read offset into other files
by BrowserUk (Patriarch) on Oct 18, 2008 at 07:38 UTC

    As a one-liner:

    perl -s -nle"BEGIN{open BIG};($undef,$l)=split;read BIG,$data,$l;print $data" -- -BIG=bigfile.dat index.dat > outfile.dat

    All on one line. Switch the "s to 's on *nix.

    Update: A slightly shorter version

    perl -sple"BEGIN{open BIG};($undef,$l)=split;read(BIG,$_,$l)" -- -BIG=bigfile.dat index.dat >outfile.dat

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      It is amazing how simple things become if you know how to use the right tools!
      Will this one liner process all the entries in the index file gathering and putting out strings from the bigfile without a need for a loop or a while?
      Please help me understand. Also, there is no use of the offset value; am I reading this right? Will the read position keep moving automatically inside the bigfile, forward from the last read?
      Thanks again.
        Will this one liner process all the entries in the index file gathering and putting out strings from the bigfile without a need for a loop or a while?

        Yes. The loop is supplied by the -p option on the command line. This tells perl to read the file given as a command line argument (index.dat above) line by line into $_ and to print $_ to stdout after each pass.

        The code in the -e takes the contents of $_, splits it to extract the length, reads that number of bytes from the filehandle BIG, overwriting $_. This is then (implicitly) printed with a newline due to the -l switch, and redirected to the output file by the command line processor.

        The -s switch tells perl to parse the command line for options in the form -XXX=yyy. This creates a variable named XXX with the value yyy.

        The BEGIN{} block uses a one-arg open to open the file for input, using the value of $BIG as the filename and storing the filehandle in the glob *BIG.

        The -- is required to allow Perl to differentiate between the options intended for use by perl itself, and those (-BIG=bigfile.dat in this case) intended for use by the "script" (-e"...").

        See perlrun for a better explanation of all these switches than I can give.

        Also there is no use of the offset value, am I reading this right, will the read head keep moving automatically inside the bigfile, forward from the last read?

        Exactly. You are essentially just reading the file sequentially. The only extra information you need, is how many bytes constitute each record.
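        To make the implicit machinery visible, here is a hypothetical longhand expansion of the one-liner, roughly what `perl -MO=Deparse` would show: the while loop comes from -p, the chomp and restored "\n" from -l, and $BIG from the -s switch. Demo files are created inline so the sketch runs as-is, a lexical filehandle replaces the bareword BIG, and output goes to a file rather than stdout so the shell-redirection step appears in the code.

```perl
use strict;
use warnings;

# Setup: hypothetical demo files standing in for bigfile.dat/index.dat.
open my $fh, '>', 'demo_big2.dat' or die $!;
print {$fh} 'firstsecondthird';       # three records: 5, 6 and 5 bytes
close $fh;
open $fh, '>', 'demo_idx2.dat' or die $!;
print {$fh} "0 5\n5 6\n11 5\n";       # offsets present but never used
close $fh;

@ARGV = ('demo_idx2.dat');            # what index.dat supplies on the CLI
my $BIG = 'demo_big2.dat';            # what -s makes of -BIG=bigfile.dat
open my $big, '<', $BIG or die $!;    # the BEGIN{open BIG} step
open my $out, '>', 'demo_out2.dat' or die $!;

while (defined($_ = <>)) {            # implicit loop supplied by -p
    chomp;                            # -l strips the input record separator
    my (undef, $len) = split;         # the offset field is simply ignored
    read $big, $_, $len;              # sequential read overwrites $_
    print {$out} "$_\n";              # -p prints $_; -l restores the "\n"
}
close $out;
```

        Because the reads never seek, each read picks up exactly where the previous one left off, which is why the offsets can be ignored as long as the records are contiguous and listed in order.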

        Perl may have some weird nooks and crannies, but they're all there for very good reasons :)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.