Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks. Do you mind me bothering you with another Unix question? I am stuck on getting the line that holds the i-th occurrence of a symbol, '>' in this case, in a file. I have 3 million '>' in my file. What I want to do is split the file into 8 pieces so that each piece holds an equal number of '>'s. (The file contains other things, and I do not know where the '>'s are located.) The file is too large to load into memory. I read something about csplit, but it splits a file whenever it sees a '>'. So I am thinking maybe I want to get the line of the 10000th '>' and use it for csplit. Does anyone have a better idea, or know how I can get that line? Any ideas or hints will be highly appreciated. Ginger

Replies are listed 'Best First'.
Re: get the line with ith occurrence of '>'
by waswas-fng (Curate) on Oct 04, 2002 at 16:11 UTC
    Can you give us 10 example lines from the file?

    You can do something like:
    my $filecounter = 0;
    open INPUTFILE, '<', "InputFile" or die $!;
    open OUTPUTFILE, '>', "outputfile.$filecounter" or die $!;
    while (<INPUTFILE>) {
        # start a new output file every 1,000,000 lines
        if ($. % 1000000 == 0) {
            close OUTPUTFILE;
            $filecounter++;
            open OUTPUTFILE, '>', "outputfile.$filecounter" or die $!;
        }
        print OUTPUTFILE $_;
    }

    That works if there is one ">" per line, but example data would make it easier to tell how robust the split has to be.
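When a line can hold more than one '>', counting lines is not enough. Here is a minimal sketch of the idea from the question: find the line that holds the N-th '>' while reading one line at a time, so the file never has to fit in memory. The sub name `line_of_nth_gt` and the script name in the comment are made up for illustration, and 375000 is simply 3 million divided by 8.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the number of the line containing the $n-th '>' in $path,
# or undef if the file holds fewer than $n of them.
sub line_of_nth_gt {
    my ($path, $n) = @_;
    open my $in, '<', $path or die "Could not open $path: $!\n";
    my $seen = 0;
    while (my $line = <$in>) {
        $seen += ($line =~ tr/>//);   # tr/// counts without changing the line
        if ($seen >= $n) {
            my $lineno = $.;          # current line number of $in
            close $in;
            return $lineno;
        }
    }
    close $in;
    return;
}

# Example call (placeholders): perl nth_gt.pl InputFile 375000
if (@ARGV == 2) {
    my $lineno = line_of_nth_gt(@ARGV);
    print defined $lineno ? "$lineno\n" : "fewer than $ARGV[1] '>' found\n";
}
```

The printed number can then be given to csplit as a split point; csplit puts that line at the start of the next piece, so the piece sizes are approximate rather than exact.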

    Waswas-fng
Re: get the line with ith occurrence of '>'
by Zaxo (Archbishop) on Oct 04, 2002 at 17:34 UTC

    Update: Install a kernel with large file support enabled.
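The thread does not establish which layer is refusing the 2.7 GB file. Besides the kernel, one thing that may be worth ruling out is whether the perl binary itself was built with large file support. The sketch below only reads the build configuration through the standard Config module; the sub name `large_file_report` is invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Config;

# Summarize the large-file settings this perl was compiled with.
sub large_file_report {
    my $large  = defined $Config{uselargefiles}
              && $Config{uselargefiles} eq 'define';
    my $offset = $Config{lseeksize};   # size of a file offset, in bytes
    return sprintf "uselargefiles=%s lseeksize=%s",
        ($large ? 'yes' : 'no'),
        (defined $offset ? $offset : 'unknown');
}

print large_file_report(), "\n";
```

An offset size of 4 bytes would suggest a build that cannot address files past 2 GB, whatever the kernel supports.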

    Untested code:

    open my $infile, '<', '/path/to/basename.ext' or die $!;

    # count the >'s and get the average number for each split file
    # (while, not for, so the file is read one line at a time)
    my $total = 0;
    $total += tr/>// while <$infile>;
    my $each = $total / 8;

    # go back to the start and write the first seven split files
    seek $infile, 0, 0 or die $!;
    for my $n (1..7) {
        my $splitcount = 0;
        open my $outfile, '>', "/path/to/basename.split$n.ext" or die $!;
        while (<$infile>) {
            print $outfile $_;
            $splitcount += tr/>//;
            last if $splitcount >= $each;
        }
        close $outfile or die $!;
    }

    # put the rest in split #8
    open my $outfile, '>', '/path/to/basename.split8.ext' or die $!;
    print $outfile $_ while <$infile>;
    close $outfile or die $!;
    close $infile or die $!;

    This will leave the first seven with slightly more than the desired average, and the last one short.
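Counting '>' line by line can cut a record in two, because a split may land right after a ">title" line. Here is a minimal sketch of a record-aware variant. It assumes every record starts with '>' at the beginning of a line, as in the sample format posted in this thread, and a '>' elsewhere on a line is ignored. The sub name `split_by_records` and the "$path.splitN" output names are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split $path into at most $parts files named "$path.split1", "$path.split2", ...
# A new file is started only at a line beginning with '>'.
# Returns the number of records written to each output file.
sub split_by_records {
    my ($path, $parts) = @_;
    die "parts must be at least 1\n" if $parts < 1;
    open my $in, '<', $path or die "Could not open $path: $!\n";

    # Pass 1: count records without holding the file in memory.
    my $total = 0;
    while (my $line = <$in>) {
        $total++ if $line =~ /^>/;
    }
    my $base  = int($total / $parts);   # every file gets at least this many
    my $extra = $total % $parts;        # the first $extra files get one more

    # Pass 2: rewind and copy, switching output files between records.
    seek $in, 0, 0 or die "Could not rewind $path: $!\n";
    my ($part, $in_part, $out, @counts) = (0, 0);
    while (my $line = <$in>) {
        my $is_header = $line =~ /^>/;
        my $quota     = $base + ($part <= $extra ? 1 : 0);
        if (!$out or ($is_header and $in_part >= $quota and $part < $parts)) {
            if ($out) { close $out or die "close failed: $!\n"; }
            $part++;
            $in_part = 0;
            open $out, '>', "$path.split$part"
                or die "Could not create $path.split$part: $!\n";
        }
        if ($is_header) { $in_part++; $counts[$part - 1]++; }
        print {$out} $line;
    }
    if ($out) { close $out or die "close failed: $!\n"; }
    close $in;
    return @counts;
}

# Example call (placeholders): perl split_records.pl /path/to/basename.ext 8
if (@ARGV == 2) {
    my @counts = split_by_records(@ARGV);
    print "records per file: @counts\n";
}
```

Both passes read one line at a time, so memory use does not depend on the file size. Any lines that come before the first '>' line end up in the first output file.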

    After Compline,
    Zaxo

Re: get the line with ith occurrence of '>'
by Anonymous Monk on Oct 04, 2002 at 16:43 UTC
    One more thing: the operating system is Red Hat Linux with 1 GB of memory, and the file is 2.7 GB.
Re: get the line with ith occurrence of '>'
by Anonymous Monk on Oct 04, 2002 at 17:15 UTC
    Dear Monks, I tried this program just to open the file:
    #!/usr/bin/perl -w
    #
    # test if open can be used on a large file
    # failed to open a 2.7 GB file with 1 GB of memory
    my $file = $ARGV[0];
    print "$file\n";
    # $! holds the reason open failed ($? is the status of a child process)
    open(IN, "<$file") || die "Could not open $file: $!\n";
    close(IN);

    The output is that the file cannot be opened. But when I tried it with a 126 MB file, it worked fine. The file is even too big for the open system call, which I tried in a C program; it returns a bad file descriptor. Again, it works fine with the 126 MB file. The OS is Red Hat Linux. The file is in
    ">title
    data
    data
    .."
    format, with 3 million of these records. The machine has 1 GB of memory. The file is 2.7 GB. I am desperate now. Thank you for any hint/help. ginger

    Edit kudra, 2002-10-04 Replaced br tags with code tags