get the line with ith occurrence of '>'

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks. Do you mind me bothering you with another Unix question? I am stuck on the problem of getting the line with the ith occurrence of a symbol, '>' in this case, from a file. I have 3 million '>' in my file. What I want to do is split the file into 8 pieces so that each piece has an equal number of '>'s. (The file contains other things, and I do not know where the '>'s are located.) The file is too large to load into memory. I read something about csplit, but it splits a file whenever it sees a '>'. So I am thinking I could get the line of, say, the 10000th '>' and use it for csplit. Does anyone have a better idea, or know how I can get that line? Any ideas or hints will be highly appreciated. Ginger

Re: get the line with ith occurrence of '>' by waswas-fng (Curate) on Oct 04, 2002 at 16:11 UTC
Can you give us 10 example lines from the file? You can do something like:

    my $filecounter = 0;
    open INPUTFILE, '<', "InputFile" or die "Could not open InputFile: $!\n";
    open OUTPUTFILE, '>', "outputfile.$filecounter" or die "Could not open output: $!\n";
    while (<INPUTFILE>) {
        # start a new output file every 1,000,000 lines
        if ($. % 1000000 == 0) {
            close OUTPUTFILE;
            $filecounter++;
            open OUTPUTFILE, '>', "outputfile.$filecounter" or die "Could not open output: $!\n";
        }
        print OUTPUTFILE $_;
    }

That works if there is one ">" per line, but example data would make it easier to tell how robust the split has to be. Waswas-fng
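If the goal is only the line number of the ith '>' to hand to csplit, as the original question asks, one pass that reads a line at a time is enough. The following is a minimal sketch, not code from the thread: `nth_marker_line` is a hypothetical helper name, and it assumes the file can be opened at all on the system in question.

```perl
use strict;
use warnings;

# Return the number of the line on which the $n-th '>' occurs,
# or nothing (undef in scalar context) if the file has fewer than $n.
# nth_marker_line is a hypothetical name used only for this illustration.
sub nth_marker_line {
    my ($path, $n) = @_;
    open my $in, '<', $path or die "Could not open $path: $!\n";
    my $seen = 0;
    while (my $line = <$in>) {         # only one line is held in memory
        $seen += ($line =~ tr/>//);    # tr with an empty replacement only counts
        if ($seen >= $n) {
            my $lineno = $.;           # line number of the last read
            close $in;
            return $lineno;
        }
    }
    close $in;
    return;
}
```

The returned number could then be given to csplit as a line-number argument; note that csplit copies up to but not including that line, so the line holding the ith '>' starts the next piece.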
Re: get the line with ith occurrence of '>' by Zaxo (Archbishop) on Oct 04, 2002 at 17:34 UTC
Update: Install a kernel with large file support enabled. Untested code:

    open my $infile, '<', '/path/to/basename.ext' or die $!;

    # count the >'s and get the average number for each split file
    # (read line by line; the file does not fit in memory)
    my $total = 0;
    $total += tr/>// while <$infile>;
    my $each = $total / 8;

    # go back to the start and write the first seven split files
    seek $infile, 0, 0 or die $!;
    for my $n (1..7) {
        my $splitcount = 0;
        open my $outfile, '>', "/path/to/basename.split$n.ext" or die $!;
        while (<$infile>) {
            print $outfile $_;
            $splitcount += tr/>//;
            last if $splitcount >= $each;
        }
        close $outfile or die $!;
    }

    # put the rest in split #8
    open my $outfile, '>', "/path/to/basename.split8.ext" or die $!;
    print $outfile $_ while <$infile>;
    close $outfile or die $!;
    close $infile or die $!;

This will leave the first seven with slightly more than the desired average, and the last one short. After Compline, Zaxo
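A variation on the count-then-split idea above, for the case where the number of '>' entries per piece is already known (3 million over 8 pieces would be 375000 each). This is a sketch under assumptions not confirmed in the thread: every entry starts on a line beginning with '>', and `split_records` with its parameters is a hypothetical name. It cuts only at the start of an entry, so no entry is divided between two files.

```perl
use strict;
use warnings;

# Copy lines from the handle $in into files "$prefix.1" .. "$prefix.$pieces",
# starting a new file once the current one holds $per_piece entries.
# The last file takes whatever remains. Returns the number of files written.
# split_records is a hypothetical name used only for this illustration.
sub split_records {
    my ($in, $prefix, $pieces, $per_piece) = @_;
    my $piece   = 1;
    my $records = 0;    # entries written to the current piece so far
    open my $out, '>', "$prefix.$piece"
        or die "Could not open $prefix.$piece: $!\n";
    while (my $line = <$in>) {
        if ($line =~ /^>/) {            # assumed start of a new entry
            if ($records >= $per_piece && $piece < $pieces) {
                close $out or die $!;
                $piece++;
                $records = 0;
                open $out, '>', "$prefix.$piece"
                    or die "Could not open $prefix.$piece: $!\n";
            }
            $records++;
        }
        print {$out} $line;
    }
    close $out or die $!;
    return $piece;
}
```

Because it takes an already opened handle, the same routine would work on STDIN, for example `split_records(\*STDIN, 'basename.split', 8, 375000)`.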
Re: get the line with ith occurrence of '>' by Anonymous Monk on Oct 04, 2002 at 16:43 UTC
One more thing: the operating system is Red Hat Linux with 1 GB of memory, and the file is 2.7 GB.
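With a file over 2 GB, one thing worth checking is whether the perl binary itself was built with large-file support. A minimal sketch using the standard Config module; `has_large_file_support` is a hypothetical helper name, and a positive answer says nothing about the kernel or filesystem.

```perl
use strict;
use warnings;
use Config;

# Report whether this perl was compiled with large-file support.
# has_large_file_support is a hypothetical name used only for this illustration.
sub has_large_file_support {
    my $flag = $Config{uselargefiles};
    return (defined $flag && $flag eq 'define') ? 1 : 0;
}

# lseeksize is the size in bytes of a file offset; 8 means 64-bit offsets.
printf "large file support: %s, lseeksize: %s\n",
    (has_large_file_support() ? 'yes' : 'no'),
    (defined $Config{lseeksize} ? $Config{lseeksize} : 'unknown');
```

The same information is available from the shell with `perl -V:uselargefiles`.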
Re: get the line with ith occurrence of '>' by Anonymous Monk on Oct 04, 2002 at 17:15 UTC
Dear Monk, I tried this program only to open it:

    #!/usr/bin/perl -w
    #
    # test if open can be used on a large file
    # failed to open a 2.7 GB file with 1 GB of memory
    my $file = $ARGV[0];
    print "$file\n";
    # $! (not $?) holds the reason open failed
    open(IN, "<$file") || die "Could not open $file: $!\n";
    close(IN);

The output is that it cannot open the file. But when I tried it with a 126 MB file, it worked fine. The file is even too big for the open system call, which I tried in a C program; it returns a bad file descriptor. Again, it works fine with the 126 MB file. The OS is Red Hat Linux. The file is in ">title data data .." format, 3 million of them. The machine has 1 GB of memory. The file is 2.7 GB. I am desperate now. Thank you for any hint/help. ginger

Edit kudra, 2002-10-04 Replaced br tags with code tags
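If the failure is the 2 GB limit of a perl built without large-file support (an assumption; the thread does not confirm the cause), one possible workaround is to read the data through a pipe instead of opening the file directly, since a pipe has no file offset. The sketch below assumes perl 5.8 or later for the list form of pipe open, and a `cat` on the system that can itself read large files. `open_via_cat` and `count_markers` are hypothetical names.

```perl
use strict;
use warnings;

# Open a read handle on $path by running an external cat, so that perl
# itself only ever sees a pipe. The list form of open avoids the shell.
# open_via_cat is a hypothetical name used only for this illustration.
sub open_via_cat {
    my ($path) = @_;
    open my $in, '-|', 'cat', '--', $path
        or die "Could not start cat for $path: $!\n";
    return $in;
}

# Count every '>' readable from the handle, one line at a time.
sub count_markers {
    my ($in) = @_;
    my $total = 0;
    while (my $line = <$in>) {
        $total += ($line =~ tr/>//);
    }
    return $total;
}
```

The same effect can be had from the shell by piping the file into a script that reads STDIN, for example `cat bigfile | perl script.pl`.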