Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have to process a very large file (multi-GB) and want to replace certain tags with linefeeds (\n) and tabs (\t). The file does not contain any linefeeds yet. The problem is, I get an out-of-memory error with:

cat file | perl -pe "s/RowwoR/\n/g" > file2

sed also gives me an out-of-memory error.

My machine has 2 GB of RAM and is running Perl 5.6. Any ideas?

Any feedback highly appreciated!
Martin


Re: search/replace very large file w/o linebreaks
by davido (Cardinal) on Jan 08, 2004 at 16:46 UTC
    Set the input record separator to be equal to one of the tags, and read the file in a bit at a time that way.

    {
        local $/ = "<tag1>";
        while ( my $paragraph = <FILE> ) {
            $paragraph =~ s/<tag1>/\n/g;
            $paragraph =~ s/<tag2>/\t/g;
            print OUTFILE $paragraph;
        }
    }
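    Filled in for the OP's data, a self-contained version might look like this sketch (the file names and the TabTag name are assumptions; only RowwoR is named in the question):

        use strict;
        use warnings;

        # 'file' and 'file2' are taken from the OP's one-liner; TabTag is a
        # hypothetical stand-in for whichever tag should become a tab.
        open my $in,  '<', 'file'  or die "open file: $!";
        open my $out, '>', 'file2' or die "open file2: $!";

        $/ = 'RowwoR';                    # records end at the tag, not at \n
        while ( my $record = <$in> ) {
            $record =~ s/RowwoR/\n/;      # at most one: the record terminator
            $record =~ s/TabTag/\t/g;     # interior tags never span records
            print {$out} $record;
        }
        close $out or die "close file2: $!";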


    Dave

Re: search/replace very large file w/o linebreaks
by dws (Chancellor) on Jan 08, 2004 at 17:58 UTC
    I have to process a very large file (multi-GB) ...

    The node "Matching in huge files" demonstrates a sliding-buffer technique for matching patterns that might span block boundaries in large files. You could adapt it for doing search/replace.
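    A minimal sketch of that technique, keeping a tail of $MAXTAG bytes in the buffer so a tag split across reads is still matched whole (both constants are my assumptions, not from the node):

        use strict;
        use warnings;

        my $BLOCK  = 65536;    # bytes per read
        my $MAXTAG = 16;       # longer than any tag being replaced
        my $buf    = '';
        while ( read(STDIN, my $chunk, $BLOCK) ) {
            $buf .= $chunk;
            $buf =~ s/RowwoR/\n/g;
            # keep the last MAXTAG bytes: they may hold the start of a tag
            # that was split across reads
            print substr($buf, 0, -$MAXTAG, '') if length($buf) > $MAXTAG;
        }
        print $buf;            # flush the tail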

Re: search/replace very large file w/o linebreaks
by borisz (Canon) on Jan 08, 2004 at 16:46 UTC
    I think you get your 'out of memory' errors because your file has no \n in it, while Perl's end-of-line is \n by default. So Perl tries to slurp the whole multi-GB file as one line and gets the error. A naive try is this:
    perl -pe 'BEGIN { $/ = "RowwoR" } s/RowwoR/\n/' <infile >outfile
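    The same one-liner extends to the tab replacement the OP mentioned; TabTag below is a placeholder, since only RowwoR is named in the question. Because records end at RowwoR, the interior tags never span a read:

        perl -pe 'BEGIN { $/ = "RowwoR" } s/RowwoR/\n/; s/TabTag/\t/g' <infile >outfile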
Re: search/replace very large file w/o linebreaks
by allolex (Curate) on Jan 08, 2004 at 16:55 UTC

    One thing you could try is to replace the string Perl uses to identify the ends of lines (input record separator) with one of the tags in your file.

    {
        local $/ = '<tag>';
        while (<>) {
            # do stuff
        }
    }

    Or you could just feed a while loop chunks of data at a time using a trick described in perldoc perlvar:

    Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:

        local $/ = \32768; # or \"32768", or \$var_containing_32768
        open my $fh, $myfile or die $!;
        local $_ = <$fh>;

    will read a record of no more than 32768 bytes from $fh. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read. If a record is larger than the record size you've set, you'll get the record back in pieces.

    Update: Commented out irrelevant text from perlvar.

    --
    Allolex

      If one of the tags overlaps a record boundary with the fixed-length read, you can do the substitution on two blocks at a time. Pseudocode:
      use constant BLOCKLENGTH => 32768;   # > length of any tag
      $/ = \BLOCKLENGTH;                   # fixed-length record reads
      my $buffer = '';
      while (!eof()) {
          $buffer .= <>;
          $buffer =~ s/tag1/\n/g;
          $buffer =~ s/tag2/\t/g;
          # leave BLOCKLENGTH chars in $buffer,
          # print however much comes before that
          print substr($buffer, 0, -BLOCKLENGTH, '');
      }
      # print whatever's left
      print $buffer;
      Alternatively, just do a while (<>) loop with the substitution and print inside, but run it twice with two different record lengths, chosen so that no two multiples of them (below the file size) lie closer together than the length of the longest tag. I don't quite know how to go about calculating such numbers, though. Any number theorists here?
        If you're reading in chunks, you might as well use sysread:
        sub BLOCKLENGTH () { 1 << 12 }   # TIMTOWTDI =)
        $_ = '';
        while (sysread STDIN, $_, BLOCKLENGTH, length) {
            s/tag1/\n/g;   # you know the substitutions
            s/tag2/\t/g;
            # the fourth argument to substr will replace 0 .. -BLOCKLENGTH
            syswrite STDOUT, substr($_, 0, -BLOCKLENGTH, '');
        }
        syswrite STDOUT, $_;
        It is more suited for the task, takes a bit less memory, and might even be faster if your stdio is stoned.

        Personally, I think that

            perl -ne 'BEGIN { $\ = "\n"; $/ = "tag" } chomp; s/tag2/\t/g; print' < infile > outfile

        is the nicest way.
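        Applied to the OP's actual tag, that adaptation (mine, untested) would be:

            perl -ne 'BEGIN { $\ = "\n"; $/ = "RowwoR" } chomp; print' < infile > outfile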

         

        Update: I thought some explanation was appropriate.

        The notion of a line is pretty flexible, and has to be (in computers generally and in Perl specifically). A line traditionally ended in a carriage return and a line feed, in one order or the other; Windows still uses that, classic MacOS uses only CR, and Unix-like systems use only LF. The one-byte solution is somewhat simpler, but since you need to handle a two-byte terminator anyway, why not support anything?

        Enter the concept of a record.

        Treating a line as a record, with either a fixed length ($/ = \123) or a terminating string ($/ = "\n" gives a record that is also a line on your native system), adds the flexibility to do what you wanted quite easily: you're translating a record format that ends in a certain string into one that ends with newlines. ($\ in the one-liner above is the output record separator, BTW.)

        -nuffin
        zz zZ Z Z #!perl
Re: search/replace very large file w/o linebreaks
by Roy Johnson (Monsignor) on Jan 08, 2004 at 16:53 UTC
    $/ = 'RowwoR';
    while (<>) {
        chomp;
        print $_, "\n";
    }
    Or as a one-liner:

        perl -pe "BEGIN{$/='RowwoR'}; s/RowwoR/\n/" file > file2

    The PerlMonk tr/// Advocate