resub

Category:	Utility Scripts
Author/Contact Info	David Graff <graff@ldc.upenn.edu>
Description:	Do any number of global regex substitutions uniformly over any number of text files, and correctly handle all character encodings supported by Perl 5.8.0, with optional conversion of data from one encoding to another. (update: fixed checks for valid regexes)
#!/usr/bin/perl =head1 NAME resub -- applies regex substitutions on selected files =head1 SYNOPSIS resub -h (or resub -?) : list supported character encodings and exi +t resub [-r enc] [-w enc] [-o .ext] [-s '/old/new/[ise]'] [-f regex.li +st] file.name[s] cat filename.list \| resub [options] stdin (or resub [options] - ) =head1 SAMPLE USAGE grep -l badword .txt \| resub -s '/badword/good word/' - # fix a s +pelling error resub -f edit.list .txt # fix lots of things at once resub -s /Bill/George/ -o .updated crank.letter # keeps crank.letter, creates crank.letter.updated =head1 DESCRIPTION When a particular string replacement (or set of replacements) needs to be applied to one or more files, resub makes it possible to apply all the replacements with a single command line. The specification of the editing to be done can be provided as one or more "-s /old/new/" options on the command line, and/or listed in a separate text file whose name is given with the "-f" option. Both "-s" and "-f" edit specs can be used in combination. All edits will be applied globally to the content of each text file, in the order given. Resub requires Perl 5.8.0 or later, in order to use the Encode module and the internal Unicode representation of characters. Data files and edit.list files can use any character encoding supported by Encode (the complete list of supported character sets is listed with the "-h" option). Also, the edited data can be written in any encoding (so character code conversion is simple). The default encoding for reading and writing text is "utf8" (which means that plain ASCII text will be treated correctly during default operation). Alternate encodings for input or output can be selected with the "-r" and "-w" options, respectively. Any regex-list file(s) given with the "-f" option must use the same encoding as the input text file(s). Specify regex instructions just as you would in Perl, using curly braces ( {pattern}{replacement} ) or an arbitrary delimiter character (/p/r/, %p%r%, etc). The "i", "e" and "s" flags can be added after the final delimiter. Note that the "g" flag is always attached to the expression before applying it to the data, to do global replacements. When placing regex instructions in a file for use with "-f", put one regex per line (they will be applied in top-down order). Names (or wildcard) for one or more files to edit can be given on the command line; or, instead of given any file name as a command line arg, use "-" or "stdin", and resub will expect to read a list of file names from STDIN. By default, resub will first create a temporary file to hold the edited version of each text file, then (if there are no errors) it will rename the temporary to the original file name, so that the original file content is overwritten. You can preserve the original file(s) by using the "-o .ext" option: in this case, whatever string follows "-o" will be appended to each original input file name, and the resulting name will be used to receive the edited data. =head1 CHARACTER ENCODING ISSUES If "-r" is used to specify a given encoding for the input text file(s), resub will use Encode to convert the data, including regex-list files, if any, to unicode. Whether the input data is already utf8 or some other encoding, it is possible that a data file or regex-list may contain byte sequences that are invalid or undefined for the selected input encoding. When reading a regex-list file, resub will report and skip any single regex that contains bad character data (but other regexes will be used). When reading from a text file, encountering bad character data will cause the editing of that file to be aborted, and the original file will remain unaltered, and resub will report the problem (but then go on to edit other files, if any). If some non-unicode encoding has been selected for output with the "-w" option, resub will convert the data accordingly. Again, it is possible that the data may contain characters that cannot be represented in the character set chosen for output. As with input, the editing of the file is aborted, and the original file remains unaltered, the problem is reported (and other files, if any, are handled). =cut require 5.008_000; use strict; use Encode; # we will read whole files into memory, so let's be careful about file + size: use constant MAX_FILE_SIZE => 2 ** 26; # that's about 67 MB (okay for + most unix boxes) $0 =~ s%^./%%; my $Usage = "Usage:\n\n". " $0 [-r enc] [-w enc] [-o .ext] [-s '/old/new/[i]'] [-f edspec.fi +le] file.name[]\n\n". " $0 -h or -? : list available character encodings for reading (- +r) and writing (-w)\n". " default -r = utf8; default -w = same as -r (that is, -r by it +self means '-rw')\n\n". " Does in-place edit of text file(s) by applying regex substitutio +n(s)\n". " or, using '-o .ext', writes edited output to 'text.file(s).ext' +instead\n\n". " Multiple '-s ...' and/or '-f ...' args can be combined in one co +mmand\n". " ALL substitutions are applied globally, in order, throughout eac +h file\n\n". " Use '-' or 'stdin' as the file.name to read list of file names f +rom stdin\n\n"; my $newext = ".resub.$ENV{USER}.$$"; # default extension for temp +file my @enclist = Encode->encodings(":all"); # list of all supported chara +cter sets my (@subs,@regexs,@filelist,@regfile); my ($renc,$wenc) = ("",""); # The ordering of "-s /old/new/" and "-f edspec.file" options may be # significant, so we will do our own command-line option processing, # rather than the usual "Getopt" module. listEncodings() if ( grep /^-+[h\?]/, @ARGV ); # this will exit when +done while ( @ARGV > 1 && $ARGV[0] =~ /^-([rwosf])$/ ) { my $argtyp = $1; if ( $argtyp eq "s" ) { if ( $ARGV[1] =~ /^\{.?\}\s\{.\}[ise]$/ or # check for regex +delimiters ( $ARGV[1] =~ /^(.)(.)\1(.)\1[ise]$/ && (length($2) or len +gth($3)))) { push( @regexs, $ARGV[1] ); } else { die "Invalid regex on command line: $ARGV[1]\n"; } } elsif ( $argtyp eq "f" ) { open( INP, "$ARGV[1]" ) \|\| die "Unable to open editspec.file $ARGV +[1]\n"; while (<INP>) { s/\#.//; # delete commentary, if any next if ( /^\s$/ ); # skip empty lines s/^\s//; # delete leading and trailing whitespace, i +f any s/\s$//; if ( /^\{.?\}\s\{.\}[ise]$/ or # check for suitable re +gex delimiters ( /^(.)(.)\1(.)\1[ise]$/ && (length($2) or length($3)))) { push( @regexs, $_ ); $regfile[$#regexs] = "$ARGV[1] at line $."; } else { warn "Ignored invalid regex in $ARGV[1] at line $.\n"; } } close INP; } elsif ( $argtyp eq "o" ) { $newext = $ARGV[1]; } elsif ( $argtyp eq "r" ) { $renc = $ARGV[1]; } elsif ( $argtyp eq "w" ) { $wenc = $ARGV[1]; } shift; shift; } if ( @ARGV == 1 && $ARGV[0] =~ /^(-\|stdin)$/ ) { shift; while (<>) { chomp; push( @filelist, $_ ); } } else { @filelist = @ARGV; } die $Usage unless ( @regexs && @filelist ); $renc = "utf8" if ( $renc eq "" ); $wenc = $renc if ( $wenc eq "" ); listEncodings() unless ( grep( /$renc/, @enclist ) && grep( /$wenc/, @ +enclist )); foreach my $i ( 0 .. $#regexs ) { $_ = $regexs[$i]; if ( exists( $regfile[$i] ) and $renc ne "utf8" ) { unless ( eval "\$_ = decode( \$renc, \$_, Encode::FB_CROAK )" ) { warn "Bad character data for $renc in $regfile[$i]: skipped th +at regex\n"; next; } } push @subs, eval "sub { s$_ }" or die $@; } $/ = undef; # nullify record-separator variable, to read whole file(s) + into a scalar foreach my $file ( @filelist ) { unless ( -f $file ) { warn "$file is not a data file -- skipped this file\n"; next; } unless ( -s _ < MAX_FILE_SIZE ) { warn "$file is too big (break it up or use some other tool) -- ski +pped this file\n"; next; } unless ( open( INP, $file ) && ( $newext ne ".resub.$ENV{USER}.$$" + \|\| -w $file )) { warn "Unable to open $file for editing -- skipped this file\n"; next; } binmode( INP, ":utf8" ) if ( $renc eq "utf8" ); # (see pod::perlu +niinto man page) $_ = <INP>; # $_ will contain all data in $file close INP; unless ( $renc eq "utf8" or eval "\$_ = decode( \$renc, \$_, Encod +e::FB_CROAK )" ) { warn "Bad character data for $renc in $file -- skipped this file\n +"; next; } foreach my $resub ( @subs ) { $resub->(); } unless ( $wenc eq "utf8" or eval "\$_ = encode( \$wenc, \$_, Encod +e::FB_CROAK )" ) { warn "$file contains characters unwritable in $wenc -- skipped thi +s file\n"; next; } my $newfile = $file.$newext; unless ( open( OUT, ">$newfile" )) { warn "Unable to open output file -- $file unchanged\n"; next; } binmode( OUT, ":utf8" ) if ( $wenc eq "utf8" ); # (see pod::perlu +niinto man page) print OUT or die "Unable to write data to $newfile -- $file unchan +ged"; close OUT or die "Ack! Error occurred when closing $newfile -- $fi +le unchanged"; my $mode = (stat $file)[2] & 0777; chmod $mode,$newfile; rename $newfile, $file if ( $newext eq ".resub.$ENV{USER}.$$" ); } sub listEncodings { # user is asking for help: list all available encodings my $colwidth = length( (sort {length($b) <=> length($a)} @enclist) +[0] ) + 2; my $ncol = int( 80/$colwidth ); my $nrow = int( scalar(@enclist)/$ncol ); $nrow++ if ( scalar(@enclist) % $ncol ); my $fmt = "%-${colwidth}s"; print $Usage; foreach my $r ( 0 .. $nrow ) { foreach my $c ( 0 .. $ncol ) { my $i = $c * $nrow + $r; printf( $fmt, $enclist[$i] ); } print "\n"; } exit( 0 ); }

Comment on resub Download Code


Do you know where your variables are?
	PerlMonks