Re: Re: Orthography Translation using Regex

A couple minor points about this script (unrelated to the main theme of the thread):

First, @ARGV is your friend -- use it to get input and output file names from the command line. Here's one way to do it:

my $Usage = "Usage:  $0  infile  outfile\n";

# open input and output files
die $Usage unless ( @ARGV == 2 );

open( IN, '<', $ARGV[0] ) or die "Unable to read $ARGV[0]: $!\n$Usage";
open( OUT, '>', $ARGV[1] ) or die "Unable to write $ARGV[1]: $!\n$Usage";

...

You have problems in both of your "until (open(...))" loops, which would be avoided if you used @ARGV (because you wouldn't need those loops at all). In the first "until" loop, if the output file ever really fails to open, there's no exit from the loop -- not good. In the second one (for getting an input file name), you forgot to "chomp" the user input that you read inside the loop, so the name passed to open still ends in a newline and the loop will never succeed (unless a file name happens to end with a newline character) -- also not good.
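
If you do want to keep an interactive prompt, here is a rough sketch of a loop that avoids both problems: it chomps the name before calling open, and it gives up after a fixed number of attempts instead of looping forever. The helper name prompt_for_input and its $max_tries parameter are made up for illustration; they are not from your script.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): read file names
# from the handle $in until one of them can be opened for reading, giving
# up after $max_tries attempts. Returns ( handle, name ) on success and an
# empty list on failure.
sub prompt_for_input {
    my ( $in, $max_tries ) = @_;
    for my $try ( 1 .. $max_tries ) {
        print STDERR "Input file name: ";
        my $name = <$in>;
        last unless defined $name;    # end of input: stop asking
        chomp $name;                  # remove the trailing newline before open()
        next unless length $name;     # an empty answer uses up one attempt
        if ( open( my $fh, '<', $name ) ) {
            return ( $fh, $name );
        }
        warn "Unable to read $name: $!\n";
    }
    return;                           # no usable name was given
}
```

A caller would pass \*STDIN and a small limit such as 3, and die with a usage message when the returned list is empty.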

For that matter, you could do without open statements altogether -- just use while (<>) to read input (from a named file or from stdin), and just print to STDOUT. Let the users decide if/when to redirect these to or from a disk file (e.g. as opposed to piping data to/from other processes):

converter.pl < some.input > some.output
# or
some_process | converter.pl | another_process
# or any combination of the above...
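
To make the filter idea concrete, here is a minimal sketch. The two-entry %name table is a placeholder (the real table comes from your script), and filter_stream is a made-up name; the handles are passed as parameters only so the routine can be tried on any pair of handles. In a standalone script the loop header would simply be while (<>) with a plain print.

```perl
use strict;
use warnings;

# Placeholder mapping for illustration only; the real %name table
# is the one defined in the original script.
my %name = ( 'a' => 'A', 'b' => 'B' );

# Copy everything from $in to $out, replacing each character that has
# an entry in %$map and passing all other characters through unchanged.
sub filter_stream {
    my ( $in, $out, $map ) = @_;
    while ( my $line = <$in> ) {
        for my $char ( split //, $line ) {
            print {$out} ( exists $map->{$char} ? $map->{$char} : $char );
        }
    }
    return;
}

# In a standalone filter the last line of the script would be:
#   filter_stream( \*STDIN, \*STDOUT, \%name );
```

With that final call enabled, the script reads whatever arrives on stdin and writes to stdout, so the redirection and piping forms shown above work without any open calls.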

As for the main "while()" loop, it can be expressed more compactly without loss of clarity:

while (<IN>) {
    my @chars = split //;
    for (@chars) {   # $_ now holds one char per iteration
        my $out = ( exists $name{$_} ) ?  $name{$_} : $_;
        print $out;
    }
}
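
The per-character loop only works when every key in %name is a single character. If some keys are longer (which seems to be the case for the post that started this thread), one possible alternative is a single substitution over an alternation of all the keys. This is only a sketch under that assumption; build_translator is a made-up name, and the closure it returns is just one way to package the regex.

```perl
use strict;
use warnings;

# Hypothetical helper: given a hash ref mapping input strings to output
# strings, return a code ref that applies the whole mapping to a string.
sub build_translator {
    my ($map) = @_;

    # With no keys there is nothing to match, so return the text unchanged.
    return sub { return $_[0] } unless %$map;

    # Longest keys first, so a multi-character key wins over its own prefix.
    # quotemeta() keeps characters such as "." from acting as regex syntax.
    my $pattern = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) || $a cmp $b } keys %$map;
    my $re = qr/($pattern)/;

    return sub {
        my ($text) = @_;
        $text =~ s/$re/$map->{$1}/g;    # replaced text is not scanned again
        return $text;
    };
}
```

Inside the read loop this would be used as: my $tr = build_translator( \%name ); and then print $tr->($_); for each line.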

Finally, you may want to look at "perldoc enc2xs", which gives a nice clear explanation of how to roll your own encoding modules that can be used in combination with Encode (i.e. on a par with "iso-8859-1" or "koi8-r"), to convert back and forth between Unicode and your own particular non-Unicode character set. It's actually pretty simple, provided that your mapping involves just one character of output for each character of input (which is not true for the OP that started this thread, unfortunately).

If you're the same Anonymous Monk who posted the first reply to the script, I don't expect this will help with the problem you mentioned (only handling small files) -- maybe you need to start your own SoPW thread on that...
