Re: Orthography Translation using Regex

I needed to do a similar task, converting pre-Unicode Mongolian text to Unicode. The full script is available from http://students.washington.edu/blanch/downloads/encodingConverter.pl (the hash shown below is only a sample). You can drop your own hash in, and it should run with little or no modification.

#!/usr/local/bin/perl

# encodingConverter.pl
# Duane L. Blanchard
# http://students.washington.edu/blanch/downloads/
# blanch@iname.com

use strict;
use warnings;
#use utf8;
use charnames ':full';

#hash tables for each encoding must be at the top
#Hash table Keys: Cyrillic Chars, Values: Unicode Char Names
my %name = (
# Lowercase
"ŕ" => "\N{CYRILLIC SMALL LETTER A}",
"á" => "\N{CYRILLIC SMALL LETTER BE}",
"â" => "\N{CYRILLIC SMALL LETTER VE}",
"ă" => "\N{CYRILLIC SMALL LETTER GHE}",

# Uppercase
"Ŕ" => "\N{CYRILLIC CAPITAL LETTER A}",
"Á" => "\N{CYRILLIC CAPITAL LETTER BE}",
"Â" => "\N{CYRILLIC CAPITAL LETTER VE}",
"Ă" => "\N{CYRILLIC CAPITAL LETTER GHE}",
);

# Open the input file
my $inFile;
until(open(OUTFILE, ">outFile.txt"))
{
    print("\n$inFile could not be found.");
}

print("What file would you like to convert? \n");
$inFile = <stdin>;    #query user for input file
chomp $inFile;

until(open(inFile, "$inFile")) 
{
    print("\n$inFile could not be found.",
    " Please provide the absolute path. \n");
    $inFile = <stdin>;
}

while (<inFile>) 
{
    my $line = $_; # $_ is a line of text
    my @array = split ("", $line); # $_ is now a character

    for (@array)
    {
        if (exists $name{$_})    # check the hash for $_
        {    
            print OUTFILE $name{$_}; # print the Unicode value of $_
        }
        else
        {
            print OUTFILE "$_";    # preserves English
        }
    }
}

close OUTFILE;

print "\nYour converted text is in:\n",
        ">> outFile.txt.\n\n";
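
The character-by-character lookup in the script above can also be written as a single regex substitution, in keeping with the thread title. The following is only a sketch: the byte values used as hash keys (0xE0, 0xE1, 0xC0, 0xC1) and the sub name convert_line are assumptions made for illustration, not taken from the full script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical sample table: legacy byte => Unicode character.
# The byte values are an assumption for illustration only.
my %name = (
    "\xE0" => "\N{CYRILLIC SMALL LETTER A}",
    "\xE1" => "\N{CYRILLIC SMALL LETTER BE}",
    "\xC0" => "\N{CYRILLIC CAPITAL LETTER A}",
    "\xC1" => "\N{CYRILLIC CAPITAL LETTER BE}",
);

# Replace every character that has a table entry; leave the rest alone.
# /s lets "." match the newline too, /e evaluates the replacement as code.
sub convert_line {
    my ($line) = @_;
    $line =~ s/(.)/exists $name{$1} ? $name{$1} : $1/gse;
    return $line;
}

# Write the converted characters out as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print convert_line("\xC1\xE0 ok\n");
```

Writing the keys as "\xE0" escapes keeps the table independent of the encoding the script file itself is saved in, and the explicit output layer avoids "Wide character" warnings when printing.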

Re: Re: Orthography Translation using Regex by graff (Chancellor) on Mar 01, 2004 at 09:15 UTC
A couple of minor points about this script (unrelated to the main theme of the thread):

First, @ARGV is your friend -- use it to get input and output file names from the command line. Here's one way to do it:

    my $Usage = "Usage: $0 infile outfile\n";

    # open input and output files
    die $Usage unless ( @ARGV == 2 );
    open( IN, $ARGV[0] ) or die "Unable to read $ARGV[0]: $!\n$Usage";
    open( OUT, ">$ARGV[1]" ) or die "Unable to write $ARGV[1]: $!\n$Usage";
    ...

You have problems in both of your "until (open(...))" loops, which would be avoided if you use @ARGV (because you don't need those loops at all). In your first "until" loop, if there ever really is a failure to open the output file, there's no exit from that loop -- not good. As for the second one (for getting an input file name), you forgot to "chomp" the user input that you read inside the loop, which means the loop will never succeed (unless a file name happens to contain a final newline character) -- also not good.

For that matter, you could do without open statements altogether -- just use `while (<>)` to read input (from a named file or from stdin), and just print to STDOUT. Let the users decide if/when to redirect these to or from a disk file (e.g. as opposed to piping data to/from other processes):

    converter.pl < some.input > some.output
    # or
    some_process | converter.pl | another_process
    # or any combination of the above...

As for the main "while()" loop, it can be expressed more compactly without loss of clarity:

    while (<IN>) {
        my @chars = split //;
        for (@chars) {    # $_ now holds one char per iteration
            my $out = ( exists $name{$_} ) ? $name{$_} : $_;
            print $out;
        }
    }

Finally, you may want to look at "perldoc enc2xs", which gives a nice clear explanation about how to roll your own encoding modules that can be used in combination with Encode (i.e. on a par with "iso-8859-1" or "koi8-r"), to convert back and forth between Unicode and your own particular non-Unicode character set. It's actually pretty simple, provided that your mapping involves just one character of output for each character of input (which is not true for the OP that started this thread, unfortunately).

If you're the same Anonymous Monk who posted the first reply to the script, I don't expect this will help with the problem you mentioned (only handling small files) -- maybe you need to start your own SoPW thread on that...
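
For comparison, here is a minimal sketch of the Encode route mentioned above, assuming the legacy data happens to be in an encoding that Encode already ships, such as cp1251. That is an assumption: the font encoding discussed in this thread may well need a custom module built with enc2xs instead. The sub name to_utf8 is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode legacy bytes into Perl characters, then encode them as UTF-8.
# The encoding name is passed in; 'cp1251' below is only a stand-in.
sub to_utf8 {
    my ($bytes, $legacy) = @_;
    return encode('UTF-8', decode($legacy, $bytes));
}

# Demonstration on a fixed byte string (0xE0 0xE1 in cp1251).
print to_utf8("\xE0\xE1", 'cp1251'), "\n";
```

In a filter-style script, the same sub could be called on each line read by `while (<>)`, with the result printed to STDOUT, so that redirection and piping are left to the user.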
Re: Re: Orthography Translation using Regex by Anonymous Monk on Mar 01, 2004 at 06:25 UTC
I just found that my script, which I am finishing just now, only handles short input files. I can't yet determine why.