Re: format text which mixed english and chinese characters.

A few points:

You don't say so, but your script is hard-coded to handle text that uses CP936 encoding for Chinese. It would probably work with other GB-based encodings as well as Big5, which all use the same basic strategy, but it would go wrong if the input text turned out to be in any form of Unicode (e.g. UTF-8 or UTF-16).
All the encodings for Chinese (including Unicode) have a section of code points for "wide" versions of the ASCII characters: in addition to the single-byte ASCII digits, letters, punctuation marks and brackets, there are two-byte renderings of these same characters -- but your code treats all 2-byte characters as "Chinese". (It looks like there's a two-byte comma in the last line of your DATA.)
The code could be written more simply, especially if you have Perl 5.8.x and convert the text to Perl's internal utf8 before applying regexes. Depending on which version of Perl you're using, the Unicode handling might slow it down noticeably (probably only a problem with 5.8.0 and 5.8.1), but you gain a lot in clarity and maintainability.
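
To make the second point concrete, here is a minimal sketch (assuming Perl 5.8 or later with the Encode module) that decodes three CP936 characters and shows that a two-byte sequence is not necessarily an ideograph. The sub name classify_char and its category labels are made up for this illustration; they are not part of your script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Classify one decoded character.
# (classify_char and its labels are hypothetical, for illustration only.)
sub classify_char {
    my ($ch) = @_;
    return 'ideographic' if $ch =~ /\p{Ideographic}/;
    return 'fullwidth'   if $ch =~ /[\x{ff01}-\x{ff5e}]/;
    return 'ascii'       if $ch =~ /[\x00-\x7f]/;
    return 'other';
}

# CP936 bytes: ASCII "A" (1 byte), a fullwidth comma (A3 AC),
# and the ideograph U+4E2D (D6 D0).
my $bytes = "A\xA3\xAC\xD6\xD0";
my $text  = decode( 'cp936', $bytes );

# Print code points rather than the characters themselves,
# so no output encoding layer is needed.
for my $ch ( split //, $text ) {
    printf "U+%04X %s\n", ord($ch), classify_char($ch);
}
```

This should print U+0041 as ascii, U+FF0C as fullwidth and U+4E2D as ideographic: two of the three characters are two bytes long in CP936, but only one of them is Chinese.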

Here's how the code could look if the data is converted to utf8 internally -- I'm also using simpler logic: split the input strings into chunks of ideographic and non-ideographic characters, then re-join the chunks, adding spaces where necessary.

This will produce slightly different output than the code you posted, especially where the input text contains "fullwidth" (2-byte) versions of ASCII characters, but it might be easier to tweak in order to make the spacing come out the way you want.
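
The chunking step relies on a documented property of split: when the pattern contains capturing parentheses, the captured separators are returned along with the fields. A small sketch of just that step, using a made-up helper name (split_chunks) and a hard-coded test string in place of real input:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split an already-decoded string into alternating chunks.
# Because the pattern is captured, the ideographic runs are kept:
# even indexes hold non-ideographic text, odd indexes hold ideographs.
# (split_chunks is a hypothetical name used only in this sketch.)
sub split_chunks {
    my ($str) = @_;
    return split /(\p{Ideographic}+)/, $str;
}

# "abc", two ideographs (U+4E2D U+6587), "def"
my @chunks = split_chunks("abc\x{4e2d}\x{6587}def");

# Print chunk lengths instead of the chunks, to avoid needing
# an output encoding layer: prints "3,2,3"
print join( ',', map { length } @chunks ), "\n";
```

Note that a string starting with an ideograph yields an empty first chunk, which is harmless when the chunks are joined back together.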

#!/usr/bin/perl -w

use strict;

# NOTE: use a pipe or redirection to feed input data to this script

binmode( STDIN, ":encoding(cp936)" );
binmode( STDOUT, ":encoding(cp936)" );

# (you could add a command-line option to select
# a different input/output character encoding)

while (<>)
{
# first, convert any "fullwidth" ascii characters to normal ascii
# (ff01-ff5e is the unicode range for "fullwidth ascii", and it
# can be transferred directly to the ascii range 0x21-0x7e):

    tr/\x{ff01}-\x{ff5e}/!-~/;  

# now split into chunks: ideographic vs. non-ideographic
# (note that we put capturing parens around the split regex,
# so the ideographic chunks are returned too):

    my @chunks = split /(\p{Ideographic}+)/;

# put the chunks back together, adding spaces to non-ideographics as needed

    my $out = '';
    if ( @chunks == 1 ) {
        $out = shift @chunks;
    } else {
        for ( my $i=0; $i <= $#chunks; $i++ ) {
            $chunks[$i] =~ s/([!-~])$/$1 / unless $i == $#chunks;
            $chunks[$i] =~ s/^([!-~])/ $1/ unless $i == 0;
            $out .= $chunks[$i];
        }
    }
    print $out;
}
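
The comment in the script mentions a command-line option for choosing a different encoding. One way that might look, as a sketch only: the option name --encoding (alias -e) and the sub name pick_encoding are my own inventions, and the default is the cp936 used above. It would replace the two hard-coded binmode calls.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
use Encode qw(find_encoding);

# Read an optional --encoding NAME from @ARGV (GetOptions removes the
# options it recognizes), check that Encode knows the name, and return
# Encode's canonical name for it.
# (pick_encoding and the --encoding option are hypothetical.)
sub pick_encoding {
    my $name = 'cp936';
    GetOptions( 'encoding|e=s' => \$name )
        or die "usage: $0 [--encoding NAME] < input\n";
    my $enc = find_encoding($name)
        or die "unknown encoding: $name\n";
    return $enc->name;
}

my $enc = pick_encoding();
binmode( STDIN,  ":encoding($enc)" );
binmode( STDOUT, ":encoding($enc)" );
```

The rest of the loop stays the same, since everything after the binmode calls works on decoded characters and never looks at the original bytes.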

Re^2: format text which mixed english and chinese characters. by Qiang (Friar) on Feb 21, 2005 at 03:57 UTC
I have been wanting to read up on Unicode, as I don't have any knowledge of it. The regex I used came from the web. I do not have Perl 5.8 to test the code (does it require Encode?), but I am sure I could use it later. Thanks again!