
A few points:

You don't say so, but your script is hard-coded to handle text that uses the CP936 encoding for Chinese. It would probably work with other GB-based encodings as well as Big5, which all use the same basic strategy, but it would go wrong if the input text turned out to be in any Unicode encoding (UTF-8, UTF-16, etc.).
All the encodings for Chinese (including Unicode) have a section of code points for "wide" versions of the ASCII characters: in addition to the single-byte ASCII digits, letters, punctuation marks and brackets, there are two-byte renderings of these same characters. Your code treats all two-byte characters as "Chinese", so these get handled as if they were ideographs. (It looks like there's a two-byte comma in the last line of your DATA.)
The code could be written more simply, especially if you have Perl 5.8.x and convert the text to Perl's internal utf8 representation before applying regexes. Depending on which version of Perl you're using, the Unicode handling might slow things down noticeably (probably only a problem with 5.8.0 and 5.8.1), but you gain a lot in clarity and maintainability.
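
To make the second and third points concrete, here is a minimal sketch (not taken from your script) that decodes a few CP936 bytes with the core Encode module and reports what kind of character each one is. The sample bytes and the `classify` helper are my own inventions for the illustration; the point is only that, once the text is decoded, a two-byte comma is easy to tell apart from a real ideograph.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the CP936 bytes for U+4E2D (an ideograph),
# U+FF0C (the fullwidth comma), followed by a plain ASCII "a".
my $bytes = "\xD6\xD0\xA3\xACa";

# Convert the raw bytes to Perl's internal character representation.
my $text = decode( 'cp936', $bytes );

# Illustrative helper (an assumption, not part of the original script):
# label a single decoded character.
sub classify {
    my ($char) = @_;
    return 'ideograph'       if $char =~ /\p{Ideographic}/;
    return 'fullwidth ascii' if $char =~ /[\x{ff01}-\x{ff5e}]/;
    return 'other';
}

# Print the code point and label for each character.
for my $char ( split //, $text ) {
    printf "U+%04X %s\n", ord($char), classify($char);
}
```

A byte-oriented test for "is this a two-byte character?" would put the first two characters in the same bucket; the property-based test does not.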

Here's how the code could look if the data is converted to utf8 internally. I'm also using simpler logic: split each input string into chunks of ideographic and non-ideographic characters, then re-join the chunks, adding spaces where necessary.

This will produce slightly different output than the code you posted, especially where the input text contains "fullwidth" (2-byte) versions of ASCII characters, but it might be easier to tweak in order to make the spacing come out the way you want.

#!/usr/bin/perl -w

use strict;

# NOTE: use a pipe or redirection to feed input data to this script

binmode( STDIN, ":encoding(cp936)" );
binmode( STDOUT, ":encoding(cp936)" );

# (you could add a command-line option to select
# a different input/output character encoding)

while (<>)
{
# first, convert any "fullwidth" ascii characters to normal ascii
# (ff01-ff5e is the unicode range for "fullwidth ascii", and it
# can be transferred directly to the ascii range 0x21-0x7e):

    tr/\x{ff01}-\x{ff5e}/!-~/;  

# now split into chunks: ideographic vs. non-ideographic
# (note that we put capturing parens around the split regex):

    my @chunks = split /(\p{Ideographic}+)/;

# put the chunks back together, adding spaces to non-ideographics as needed

    my $out = '';
    if ( @chunks == 1 ) {
        $out = shift @chunks;
    } else {
        for ( my $i=0; $i <= $#chunks; $i++ ) {
            $chunks[$i] =~ s/([!-~])$/$1 / unless $i == $#chunks;
            $chunks[$i] =~ s/^([!-~])/ $1/ unless $i == 0;
            $out .= $chunks[$i];
        }
    }
    print $out;
}
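
If you want to try the chunking logic without setting up CP936 input, here is a rough sketch that wraps the loop body above in a sub and runs it on an in-memory string. The name `space_mixed` and the sample string are just assumptions for the demo. I've also dropped the single-chunk special case, since the loop already leaves a lone chunk untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the loop body of the script above, so the
# logic can be exercised on an already-decoded string.
sub space_mixed {
    my ($line) = @_;

    # fold fullwidth ascii (U+FF01-U+FF5E) down to plain ascii
    $line =~ tr/\x{ff01}-\x{ff5e}/!-~/;

    # capturing parens keep the ideographic runs in the returned list
    my @chunks = split /(\p{Ideographic}+)/, $line;

    for my $i ( 0 .. $#chunks ) {
        # pad a chunk that ends in printable ascii, unless it is the last one
        $chunks[$i] =~ s/([!-~])$/$1 / unless $i == $#chunks;
        # pad a chunk that starts with printable ascii, unless it is the first
        $chunks[$i] =~ s/^([!-~])/ $1/ unless $i == 0;
    }
    return join '', @chunks;
}

# Made-up sample: ascii, two ideographs, ascii, a fullwidth comma, one ideograph
my $sample = "abc\x{4e2d}\x{6587}def\x{ff0c}\x{4e2d}";

binmode( STDOUT, ':encoding(UTF-8)' );
print space_mixed($sample), "\n";
```

For that sample the result is "abc", a space, the two ideographs, a space, "def," (the fullwidth comma having become a plain one), a space, and the last ideograph. A string with no ideographs at all comes back unchanged.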

In reply to Re: format text which mixed english and chinese characters. by graff
in thread format text which mixed english and chinese characters. by Qiang
