in reply to format text which mixed english and chinese characters.

A few points:

Here's how the code could look if the data is converted to utf8 internally. I'm also using simpler logic: split each input line into chunks of ideographic and non-ideographic characters, then re-join the chunks, adding spaces where necessary.

This will produce slightly different output than the code you posted, especially where the input text contains "fullwidth" (2-byte) versions of ASCII characters, but it might be easier to tweak in order to make the spacing come out the way you want.

#!/usr/bin/perl -w
use strict;

# NOTE: use a pipe or redirection to feed input data to this script
binmode( STDIN, ":encoding(cp936)" );
binmode( STDOUT, ":encoding(cp936)" );
# (you could add a command-line option to select
# a different input/output character encoding)

while (<>) {
    # first, convert any "fullwidth" ascii characters to normal ascii
    # (ff01-ff5e is the unicode range for "fullwidth ascii", and it
    # can be transferred directly to the ascii range 0x21-0x7e):
    tr/\x{ff01}-\x{ff5e}/!-~/;

    # now split into chunks: ideographic vs. non-ideographic
    # (note that we put capturing parens around the split regex):
    my @chunks = split /(\p{Ideographic}+)/;

    # put the chunks back together, adding spaces to non-ideographics as needed
    my $out = '';
    if ( @chunks == 1 ) {
        $out = shift @chunks;
    }
    else {
        for ( my $i=0; $i <= $#chunks; $i++ ) {
            $chunks[$i] =~ s/([!-~])$/$1 / unless $i == $#chunks;
            $chunks[$i] =~ s/^([!-~])/ $1/ unless $i == 0;
            $out .= $chunks[$i];
        }
    }
    print $out;
}
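If you want to try the spacing logic without preparing cp936 input, something like the following might help. It is only a sketch: the sub name space_mixed_line and the sample string are my own inventions for illustration, and it assumes the line has already been decoded into Perl characters. The steps are the same as in the loop above (fold fullwidth ASCII, split with capturing parens, pad the ASCII edges of each chunk).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): applies the same
# steps as the loop body to one already-decoded line and returns the result.
sub space_mixed_line {
    my ($line) = @_;

    # fold fullwidth ASCII (U+FF01..U+FF5E) down to plain ASCII (0x21..0x7E)
    $line =~ tr/\x{ff01}-\x{ff5e}/!-~/;

    # the capturing parens make split return the ideographic runs too,
    # so the list alternates non-ideographic / ideographic chunks
    my @chunks = split /(\p{Ideographic}+)/, $line;
    return $line if @chunks <= 1;

    my $out = '';
    for my $i ( 0 .. $#chunks ) {
        my $chunk = $chunks[$i];
        # pad a printable-ASCII edge with a space, except at the line's ends
        $chunk =~ s/([!-~])$/$1 / unless $i == $#chunks;
        $chunk =~ s/^([!-~])/ $1/ unless $i == 0;
        $out .= $chunk;
    }
    return $out;
}

# sample: two ideographs, "abc", one ideograph, fullwidth "AB", one ideograph
my $sample = "\x{4e2d}\x{6587}abc\x{4e2d}\x{ff21}\x{ff22}\x{6587}\n";
binmode( STDOUT, ":encoding(UTF-8)" );
print space_mixed_line($sample);
```

With that sample, "abc" and the folded "AB" each come out with one space on either side, and the ideographic runs are left untouched. Keeping the logic in a sub also makes it easier to tweak the two substitutions until the spacing looks the way you want.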

Re^2: format text which mixed english and chinese characters.
by Qiang (Friar) on Feb 21, 2005 at 03:57 UTC
    I have been wanting to read up on Unicode, as I don't have any knowledge of it. The regex I used came from the web.

    I do not have Perl 5.8 to test the code (does it require Encode?), but I am sure I could use it later. Thanks again!