in reply to Re (tilly) 1: Pyuuta: Programming in Japanese
in thread Pyuuta: Programming in Japanese

Ruby seems really cool, but in some places its support base is so anti-Perl, I kinda hesitate to learn it :-) (Actually, I would start learning once the American O'Reilly comes up with a good book to read... I find Japanese technical books harder to understand.)

Personally, I haven't had much of a problem using regexps on Japanese characters. Of course, the approach I take is

Yes, it's kind of annoying, and yes, it's a hackish approach, but it works for me.

Update: posted code

Re: Re: Re (tilly) 1: Pyuuta: Programming in Japanese
by stefan k (Curate) on Oct 10, 2001 at 19:56 UTC

    is so anti-Perl, I kinda hesitate to learn it

    Well, I'm coming from Perl and I feel fine coding in Ruby. And it's not really anti-Perl; they took a good load of the good Perl stuff :)

    once the American O'Reilly comes up with a good book to read...

    I took "Programming Ruby" by David Thomas and Andrew Hunt (published by Addison-Wesley) and it gave me a very good start. The only problem I have is the small code base to look at...

    Regards... Stefan
    you begin bashing the string with a +42 regexp of confusion

Re (tilly) 3: Pyuuta: Programming in Japanese
by tilly (Archbishop) on Oct 10, 2001 at 20:25 UTC
    A problem with your approach.

    Kanji is a multi-byte character set. Perl can find a match that starts partway through one character and ends partway through the next, rather than on the characters you are looking for. With long strings it is not likely, but it is still possible, and confusing when it happens.
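A minimal sketch of that failure mode, assuming the string is EUC-JP and using byte ranges that are an assumption of this example rather than anything posted in the thread. The sample bytes, `$needle`, and `$euc_char` are hypothetical names chosen for illustration. The text is two 2-byte characters, and the needle is a different 2-byte character whose bytes happen to equal the second byte of the first character plus the first byte of the second.

```perl
use strict;
use warnings;

# Two 2-byte EUC-JP characters: A4 A2 followed by A4 A4.
my $text = "\xa4\xa2\xa4\xa4";

# A different 2-byte character whose bytes are A2 A4.
my $needle = "\xa2\xa4";

# Naive byte-level search: it matches, but in the middle of a character.
my $naive     = ( $text =~ /\Q$needle\E/ ) ? 1 : 0;
my $naive_pos = $naive ? $-[0] : -1;

# Assumed shapes of one EUC-JP character (illustrative ranges).
my $euc_char = qr/
      [\x00-\x7f]                   # ASCII, 1 byte
    | \x8e[\xa1-\xdf]               # half-width kana, 2 bytes
    | \x8f[\xa1-\xfe][\xa1-\xfe]    # 3-byte form
    | [\xa1-\xfe][\xa1-\xfe]        # regular 2-byte form
/x;

# Aligned search: skip whole characters from the start of the string,
# so the needle can only be tried on a character boundary.
my $aligned = ( $text =~ /\A(?:$euc_char)*?\Q$needle\E/ ) ? 1 : 0;

print "naive match: $naive at byte offset $naive_pos\n";   # 1 at offset 1
print "aligned match: $aligned\n";                         # 0
```

The naive search reports a hit at byte offset 1, which is not the start of any character. The anchored version walks the string one whole character at a time and so finds nothing.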

    As for Ruby, this book is quite good. And yes, there are morons who like Ruby and hate Perl. But my experience was that the core Ruby people (people like Matz and Dave Thomas) by and large didn't share that attitude.

    My personal take on Ruby is that it is an interesting language. I am glad I learned it. I think it is more cleanly structured than Perl, it is more cleanly extensible and I believe that I could more rapidly bring someone up to speed on Ruby than Perl. However it does not have Perl's broad application support, it lacks CPAN, you will have to train people, and I didn't find it compelling enough to recode an existing application base. The single biggest "Uh, oh" for me is that it doesn't have an equivalent to strict.pm.

    However learning Ruby made me see and understand certain aspects of Perl better, so even if I never use it, I still think it was a good thing to do.

Regex and Japanese
by Hanamaki (Chaplain) on Oct 10, 2001 at 21:05 UTC
    For some short introductions on how to use regular expressions with multibyte character sets, I would like to recommend Ken Lunde's excellent papers on this topic. Have a look at all the PDF files you will find in the Perl FTP directory for the book CJKV Information Processing.

    Hanamaki
Re: Re: Re (tilly) 1: Pyuuta: Programming in Japanese
by Hanamaki (Chaplain) on Oct 10, 2001 at 20:29 UTC
    Sorry, I don't get it. How is your approach supposed to work? Unless you forgot to mention one or two important steps, you definitely have big problems using regexps on EUC strings. How do you anchor your string and keep in sync (i.e. how do you know whether your byte is the only, first, or second byte of a character)?
    But maybe I misunderstood you, and it would be nice to see an example.

    Hanamaki
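One common way to stay in sync, sketched here under the assumption that the input is well-formed EUC-JP, is to consume the string from the beginning one whole character at a time with `\G`. The helper name `split_euc` and the byte ranges in `$euc_char` are assumptions made for this illustration, not something taken from the thread.

```perl
use strict;
use warnings;

# Assumed shapes of one EUC-JP character (illustrative ranges).
my $euc_char = qr/
      [\x00-\x7f]                   # ASCII, 1 byte
    | \x8e[\xa1-\xdf]               # half-width kana, 2 bytes
    | \x8f[\xa1-\xfe][\xa1-\xfe]    # 3-byte form
    | [\xa1-\xfe][\xa1-\xfe]        # regular 2-byte form
/x;

# Hypothetical helper: split a byte string into whole characters.
sub split_euc {
    my ($bytes) = @_;
    my @chars;
    pos($bytes) = 0;

    # \G pins each attempt to where the previous character ended, and
    # /c keeps that position when the next attempt fails.
    while ( $bytes =~ /\G($euc_char)/gc ) {
        push @chars, $1;
    }

    die "invalid EUC-JP at byte offset " . pos($bytes) . "\n"
        if pos($bytes) != length $bytes;
    return @chars;
}

# One ASCII byte, a 2-byte character, a half-width kana, a 2-byte character.
my @chars = split_euc("a\xa4\xa2\x8e\xb1\xa4\xa4");
print join( " ", map { unpack( "H*", $_ ) } @chars ), "\n";
# prints: 61 a4a2 8eb1 a4a4
```

Because every match has to begin exactly where the previous one ended, a byte is never inspected out of context. Its role (only, first, or second byte) follows from which alternative consumed it.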

      Yep, I oversimplified it. Here's a sample (I quickly pulled code from a bunch of places in my workspace, so excuse the mess):

      use strict;

      sub extract {
          my $str = shift;

          ##
          ## Define the possible character sets...
          ##
          my $regular_euc = q/
              (?:\xa1[\xa1-\xff])       |
              (?:\xfe[\x00-\xfe])       |
              (?:[\xa2-\xfd][\x00-\xff])
          /;
          my $hankaku_kana = q/(?:\x8e[\xa1-\xdf])/;
          my $ascii        = q/(?:[\x20-\x7e])/;

          ##
          ## Confused? So am I!
          ##
          ## Basically, this is what it says:
          ##
          ## regular euc ( 2 bytes ) =>
          ##    \xa1 can be followed by range \xa1 to \xff OR
          ##    \xfe can be followed by range \x00 to \xfe OR
          ##    range \xa2 to \xfd can be followed by range \x00 to \xff
          ##
          ## user defined ( 3 bytes ) =>
          ##    \x8e can be followed by sequence that follows the
          ##    "regular euc" rule. -- this has been omitted. For
          ##    my purposes this will never be used.
          ##
          ## hankaku kana ( 2 bytes ) =>
          ##    \x8e can be followed by range \xa1 to \xdf.
          ##    (Notice that since the 2 bytes fall in the range of
          ##    "user defined" encoding, we match this AFTER "user defined".
          ##    So hankaku kana is matched only when the "user defined"
          ##    case fails)
          ##
          ## ascii ( 1 byte ) =>
          ##    range \x20 to \x7e. This only includes "printable"
          ##    ASCII
          ##

          ## In list context this returns every token that was matched.
          $str =~ m<
              (
                  $regular_euc  |
                  $hankaku_kana |
                  $ascii
              )
          >gxo
      }

      sub to_regexp {
          my @tokens = @_;
          my $regexp = '';

          foreach my $token ( @tokens ) {
              if( length( $token ) == 2 ) {
                  ## Spell out each byte of a 2-byte character as \xNN.
                  $regexp .= sprintf( '(?:\x%s)', unpack( "H*", substr( $token, 0, 1 ) ) );
                  $regexp .= sprintf( '(?:\x%s)', unpack( "H*", substr( $token, 1, 1 ) ) );
              } else {
                  ## ASCII is passed through as-is, metacharacters included.
                  $regexp .= $token;
              }
          }
          $regexp;
      }

      my $string  = "put some japanese ( euc ) string in here -- pm doesn't accept my input, unfortunately";
      my $pattern = "place here a pattern -- yeah, if you're malicious enough this will break";

      my @tokens       = extract( $pattern );
      my $byte_pattern = to_regexp( @tokens );

      $string =~ s/$byte_pattern/some_new_pattern/g;

      print $byte_pattern, "\n";
      print $string, "\n";

      As I said, this is a hack. I'm well aware of that. But it serves my purpose.