Yep, I oversimplified it. Here's a sample (I quickly pulled code from a bunch of places in my workspace, so excuse the mess):

use strict;

sub extract
{
    my $str = shift;

    ##
    ## Define the possible character sets...
    ##
    my $regular_euc  = q/
        (?:\xa1[\xa1-\xff]) |
        (?:\xfe[\x00-\xfe]) |
        (?:[\xa2-\xfd][\x00-\xff])
    /;
 
    my $hankaku_kana = q/(?:\x8e[\xa1-\xdf])/;
    my $ascii        = q/(?:[\x20-\x7e])/;

    ##
    ## Confused? So am I!
    ##
    ## Basically, this is what it says:
    ##
    ## regular euc ( 2 bytes ) =>
    ##      \xa1 can be followed by range \xa1 to \xff OR
    ##      \xfe can be followed by range \x00 to \xfe OR
    ##      range \xa2 to \xfd can be followed by range \x00 to \xff
    ##
    ## user defined ( 3 bytes ) =>
    ##      \x8e can be followed by sequence that follows the
    ##      "regular euc" rule. -- this has been omitted. For
    ##      my purposes this will never be used.
    ##
    ## hankaku kana ( 2 bytes ) =>
    ##      \x8e can be followed by range \xa1 to \xdf.
    ##      (Notice that since the 2 bytes fall in the range of
    ##      "user defined" encoding, we match this AFTER "user defined".
    ##      So hankaku kana is matched only when the "user defined"
    ##      case fails)
    ##
    ## ascii ( 1 byte ) =>
    ##      range \x20 to \x7e. This only includes "printable"
    ##      ASCII
    ##

    $str =~ m<
        (
             $regular_euc  |
             $hankaku_kana |
             $ascii        
        )
    >gxo
}

sub to_regexp
{
    my @tokens = @_;

    my $regexp;
    foreach my $token ( @tokens ) {
        if( length( $token ) == 2 ) {
            $regexp .= sprintf(
               '(?:\x%s)', unpack( "H*", substr( $token, 0, 1 ) )
            );
            $regexp .= sprintf(
               '(?:\x%s)', unpack( "H*", substr( $token, 1, 1 ) )
            );
        } else {
            $regexp .= $token;
        }
    }
    return $regexp;
}

my $string  = "put some japanese ( euc ) string in here -- pm doesn't accept my input, unfortunately";
my $pattern = "place here a pattern -- yeah, if you're malicious enough this will break";

my @tokens = extract( $pattern );
my $byte_pattern = to_regexp( @tokens );

$string =~ s/$byte_pattern/some_new_pattern/g;
print $byte_pattern, "\n";
print $string. "\n";

As I said, this is a hack, and I'm well aware of that. But it serves my purpose.
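Since I couldn't paste real Japanese input, here is a rough, self-contained sketch of just the tokenizing step with the bytes written out as hex escapes. The helper name euc_tokens and the sample bytes are made up for illustration, and it uses precompiled qr// patterns rather than the q// strings with /x and /o above, so treat it as one possible way to check the character classes, not as the code I actually run.

```perl
use strict;
use warnings;

# Same three character classes as in extract(), precompiled with qr//.
my $regular_euc  = qr/\xa1[\xa1-\xff]|\xfe[\x00-\xfe]|[\xa2-\xfd][\x00-\xff]/;
my $hankaku_kana = qr/\x8e[\xa1-\xdf]/;
my $ascii        = qr/[\x20-\x7e]/;

# Hypothetical helper (not part of the code above): split a byte
# string into one token per EUC character.
sub euc_tokens {
    my $bytes = shift;
    # In list context, /g returns every captured token in order.
    return $bytes =~ /($regular_euc|$hankaku_kana|$ascii)/g;
}

# Sample input (an assumption): ASCII "a", the two-byte sequence
# \xa4\xa2, the hankaku kana sequence \x8e\xb1, then ASCII "b".
my @tokens = euc_tokens("a\xa4\xa2\x8e\xb1b");

# Print each token as hex so the byte boundaries are visible.
print join( " ", map { unpack( "H*", $_ ) } @tokens ), "\n";
# prints: 61 a4a2 8eb1 62
```

Two-byte tokens come back whole, which is what lets to_regexp() tell them apart from ASCII by checking the length.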

In reply to Example - Re(4): Programming in Japanese by lestrrat
in thread Pyuuta: Programming in Japanese by miyagawa
