shamanoff has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise monks, I'm using the following regexp for substituting non-ascii characters with a question mark:

s/[[:^ascii:]]/\?/g;

My question is how to substitute all non-ascii characters with their hex-code using regexp. For example, required output would be: "This is string with non-ascii characters, like e2"

Re: hex in regexp
by jwkrahn (Abbot) on May 22, 2012 at 04:55 UTC

    The standard idiom is:

    s/([[:^ascii:]])/ sprintf '%02x', ord $1 /eg;
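As a minimal sketch of that idiom in a complete script: the helper name `hex_non_ascii` and the sample string are made up for illustration, and the input is assumed to already hold characters rather than raw bytes.

```perl
use strict;
use warnings;

# Hypothetical helper (name not from the thread): replace every
# non-ASCII character with the lowercase hex value of its code point.
sub hex_non_ascii {
    my ($text) = @_;
    # /e evaluates the replacement as Perl code; /g repeats it for every match.
    $text =~ s/([[:^ascii:]])/ sprintf '%02x', ord $1 /eg;
    return $text;
}

# "\x{e9}" is e-acute, "\x{ef}" is i-diaeresis.
print hex_non_ascii("caf\x{e9} na\x{ef}ve"), "\n";   # prints "cafe9 naefve"
```

Because `ord` returns the full code point, characters above 0xFF produce more than two hex digits here.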
Re: hex in regexp
by kcott (Archbishop) on May 22, 2012 at 03:47 UTC

    The standard way to do this is like the following:

    $ perl -Mstrict -Mwarnings -E '
    > my $x = q{X-X!X?X};
    > $x =~ s/([[:^upper:]])/unpack q{H2}, $1/eg;
    > say $x;
    > '
    X2dX21X3fX

    Note: I used upper instead of ascii for my example code.

    There may be modules that also supply this functionality. If there are, I'm sure other monks will provide details.

    -- Ken

      Thank you Ken! Your code almost worked. For some reason it didn't substitute the symbols with their hex codes, but just removed them from the string. Here is an example of the string I'm trying to modify: "çe quil Y a Yå". I am still testing to figure out why this happens. An additional question: how exactly can I wrap the hex code in <> brackets in this example?

        Changing upper to ascii and using your string, I get:

        $ perl -Mstrict -Mwarnings -E '
            my $x = q{çe quil Y a Yå};
            $x =~ s/([[:^ascii:]])/unpack q{H2}, $1/eg;
            say $x;
        '
        c383c2a7e quil Y a Yc383c2a5

        Check that you didn't make a typo when entering your code. If you are still having problems, please post your code - as it is, I can't reproduce your problem.

        To wrap the codes in < and >, or any other characters, you can just concatenate the characters at the beginning and end of the hex code:

        $ perl -Mstrict -Mwarnings -E '
            my $x = q{çe quil Y a Yå};
            $x =~ s/([[:^ascii:]])/q{<} . unpack(q{H2}, $1) . q{>}/eg;
            say $x;
        '
        <c3><83><c2><a7>e quil Y a Y<c3><83><c2><a5>

        -- Ken
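Several hex pairs per accented letter, as in the output above, are what you would expect when the string is processed as encoded bytes rather than as characters. The following sketch contrasts the two, assuming UTF-8 input and using the core Encode module; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two UTF-8 bytes for c-cedilla, followed by a plain "e".
my $bytes = "\xc3\xa7e";

# Working on raw bytes: every byte above 0x7f is replaced separately.
(my $per_byte = $bytes) =~ s/([[:^ascii:]])/'<' . unpack('H2', $1) . '>'/eg;

# Decoding first turns the two bytes into one character (U+00E7).
my $chars = decode('UTF-8', $bytes);
(my $per_char = $chars) =~ s/([[:^ascii:]])/sprintf '<%x>', ord $1/eg;

print "$per_byte\n";   # prints "<c3><a7>e"
print "$per_char\n";   # prints "<e7>e"
```

Which of the two is wanted depends on whether the reader of the output expects byte values or code points.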

Re: hex in regexp
by mbethke (Hermit) on May 22, 2012 at 06:35 UTC

    If you can be sure that all non-ASCII characters fit in 8 bits (in whatever codepage you and the receiver hopefully agree upon), the suggestions so far are fine. Otherwise I'd use this, adding a "%U" before each hex sequence to make it distinguishable from something like "face" or "decade" that consists only of valid hex digits (better, though not perfect unless you escape existing percent signs too):

    perl -Mutf8 -Mstrict -Mwarnings -E '
        my $x = "φοοβαρ"; # Greek "foobar"
        $x =~ s/([[:^ascii:]])/sprintf("%%U%x", unpack q{U}, $1)/eg;
        say $x;
    '
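The caveat about existing percent signs could be handled in the same pass. This is only a sketch: the helper name `escape_u` is hypothetical, and the "%U" format is the one suggested above, not an established standard.

```perl
use strict;
use warnings;

# Hypothetical helper: escape literal "%" as well as non-ASCII characters,
# so that a "%U" in the output always marks an escape we produced.
sub escape_u {
    my ($text) = @_;
    # One pass over the original text; replacements are never rescanned.
    $text =~ s/([%[:^ascii:]])/sprintf '%%U%x', ord $1/eg;
    return $text;
}

# "\x{3c6}" is Greek small letter phi.
print escape_u("50% \x{3c6}"), "\n";   # prints "50%U25 %U3c6"
```

The hex part is still variable-width, so an escape followed directly by a hex digit remains ambiguous to decode; a delimiter or fixed width would be needed for a reversible format.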
Re: hex in regexp
by Anonymous Monk on May 22, 2012 at 03:29 UTC

    Like this

    use CGI;
    my $freeStuff = CGI->escape( $stuff );

    or this

    use JSON;
    my $freeStuff = JSON->new->ascii(1)->pretty(1)->encode([ $stuff ]);

      Sorry, I did not understand your answer at all. You see, I just started to learn Perl and sometimes may not recognize the value of the example shown. My task is to get rid of the non-Ascii symbols which were mixed with Ascii symbols in text file with multiple rows. I am not sure that solutions mentioned in the first reply do the trick. (what is the meaning of $freeStuff and $stuff anyway...). Could you please provide more vivid example or direction.

        Sorry, I did not understand your answer at all.

        I find that hard to believe :)

        I am not sure that solutions mentioned in the first reply do the trick.

        That is the great thing about code, you can try it out, and see if it meets your needs

        what is the meaning of $freeStuff and $stuff

        :) bundt cake (:

        Could you please provide more vivid example or direction.

        :) L e f t e r (:

        See, I assumed an XY Problem. I assumed what you really wanted to do was use an already existing encoder/serializer instead of inventing your own, so I showed you two which use hex-encoded characters: one which only does strings (CGI->escape) and one which does complex data structures (JSON).

        I guess since this might be homework a practical solution doesn't fit :)
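To make the serializer route concrete, here is a small sketch that uses JSON::PP, the pure-Perl implementation bundled with Perl, as a stand-in for the JSON module; the sample value in `$stuff` is made up.

```perl
use strict;
use warnings;
use JSON::PP;   # bundled with Perl; assumed here in place of JSON

my $stuff = "caf\x{e9}";   # hypothetical input containing e-acute

# ascii(1) makes the encoder emit \uXXXX escapes for anything above 0x7f.
my $json = JSON::PP->new->ascii(1)->encode([ $stuff ]);

print "$json\n";   # prints ["caf\u00e9"]
```

The result is plain ASCII and can be turned back into the original string by any JSON decoder, which a hand-rolled hex substitution does not give you.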

Re: hex in regexp
by tinita (Parson) on May 22, 2012 at 11:09 UTC
    If you want to learn, look at the various regex suggestions and play around with them. If you want to do it right, use a module like URI::Escape (which can also escape UTF-8 strings, for example).