audioboxer has asked for the wisdom of the Perl Monks concerning the following question:

Hi Everyone!

I am trying to identify CJK Unified Characters in XML data I am parsing so that I can skip over them. I have no clue where to start and am looking for guidance. Web searches yield no helpful results (none that I understand, anyway). This is outside of my beginner's realm.

Example characters: 端子

Thanks in advance.

Re: Identifying CJK Unified Characters
by haukex (Archbishop) on May 28, 2020 at 06:37 UTC

    You can use perluniprops to help identify characters in certain Unicode blocks and with certain properties. For the following to work, the XML file needs to correctly declare its encoding. Also note that the newer the Perl version the better, since later Perl versions have the newer Unicode versions included.

    in.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
    	<test>Hello 端子 World</test>
    	<test>Föö Bär</test>
    </root>
    

    Code:

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use open qw/:std :utf8/;
    use XML::LibXML;

    my $dom = XML::LibXML->load_xml( location => 'in.xml' );
    for my $node ($dom->findnodes('//test')) {
        my $text = $node->textContent;
        print "Before: $text\n";
        $text =~ s/\p{Blk=CJK}//g;
        print "After: $text\n";
    }
    #$dom->toFile('out.xml', 1);

    Output:

    Before: Hello 端子 World
    After: Hello  World
    Before: Föö Bär
    After: Föö Bär
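
    Since the original question was about skipping text that contains these characters rather than deleting them, here is a minimal sketch of a test-and-skip variant. It assumes the strings have already been decoded to Perl characters (as XML::LibXML's textContent does); has_cjk is a hypothetical helper name, and the long block name is used so that it does not depend on the short CJK alias.

    ```perl
    #!/usr/bin/env perl
    use warnings;
    use strict;
    use utf8;                    # this source file contains UTF-8 literals
    use open qw/:std :utf8/;     # encode STDOUT/STDERR as UTF-8

    # has_cjk() is a hypothetical helper, not part of any module.
    # It returns 1 if the string contains at least one character from
    # the "CJK Unified Ideographs" block, 0 otherwise.
    sub has_cjk {
        my ($text) = @_;
        return $text =~ /\p{Block: CJK_Unified_Ideographs}/ ? 1 : 0;
    }

    for my $text ('Hello 端子 World', 'Föö Bär') {
        if ( has_cjk($text) ) {
            print "Skipping: $text\n";
            next;
        }
        print "Keeping: $text\n";
    }
    ```

    Note that this block is only one of several CJK-related blocks; if you also need the extension blocks, matching on the script with \p{Han} may be a better fit, but that is a broader test than the one shown above.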
    
      Beautiful, thank you. Learned something new. I had to change it to {Block: CJK_Unified_Ideographs} because I was getting the error: Can't find Unicode property definition "Blk=CJK"
        "Can't find Unicode property definition "Blk=CJK""

        That means you're on a Perl version before 5.16: the short block alias CJK was added in Unicode 6.1, and Perl 5.16 was the first release to support that Unicode version (see also perl5160delta). Note that Perl 5.14.0 was released over 9 years ago and 5.14.4 over 7 years ago. Perl 5.14 was at Unicode 6.0, while the latest Perl, 5.30, supports Unicode 12.1, and the upcoming (hopefully within a month or two) Perl 5.32 will support Unicode 13.0. Especially since you're working with Unicode, I strongly recommend you upgrade your Perl version.
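
        To see which Unicode version a given perl was built with, something like the following sketch may help. It uses the core Unicode::UCD module and sticks to the long block name, which should work on older Perls as well; the sample string is only an illustration, written with \x{...} escapes so the script itself stays ASCII.

        ```perl
        #!/usr/bin/env perl
        use warnings;
        use strict;
        use Unicode::UCD ();

        # Report the Perl version and the Unicode version it ships with.
        printf "Perl %vd, Unicode %s\n", $^V, Unicode::UCD::UnicodeVersion();

        # U+7AEF and U+5B50 are the two example characters from the question.
        my $text = "Hello \x{7AEF}\x{5B50} World";

        # The long block name avoids the short "CJK" alias, which needs
        # Unicode 6.1 (Perl 5.16 or later).
        ( my $stripped = $text ) =~ s/\p{Block: CJK_Unified_Ideographs}//g;
        print "$stripped\n";
        ```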