Re: Identifying CJK Unified Characters

You can use the properties documented in perluniprops to identify characters in particular Unicode blocks or with particular properties. For the following to work, the XML file needs to correctly declare its encoding. Also note that the newer the Perl version the better, since later Perl versions include newer versions of Unicode.

in.xml:

<?xml version="1.0" encoding="UTF-8"?>
<root>
	<test>Hello 端子 World</test>
	<test>Föö Bär</test>
</root>

Code:

#!/usr/bin/env perl
use warnings;
use strict;
use open qw/:std :utf8/;
use XML::LibXML;

my $dom = XML::LibXML->load_xml( location => 'in.xml' );
for my $node ($dom->findnodes('//test')) {
    my $text = $node->textContent;
    print "Before: $text\n";
    $text =~ s/\p{Blk=CJK}//g;
    print "After: $text\n";
}
#$dom->toFile('out.xml', 1);

Output:

Before: Hello 端子 World
After: Hello  World
Before: Föö Bär
After: Föö Bär


Re^2: Identifying CJK Unified Characters by Anonymous Monk on May 28, 2020 at 15:37 UTC
Beautiful, thank you. Learned something new. I had to change it to \p{Block: CJK_Unified_Ideographs} because I was getting the error: Can't find Unicode property definition "Blk=CJK"
Re^3: Identifying CJK Unified Characters by haukex (Archbishop) on May 28, 2020 at 19:14 UTC
"Can't find Unicode property definition "Blk=CJK"" That means you're on a Perl version before 5.16, since that's when that was added with Unicode 6.1 (see also perl5160delta). Note that Perl 5.14.0 was released over 9 years ago and 5.14.4 over 7 years ago. Perl 5.14 was at Unicode 6.0, while the latest Perl, 5.30, supports Unicode 12.0, and the upcoming (hopefully within a month or two) Perl 5.32 will support Unicode 13.0. Especially since you're working with Unicode, I strongly recommend you upgrade your Perl version.	[reply]