Identifying CJK Unified Characters

audioboxer has asked for the wisdom of the Perl Monks concerning the following question:

Hi Everyone!

I am trying to identify CJK Unified Characters in XML data I am parsing so that I can skip over them. I have no clue where to start and am looking for guidance. Web searches yield no helpful results (none that I understand, anyway). This is outside my beginner's realm.

Example characters: 端子

Thanks in advance.

Re: Identifying CJK Unified Characters by haukex (Archbishop) on May 28, 2020 at 06:37 UTC
You can use perluniprops to help identify characters in certain Unicode blocks and with certain properties. For the following to work, the XML file needs to correctly declare its encoding. Also note that the newer the Perl version the better, since later Perl versions include newer Unicode versions.

`in.xml`:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
      <test>Hello 端子 World</test>
      <test>Föö Bär</test>
    </root>

Code:

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use open qw/:std :utf8/;
    use XML::LibXML;

    my $dom = XML::LibXML->load_xml( location => 'in.xml' );
    for my $node ($dom->findnodes('//test')) {
        my $text = $node->textContent;
        print "Before: $text\n";
        $text =~ s/\p{Blk=CJK}//g;
        print "After: $text\n";
    }
    #$dom->toFile('out.xml', 1);

Output:

    Before: Hello 端子 World
    After: Hello  World
    Before: Föö Bär
    After: Föö Bär
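Since the original question was about skipping text that contains these characters rather than stripping them, here is a minimal sketch of that variant. It assumes that matching by script with `\p{Han}` is acceptable instead of matching by block; the script property also covers ideographs outside the main CJK Unified Ideographs block. The helper name `has_han` is made up for this illustration.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use utf8;                    # the string literals below contain UTF-8
use open qw/:std :utf8/;

# Hypothetical helper: true if the string contains any Han-script character.
sub has_han {
    my ($text) = @_;
    return $text =~ /\p{Han}/ ? 1 : 0;
}

for my $text ('Hello 端子 World', 'Föö Bär') {
    if ( has_han($text) ) {
        print "Skipping: $text\n";
        next;
    }
    print "Keeping:  $text\n";
}
```

In the XML loop above, the same test could be applied to `$node->textContent` to decide whether to process a node at all.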
Re^2: Identifying CJK Unified Characters by Anonymous Monk on May 28, 2020 at 15:37 UTC
Beautiful, thank you. Learned something new. I had to change it to `\p{Block: CJK_Unified_Ideographs}` because I was getting the error "Can't find Unicode property definition "Blk=CJK"".
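For anyone else whose Perl does not know the short `CJK` alias, the following sketch shows two spellings that avoid it: the long block name mentioned above, and an explicit character class over U+4E00 to U+9FFF, which is the code point range of the CJK Unified Ideographs block. The helper `strip_matching` is a hypothetical name used only for this example.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use open qw/:std :utf8/;

# Two spellings of the same block that do not rely on the short "CJK" alias.
my $by_name  = qr/\p{Block: CJK_Unified_Ideographs}/;
my $by_range = qr/[\x{4E00}-\x{9FFF}]/;    # the block's code point range

# Hypothetical helper: remove every match of $re from $text.
sub strip_matching {
    my ($text, $re) = @_;
    $text =~ s/$re//g;
    return $text;
}

my $sample = "Hello \x{7AEF}\x{5B50} World";    # "Hello 端子 World"
print "By name:  [", strip_matching($sample, $by_name),  "]\n";
print "By range: [", strip_matching($sample, $by_range), "]\n";
```

Both patterns cover only that one block, so ideographs in the extension blocks would not be matched by either.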
Re^3: Identifying CJK Unified Characters by haukex (Archbishop) on May 28, 2020 at 19:14 UTC
"Can't find Unicode property definition "Blk=CJK"" That means you're on a Perl version before 5.16, since that's when that was added with Unicode 6.1 (see also perl5160delta). Note that Perl 5.14.0 was released over 9 years ago and 5.14.4 over 7 years ago. Perl 5.14 was at Unicode 6.0, while the latest Perl, 5.30, supports Unicode 12.0, and the upcoming (hopefully within a month or two) Perl 5.32 will support Unicode 13.0. Especially since you're working with Unicode, I strongly recommend you upgrade your Perl version.	[reply]