in reply to Identifying CJK Unified Characters
You can use the properties documented in perluniprops to identify characters in particular Unicode blocks or with particular properties. For the following to work, the XML file needs to correctly declare its encoding. Also note that a newer Perl is better here, since each Perl release bundles a specific Unicode version, and later releases include newer ones.
in.xml:
    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <test>Hello 端子 World</test>
        <test>Föö Bär</test>
    </root>
Code:
    #!/usr/bin/env perl
    use warnings;
    use strict;
    use open qw/:std :utf8/;
    use XML::LibXML;

    my $dom = XML::LibXML->load_xml( location => 'in.xml' );
    for my $node ($dom->findnodes('//test')) {
        my $text = $node->textContent;
        print "Before: $text\n";
        $text =~ s/\p{Blk=CJK}//g;
        print "After: $text\n";
    }
    #$dom->toFile('out.xml', 1);
Output:
    Before: Hello 端子 World
    After: Hello  World
    Before: Föö Bär
    After: Föö Bär
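As a rough sketch of what the block property does and does not cover: `\p{Blk=CJK}` matches only the CJK Unified Ideographs block itself, so ideographs in the extension blocks are not removed by the substitution above. The snippet below compares it against `\p{Han}`, which matches by script, and prints the Unicode version your perl ships with via the core module Unicode::UCD. The three sample code points are my own choice for illustration, not something from the original question.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use open qw/:std :utf8/;
use Unicode::UCD ();

# Unicode version bundled with the running perl
print "Unicode version: ", Unicode::UCD::UnicodeVersion(), "\n";

# Illustrative samples (assumed, not from the original post):
#   U+7AEF is in the CJK Unified Ideographs block,
#   U+3400 is in CJK Unified Ideographs Extension A,
#   U+0061 is the Latin letter "a".
for my $char ("\x{7AEF}", "\x{3400}", "a") {
    printf "U+%04X  Blk=CJK: %-3s  Han: %s\n", ord($char),
        ( $char =~ /\p{Blk=CJK}/ ? "yes" : "no" ),
        ( $char =~ /\p{Han}/     ? "yes" : "no" );
}
```

U+7AEF should match both properties, U+3400 only `\p{Han}`, and the Latin letter neither. If the goal is to catch every Han ideograph rather than one block, the script property may be the better fit, but that depends on what "CJK Unified" is meant to include in your data.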
Re^2: Identifying CJK Unified Characters
by Anonymous Monk on May 28, 2020 at 15:37 UTC
by haukex (Archbishop) on May 28, 2020 at 19:14 UTC