Re: Identifying CJK Unified Characters

by haukex (Archbishop)
on May 28, 2020 at 06:37 UTC


in reply to Identifying CJK Unified Characters

You can use the Unicode properties documented in perluniprops to identify characters that belong to particular Unicode blocks or have particular properties. For the following to work, the XML file needs to declare its encoding correctly. Also note that the newer your Perl the better, since later Perl releases bundle newer versions of Unicode.

in.xml:

<?xml version="1.0" encoding="UTF-8"?>
<root>
	<test>Hello 端子 World</test>
	<test>Föö Bär</test>
</root>

Code:

#!/usr/bin/env perl
use warnings;
use strict;
use open qw/:std :utf8/;
use XML::LibXML;

my $dom = XML::LibXML->load_xml( location => 'in.xml' );
for my $node ($dom->findnodes('//test')) {
    my $text = $node->textContent;
    print "Before: $text\n";
    $text =~ s/\p{Blk=CJK}//g;
    print "After: $text\n";
}
#$dom->toFile('out.xml', 1);

Output:

Before: Hello 端子 World
After: Hello  World
Before: Föö Bär
After: Föö Bär
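
For anyone who wants to try the substitution without an XML file, here is a minimal sketch of the same idea as a standalone function. The name strip_cjk is made up for this illustration, and the sketch uses the long block name instead of the Blk=CJK shorthand, on the assumption that it may have to run on an older Perl. It is not the code from the post above, just the same regex in isolation.

```perl
use strict;
use warnings;
use utf8;    # the string literals below contain UTF-8 characters

# Hypothetical helper (not part of the original post): removes every
# character in the "CJK Unified Ideographs" block (U+4E00..U+9FFF).
sub strip_cjk {
    my ($text) = @_;
    $text =~ s/\p{Block: CJK_Unified_Ideographs}//g;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_cjk("Hello 端子 World"), "\n";   # ideographs removed, both spaces remain
print strip_cjk("Föö Bär"), "\n";            # Latin text is left untouched
```

Note that this block covers only U+4E00 to U+9FFF. The extension blocks, kana and Hangul are separate, so if the goal is "any Han character", the script property \p{Han} may be a better fit than the block property.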

Re^2: Identifying CJK Unified Characters
by Anonymous Monk on May 28, 2020 at 15:37 UTC
    Beautiful, thank you. Learned something new. I had to change it to {Block: CJK_Unified_Ideographs} because I was getting the error: Can't find Unicode property definition "Blk=CJK"
      "Can't find Unicode property definition "Blk=CJK""

      That means you're on a Perl version before 5.16, since the short block alias was added in Unicode 6.1, which Perl supports as of 5.16 (see also perl5160delta). Note that Perl 5.14.0 was released over 9 years ago and 5.14.4 over 7 years ago. Perl 5.14 was at Unicode 6.0, while the latest Perl, 5.30, supports Unicode 12.1, and the upcoming (hopefully within a month or two) Perl 5.32 will support Unicode 13.0. Especially since you're working with Unicode, I strongly recommend you upgrade your Perl version.
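
      To see which Unicode version a given Perl ships with, and whether a particular property name is understood, something along these lines may help. It is only a sketch: has_unicode_property is a hypothetical helper written for this example, and it simply tries to compile and use the pattern inside an eval. Unicode::UCD::UnicodeVersion() is the core function that reports the bundled Unicode version.

```perl
use strict;
use warnings;
use Unicode::UCD ();

# Hypothetical helper: true if this Perl accepts \p{$prop}.
# The pattern is both compiled and matched inside eval, because older
# Perls only report an unknown property when the match actually runs.
sub has_unicode_property {
    my ($prop) = @_;
    my $pattern = '\p{' . $prop . '}';
    my $ok = eval { my $re = qr/$pattern/; my $matched = "a" =~ $re; 1 };
    return $ok ? 1 : 0;
}

printf "Perl %vd, Unicode %s\n", $^V, Unicode::UCD::UnicodeVersion();
for my $prop ('Blk=CJK', 'Block: CJK_Unified_Ideographs') {
    printf "%-32s %s\n", $prop,
        has_unicode_property($prop) ? 'available' : 'not available';
}
```

      On a Perl before 5.16 the first line of the loop output should say "not available" for Blk=CJK, which matches the error reported above.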
