String match in Chinese character

hankcoder has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have run out of ideas for how to match the text between two Chinese characters. With English characters, I can easily find the match with $content =~ m/\[(.*?)\]/sig, but how can I do the match when working with Chinese characters?

I'm not sure which approach is best. Should I use Base64 encoding, hex encoding, or some other encoding?

Example:

....看【厂家直销儿童加绒加厚打底裤中小童冬季】Ib.....

I want to match any string between 【 and 】.

My current practice is to Base64- or hex-encode input strings, especially non-English ones, and then store them in files.

While writing this question, I see that perlmonks.org encodes my Chinese characters as &#12304; for 【 and &#12305; for 】. I forgot what this encoding is called, as I haven't used it for a very long time. Would this be the best method?

** UPDATE TO THE INITIAL QUESTION **

I am updating my question with my latest finding: my actual problem is caused by encoding during form submission, not by the string match. I have copied my follow-up post here, or you can go directly to it (http://www.perlmonks.org/?node_id=1210729). If it is more appropriate to ask in a new, separate post, do let me know and I will do that. Thanks.

OK, here is a cleaner, self-contained Perl script with an inline FORM submit. Make sure the form action value is "utf8_encode.pl", or change it as you wish. For a direct test, use the Chinese character "【" as an example. For result #3, I use unpack. I previously tried several ways and they all give the same result: the single Chinese character "【", when split, becomes 3 chars: 227,128,144.

I still don't quite understand the explanations given, though I'm almost getting the hang of it.

If I can get the encoding solved, I think I should be able to get the decoding working as well.

The string match will be done in separate processing, where the code looks like this: my ($result) = $str =~ m/\&\#12304\;(.*?)\&\#12305\;/sig;

#!/usr/bin/perl
################################################################################
#
#
################################################################################
use CGI ':standard';
use HTML::Entities; #-- for encode and decode string


(%FORM) = ();
if ($ENV{'REQUEST_METHOD'} eq "POST")
{
    my ($id);
    #-- extract the value inside param into %FORM hash
    foreach $id (param)
        {
            $FORM{$id} = param($id);
        }
} # // if post

print "Content-Type: text/html; charset=utf-8\n\n";

print "<h2>Encode UTF-8 Chinese Character Input</h2><br>";

print &input_form;


#---------------------------------------------------#
#---------------------------------------------------#
sub input_form
{
    my ($content) = "";
    
    my ($value) = "";
    if ($FORM{'data'} ne "")
        {
            $value = $FORM{'data'};
        }
        
    my ($encoded_value) = "";
    my ($process_content) = "";
    
    if ($FORM{'action'} eq "encode")
        {
            $encoded_value = $FORM{'encoded_value'};
            
            # !! attempt to do encoding inside perl but the $FORM{'data'} when split,
            # it become 3 char for Chinese char !!
            my (@arr) = split(//,$FORM{'data'});
            foreach my $c (@arr)
                {
                    $c = unpack('C*', $c);
                    $process_content .= "$c\n";
                }
        }
    elsif ($FORM{'action'} eq "decode")
        {
        }
        
    
    #-- content ---------------------------------
    $content = qq~
    <script type="text/javascript">
    function encodeCN(id) {
        var tstr = document.getElementById(id).value;
        var bstr = '';
        for(i=0; i<tstr.length; i++)
        {
            if(tstr.charCodeAt(i)>127)
            {
                bstr += '&#' + tstr.charCodeAt(i) + ';';
            }
            else
            {
                bstr += tstr.charAt(i);
            }
        }
        document.getElementById('encoded_value').value = bstr;
    }
    </script>
    
    <form id="fr_in" name="fr_in" action="utf8_encode.pl" style="" method="POST" enctype="application/x-www-form-urlencoded">
    <input type="hidden" onFocus="this.blur()" name="convert" id="convert" value="">
    <input type="hidden" onFocus="this.blur()" name="action" id="action" value="">
    <input type="hidden" name="encoded_value" id="encoded_value" value="">

    <textarea id="data" name="data" style="width:600px; height:200px;">$value</textarea>
    <br>
<xmp>
1. FORM submitted value:
$value

2. Encoded value thru JS before form submit:
$encoded_value

3. *Try to do encoding inside Perl*
$process_content
</xmp>
    

    <input type="button" value="Encode" onClick="encodeCN('data'); document.getElementById('action').value='encode'; this.form.submit();">
    <input type="button" value="Decode" onClick="document.getElementById('action').value='decode'; this.form.submit();">
    </form>
    ~;
    #--// content -------------------------------
    
    return ($content);
}

Replies are listed 'Best First'.
Re: String match in Chinese character by choroba (Cardinal) on Mar 11, 2018 at 22:26 UTC
It works for me the same way as with "English" characters. Just don't forget to tell Perl that you want to read UTF-8 from the source and input files, or use it for output.

#! /usr/bin/perl
use warnings;
use strict;
use utf8;

my $string = '看【厂家直销儿童加绒加厚打底裤中小童冬季】Ib';

binmode STDOUT, ':encoding(UTF-8)';

while ($string =~ /【(.*?)】/g) {
    print "Match: $1\n";    # $1 holds the text captured between the brackets
}
Re^2: String match in Chinese character by hankcoder (Scribe) on Mar 11, 2018 at 22:43 UTC
Thanks, choroba, for the reply. The suggested method is how I used to code a long time ago. However, I run into an issue with my editor: I must save the .pl file as UTF-8 in order to hard-code something like while ($string =~ /【(.*?)】/g). Furthermore, there were some other issues which I can't remember, so I gave up using Chinese characters directly in my code.
Re^3: String match in Chinese character by pryrt (Abbot) on Mar 11, 2018 at 23:57 UTC
Then escape the Unicode in the regex:

use warnings;
use strict;
use utf8;

my $string = '看【厂家直销儿童加绒加厚打底裤中小童冬季】Ib';

binmode STDOUT, ':encoding(UTF-8)';

while ($string =~ /\x{3010}(.*?)\x{3011}/g) {
    print "Match: $1\n";    # $1 holds the text captured between the brackets
}

(I left the `use utf8` in so that I could easily include the same string that choroba did. However you get the string is fine.)
Re^4: String match in Chinese character by 1nickt (Canon) on Mar 12, 2018 at 01:42 UTC
Re^2: String match in Chinese character by hankcoder (Scribe) on Mar 12, 2018 at 11:44 UTC
Thank you for all the help, guys. I have just noticed that the Chinese characters were messed up during FORM submit encoding: "【" should be 12304 when encoded, but it is split into 3 parts: 227,128,144. I have not yet found out how to join "227,128,144" into "12304". I have narrowed it down to the FORM URI-safe encoding causing this. My current test code has become too messy to post here. If anyone has any idea, I would really appreciate it if you could point out the most likely cause.

For the moment, I use a JavaScript function that calls ".charCodeAt" before form submit to make each encoded character look like "&#12304;" for "【"; only then can I use a string match in Perl to extract the strings between "&#12304;" and "&#12305;".

In case you guys are interested in the JS, here is the code:

function encodeCN(id) {
    var tstr = document.getElementById(id).value;
    var bstr = '';
    for (var i = 0; i < tstr.length; i++) {
        if (tstr.charCodeAt(i) > 127) {
            bstr += '&#' + tstr.charCodeAt(i) + ';';
        } else {
            bstr += tstr.charAt(i);
        }
    }
    document.getElementById(id).value = bstr;
}
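The same numeric-reference encoding can be sketched on the Perl side. This is a minimal illustration, not the poster's code: the helper names `encode_ncr` and `decode_ncr` are hypothetical, and both assume the input is already a decoded character string rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character above U+007F with a decimal
# numeric character reference, mirroring the JavaScript encodeCN() above.
# Assumes $text is a decoded character string, not raw UTF-8 bytes.
sub encode_ncr {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# Hypothetical helper: turn decimal references such as &#12304; back
# into the characters they stand for.
sub decode_ncr {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

my $sample  = "\x{3010}abc\x{3011}Ib";    # U+3010 and U+3011 are the two brackets
my $encoded = encode_ncr($sample);
print "$encoded\n";                       # prints &#12304;abc&#12305;Ib
```

Note that Perl's `ord` returns a full code point, whereas JavaScript's `charCodeAt` returns UTF-16 code units, so the two only agree for characters in the Basic Multilingual Plane.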
Re^3: String match in Chinese character by Corion (Patriarch) on Mar 12, 2018 at 12:26 UTC
You will most likely either need to look at the `Content-Type` header of your form submission request or, if that fails, guess. I think in the past browsers used to submit form data in the same encoding as the page the HTML form was on, but I hope that nowadays, with fairly recent browsers, they always send the content character set/encoding with the request:

Content-Type: text/html; charset=utf-8

Ideally, your framework already looks at that and uses the appropriate Encode::decode call, but I'm not sure what your users' browsers actually send and whether that can be decoded without problems.

Maybe seeing some more of the code that receives the input and of the HTML that is used to display the FORM to the client can help us narrow the problem down somewhat.
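As a rough sketch of what such a decode step looks like, the following assumes the submitted bytes really are UTF-8 (the byte string below is made up for illustration and stands in for a form value) and uses `decode` from the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as a CGI script might receive them for a bracketed input,
# assuming the browser submitted the form as UTF-8.
my $bytes = "\xE3\x80\x90abc\xE3\x80\x91";

# Decode the UTF-8 bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

# Before decoding, length counts bytes; after decoding, it counts characters.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);   # 9 and 5

# The first character is now the single code point 12304 (U+3010).
printf "first code point: %d\n", ord(substr($chars, 0, 1));

# Matching works on the decoded string with escaped code points.
my ($inner) = $chars =~ /\x{3010}(.*?)\x{3011}/s;
print "inner: $inner\n";                                                # abc
```

CGI.pm also documents a `-utf8` import pragma that decodes parameters for you; whether it suits a given script (for example one that also handles binary uploads) is something to check against its documentation.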
Re^3: String match in Chinese character by pryrt (Abbot) on Mar 12, 2018 at 13:15 UTC
"where "【" should be 12304 when encoded but it become splited into 3 parts: 227,128,144"

Yes, that is properly encoded UTF-8. Codepoints from U+0800 to U+FFFF are to be encoded with 3 bytes. The codepoint 12304, which is 0x3010 in hex, usually using the U+3010 notation for Unicode, should be encoded as the three bytes 0xE3, 0x80, 0x90. Working it out:

Codepoint 12304 (codepoint 0x3010 in hex, U+3010)

hex   0x3010
hex      3    0    1    0
bin   0011 0000 0001 0000
      xxxx yyyy yyzz zzzz      (use x, y, and z to indicate the groups of bits in the codepoints)

encoding: ....xxxx ..yyyyyy ..zzzzzz   (use xyz as above; use dots . to indicate bits specified in UTF-8 encoding)
bin       11100011 10000000 10010000
hex             E3       80       90
dec            227      128      144

... which is what you listed.

(This is, btw, why Corion told you to look for the `charset=utf-8` in the `Content-type`, because he recognized those three bytes were the appropriate UTF-8 encoding of the `LEFT BLACK LENTICULAR BRACKET (U+3010)`.)
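The bit shuffling above can be checked with a few lines of Perl. This is only a sketch: `utf8_three_bytes` is a hypothetical helper written for this illustration, it handles only the three-byte range, and it does not reject the surrogate range U+D800..U+DFFF. The result is compared against the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode one code point in the range U+0800..U+FFFF
# as three UTF-8 bytes, following the xxxx / yyyyyy / zzzzzz split above.
sub utf8_three_bytes {
    my ($cp) = @_;
    die "code point outside the 3-byte range\n" if $cp < 0x800 || $cp > 0xFFFF;
    return (
        0xE0 | ($cp >> 12),            # 1110xxxx
        0x80 | (($cp >> 6) & 0x3F),    # 10yyyyyy
        0x80 | ($cp & 0x3F),           # 10zzzzzz
    );
}

my @manual  = utf8_three_bytes(0x3010);
my @encoded = unpack 'C*', encode('UTF-8', chr(0x3010));

print "manual: @manual\n";     # 227 128 144
print "Encode: @encoded\n";    # 227 128 144
```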
Re: String match in Chinese character by hankcoder (Scribe) on Mar 12, 2018 at 14:36 UTC
(The text and script of this post are reproduced in full in the update to the original question above.)
Re: String match in Chinese character by Anonymous Monk on Mar 12, 2018 at 13:49 UTC
Also, the OP contains a subtle error. As per utf8, the `use utf8` pragma only instructs the Perl compiler to expect UTF-8 in the source code. It does not affect the handling of input or output files.
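A small sketch of that distinction, assuming the script file itself is saved as UTF-8. The in-memory filehandle is only a stand-in for a real output file, chosen to keep the example self-contained:

```perl
use strict;
use warnings;

# "use utf8" is lexically scoped and only changes how literals in the
# source file are read: one character with it, three bytes without it.
my $with    = do { use utf8; length('【') };
my $without = do { no utf8;  length('【') };
print "length with use utf8: $with, without: $without\n";

# Output still needs its own encoding layer, whatever "use utf8" says.
# Here an in-memory file stands in for a real output file.
my $text   = "\x{3010}abc\x{3011}";
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$out} $text;
close $out or die "close failed: $!";

printf "characters: %d, bytes written: %d\n", length($text), length($buffer);
```

The same `:encoding(UTF-8)` layer can be given to `binmode` for STDOUT or used with `<` when opening input files.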
Re^2: String match in Chinese character by Your Mother (Archbishop) on Mar 12, 2018 at 17:09 UTC
The OP does not even contain the string "use utf8" so, no. If you meant choroba's and pryrt's and 1nickt's code you can 1) assume it's right if you don't know any better because it's them, 2) try the code without it to see what happens so you learn something. Option 2 is impossible though, so…
Re^3: String match in Chinese character by hankcoder (Scribe) on Mar 13, 2018 at 00:16 UTC
Your Mother, I assume you are replying to Anonymous Monk, right?
Re^4: String match in Chinese character by Your Mother (Archbishop) on Mar 13, 2018 at 01:28 UTC

