
format text which mixed english and chinese characters.

by Qiang (Friar)
on Feb 20, 2005 at 04:18 UTC ( #432825=sourcecode )
Category: Text Processing
Author/Contact Info shijialee
Description: We translate lots of nice English Perl articles into Chinese. Many of the submitted articles are badly formatted, especially where Chinese characters and English text are squeezed together.

This is a quick hack that adds one space to separate English words/digits from Chinese (before and after the Chinese characters), so that instead of {chinese}etc{chinese} you get {chinese} etc {chinese}.

#!/usr/bin/perl -w

use strict;

my $format;
while (<DATA>) {
    my $line = $_;
    # this line contains ascii letters or digits, so try to separate them
    if ($line =~ /[a-zA-Z\d]/) {
        LINE: while (1) {
            # (([\x81-\xfe][\x40-\xfe])*) is a greedy match on a run of
            # two-byte chinese characters; the pattern is unanchored, so
            # $1 is non-empty only when the line starts with chinese
            if ($line =~ /(([\x81-\xfe][\x40-\xfe])*)/ && $1) {
                my $e = $1;
                # update line
                $line = $';
                $format .= ($line=~ /^\s/) ? $e : $e." ";
            } elsif ($line =~ /(.*?)([\x81-\xfe][\x40-\xfe])/) {
                # line starts with non-chinese chars
                my $e = $1;
                # the second match eats one chinese char, so get it back
                $line = $2.$';
                # add space after english char if it doesn't end with whitespace
                $format .= ($e=~/\s$/) ? $e : $e." ";
            } else {
                $format .= $line;
                last LINE;
            }
        }
    } else {
        $format .= $line;
    }
}

print $format;

__DATA__
在 2004 年7月的开源大会时能比CPython更快的执行Python的字节码
Perl 6被建议不应只作为Perl的新的实现
perl 5.8.x,仍然生机勃勃,Jarkko Hietaniemi今年的早些时候把
Replies are listed 'Best First'.
Re: format text which mixed english and chinese characters.
by graff (Chancellor) on Feb 20, 2005 at 21:49 UTC
    A few points:
    • You don't say so, but your script is hard-coded to handle text that uses CP936 encoding for Chinese. It would probably work with other GB-based encodings as well as Big5, which all use the same basic strategy, but it would go wrong if the input text turned out to be any sort of unicode.
    • All the encodings for Chinese (including unicode) have a section of code points for "wide" versions of the ASCII characters: in addition to the single-byte ASCII digits, alphabet, punctuation marks and brackets, there are two-byte renderings for these characters also -- but your code treats all 2-byte characters as "Chinese". (It looks like there's a two-byte comma in the last line of your DATA.)
    • The code could be written more simply, especially if you have Perl 5.8.x and convert the text to internal utf8 before applying regexes; depending on what version of Perl you're using, the unicode might slow it down noticeably (probably only a problem with 5.8.0 and 5.8.1), but you gain a lot in clarity and maintainability.
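To make the second point concrete, here is a minimal sketch (not part of the original posts) that assumes the Encode module shipped with Perl 5.8. The test bytes are hand-picked for illustration: A3 AC is the CP936 two-byte comma and D6 D0 is the ideograph U+4E2D. The byte-level pattern counts both as "Chinese", while the decoded string lets the two be told apart.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative CP936 input: two-byte comma (A3 AC), then U+4E2D (D6 D0)
my $bytes = "\xA3\xAC\xD6\xD0";

# The byte-level pattern from the original script matches both pairs
my @pairs = $bytes =~ /([\x81-\xfe][\x40-\xfe])/g;

# After decoding to characters, only one of them is an ideograph
my $chars      = decode('cp936', $bytes);
my @ideographs = $chars =~ /(\p{Ideographic})/g;

# The fullwidth comma (U+FF0C) folds down to a plain ASCII comma
(my $narrow = $chars) =~ tr/\x{ff01}-\x{ff5e}/!-~/;

printf "byte pairs: %d, ideographs: %d\n", scalar @pairs, scalar @ideographs;
# prints "byte pairs: 2, ideographs: 1"
```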

    Here's how the code could look if the data is converted to utf8 internally -- I'm also using simpler logic: split the input strings into chunks of ideographic and non-ideographic characters, then re-join the chunks, adding spaces where necessary.

    This will produce slightly different output than the code you posted, especially where the input text contains "fullwidth" (2-byte) versions of ASCII characters, but it might be easier to tweak in order to make the spacing come out the way you want.

    #!/usr/bin/perl -w
    use strict;

    # NOTE: use a pipe or redirection to feed input data to this script
    binmode( STDIN, ":encoding(cp936)" );
    binmode( STDOUT, ":encoding(cp936)" );
    # (you could add a command-line option to select
    # a different input/output character encoding)

    while (<>) {
        # first, convert any "fullwidth" ascii characters to normal ascii
        # (ff01-ff5e is the unicode range for "fullwidth ascii", and it
        # can be transferred directly to the ascii range 0x21-0x7e):
        tr/\x{ff01}-\x{ff5e}/!-~/;

        # now split into chunks: ideographic vs. non-ideographic
        # (note that we put capturing parens around the split regex):
        my @chunks = split /(\p{Ideographic}+)/;

        # put the chunks back together, adding spaces to non-ideographics as needed
        my $out = '';
        if ( @chunks == 1 ) {
            $out = shift @chunks;
        }
        else {
            for ( my $i=0; $i <= $#chunks; $i++ ) {
                $chunks[$i] =~ s/([!-~])$/$1 / unless $i == $#chunks;
                $chunks[$i] =~ s/^([!-~])/ $1/ unless $i == 0;
                $out .= $chunks[$i];
            }
        }
        print $out;
    }
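For trying the chunk-and-rejoin logic without any CP936 input, the same steps can be wrapped in a function that takes an already-decoded string. This is only a sketch: the name `space_chunks` is made up for illustration, and the sample strings are fragments of the DATA lines from the original post, written here as UTF-8 literals.

```perl
use strict;
use warnings;
use utf8;    # the string literals below are written in UTF-8

# Hypothetical helper: the chunk-and-rejoin logic from the script above,
# applied to a character string instead of a filehandle.
sub space_chunks {
    my ($text) = @_;

    # fold fullwidth ASCII (U+FF01-U+FF5E) down to plain ASCII
    $text =~ tr/\x{ff01}-\x{ff5e}/!-~/;

    # capturing parens make split return the ideographic runs as well
    my @chunks = split /(\p{Ideographic}+)/, $text;

    for my $i ( 0 .. $#chunks ) {
        # pad a chunk only where it touches a neighbouring chunk
        $chunks[$i] =~ s/([!-~])$/$1 / unless $i == $#chunks;
        $chunks[$i] =~ s/^([!-~])/ $1/ unless $i == 0;
    }
    return join '', @chunks;
}

binmode STDOUT, ':encoding(UTF-8)';
print space_chunks("Perl 6被建议"), "\n";              # Perl 6 被建议
print space_chunks("更快的执行Python的字节码"), "\n";  # 更快的执行 Python 的字节码
```

Chunks that already begin or end with whitespace are left alone, because the substitutions only fire on a printable ASCII character at the chunk boundary.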
      I have been wanting to read up on Unicode, as I don't have any knowledge of it. The regex I used came off the web.

      I do not have Perl 5.8 to test the code (does it require Encode?), but I am sure I could use it later. Thanks again!
