http://qs321.pair.com?node_id=432825
Category: Text Processing
Author/Contact Info shijialee gmail.com
Description: We at wiki.perlchina.org translate a lot of good English Perl articles into Chinese. Many of the submitted articles are badly formatted, especially where Chinese characters and English text are squeezed together.

This is a quick hack that adds one space to separate English words/digits from Chinese (before and after the Chinese characters), so that instead of {chinese}etc{chinese} you get {chinese} etc {chinese}.

#!/usr/bin/perl -w

use strict;

my $format;
while (<DATA>) {
    my $line = $_;
    # this line contains letters or digits, so try to separate them
    if ($line =~ /[a-zA-Z\d]/) {
        LINE: while (1) {
            # (([\x81-\xfe][\x40-\xfe])*) greedy match on chinese.
            # line starts with chinese
            if ($line =~ /(([\x81-\xfe][\x40-\xfe])*)/ && $1) {
                my $e = $1;
                # update line
                $line = $';                
                $format .= ($line=~ /^\s/) ? $e : $e." ";
            } elsif ($line =~ /(.*?)([\x81-\xfe][\x40-\xfe])/) {
                # line starts with non-chinese chars
                my $e = $1;
                # the second match eats one chinese char, so get it back.
                $line = $2.$';
                # add a space after the english chars unless they already end with one.
                $format .= ($e=~/\s$/) ? $e : $e." ";
            } else {
                $format .= $line;
                last LINE;
            }
        }
    } else {
        $format .= $line;
    }
}

print $format;

__DATA__

在 2004 年7月的开源大会时能比CPython更快的执行Python的字节码
“Perl现在生机勃勃
Perl 6被建议不应只作为Perl的新的实现
perl 5.8.x,仍然生机勃勃,Jarkko Hietaniemi今年的早些时候把
在2003年10月,发布了
Re: format text which mixed english and chinese characters.
by graff (Chancellor) on Feb 20, 2005 at 21:49 UTC
    A few points:
    • You don't say so, but your script is hard-coded to handle text that uses the CP936 encoding for Chinese. It would probably work with other GB-based encodings as well as Big5, which all use the same basic strategy, but it would go wrong if the input text turned out to be any sort of Unicode.
    • All the encodings for Chinese (including Unicode) have a section of code points for "wide" versions of the ASCII characters: in addition to the single-byte ASCII digits, alphabet, punctuation marks and brackets, there are also two-byte renderings of these characters -- but your code treats all 2-byte characters as "Chinese". (It looks like there's a two-byte comma in the last line of your DATA.)
    • The code could be written more simply, especially if you have Perl 5.8.x and convert the text to internal utf8 before applying regexes. Depending on which version of Perl you're using, the Unicode handling might slow it down noticeably (probably only a problem with 5.8.0 and 5.8.1), but you gain a lot in clarity and maintainability.
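To make the first two points concrete, here is a minimal sketch, assuming Perl 5.8 or later with the core Encode module. The two byte pairs are illustrative samples chosen for this example, not taken from the posted DATA: one is an ideograph and the other is a fullwidth comma. Both pass the byte-level test used in the original script, but only one is ideographic once the bytes are decoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative CP936 byte pairs (assumed samples):
# an ideograph (U+4E2D) and a fullwidth comma (U+FF0C).
my @samples = ( "\xd6\xd0", "\xa3\xac" );

# Byte-level test from the original script: any such pair counts as "Chinese".
my $two_byte = qr/^[\x81-\xfe][\x40-\xfe]$/;

my @report;
for my $raw (@samples) {
    my $byte_match = ( $raw =~ $two_byte ) ? 1 : 0;
    my $char       = decode( 'cp936', $raw );    # bytes -> one Perl character
    my $ideograph  = ( $char =~ /^\p{Ideographic}$/ ) ? 1 : 0;
    push @report, sprintf "U+%04X two_byte=%d ideographic=%d",
        ord($char), $byte_match, $ideograph;
}
print "$_\n" for @report;
```

Working on decoded characters lets the regex ask what a character is rather than what its bytes look like, which is why the fullwidth comma can be told apart from real Chinese text.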

    Here's how the code could look if the data is converted to utf8 internally -- I'm also using simpler logic: split the input strings into chunks of ideographic and non-ideographic characters, then re-join the chunks, adding spaces where necessary.

    This will produce slightly different output than the code you posted, especially where the input text contains "fullwidth" (2-byte) versions of ASCII characters, but it might be easier to tweak in order to make the spacing come out the way you want.

    #!/usr/bin/perl -w
    use strict;

    # NOTE: use a pipe or redirection to feed input data to this script
    binmode( STDIN, ":encoding(cp936)" );
    binmode( STDOUT, ":encoding(cp936)" );
    # (you could add a command-line option to select
    # a different input/output character encoding)

    while (<>) {
        # first, convert any "fullwidth" ascii characters to normal ascii
        # (ff01-ff5e is the unicode range for "fullwidth ascii", and it
        # can be transferred directly to the ascii range 0x21-0x7e):
        tr/\x{ff01}-\x{ff5e}/!-~/;

        # now split into chunks: ideographic vs. non-ideographic
        # (note that we put capturing parens around the split regex):
        my @chunks = split /(\p{Ideographic}+)/;

        # put the chunks back together, adding spaces to non-ideographics as needed
        my $out = '';
        if ( @chunks == 1 ) {
            $out = shift @chunks;
        }
        else {
            for ( my $i=0; $i <= $#chunks; $i++ ) {
                $chunks[$i] =~ s/([!-~])$/$1 / unless $i == $#chunks;
                $chunks[$i] =~ s/^([!-~])/ $1/ unless $i == 0;
                $out .= $chunks[$i];
            }
        }
        print $out;
    }
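The capturing parentheses in the split regex are what keep the ideographic runs in the result. A small sketch of the difference, assuming Perl 5.8 or later; the sample string is a hypothetical line written with code-point escapes so that the encoding of the source file does not matter:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: "Perl 6", three ideographs (U+88AB U+5EFA U+8BAE), "Perl".
my $line = "Perl 6\x{88ab}\x{5efa}\x{8bae}Perl";

# With capturing parens, split returns the separators (the ideographic runs) too.
my @chunks = split /(\p{Ideographic}+)/, $line;

# Without them, the ideographic runs are thrown away.
my @plain = split /\p{Ideographic}+/, $line;

printf "with capture: %d chunks, without: %d chunks\n",
    scalar @chunks, scalar @plain;
```

With the separators kept, the chunks alternate between non-ideographic and ideographic text, so re-joining them only needs a check at each chunk boundary.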
      I have been meaning to read up on Unicode, as I don't know anything about it. The regex I used came from the web.

      I do not have Perl 5.8 to test the code (does it require Encode?), but I am sure I can use it later. Thanks again!