format text which mixed english and chinese characters.

Category:	Text Processing
Author/Contact Info	shijialee gmail.com
Description:	At wiki.perlchina.org we translate lots of nice English Perl articles into Chinese. Many articles are submitted badly formatted, especially where Chinese characters and English text are squeezed together. This is a quick hack that adds one space to separate English words/digits from Chinese (before and after the Chinese characters), so that instead of {chinese}etc{chinese} you get {chinese} etc {chinese}.
#!/usr/bin/perl -w
use strict;

my $format;
while (<DATA>) {
    my $line = $_;
    # this line has letters or digits, so try to separate
    if ($line =~ /[a-zA-Z\d]/) {
        LINE: while (1) {
            # ^(([\x81-\xfe][\x40-\xfe])+) is a greedy match on a run of chinese.
            # line starts with chinese
            if ($line =~ /^(([\x81-\xfe][\x40-\xfe])+)/ && $1) {
                my $e = $1;
                # update line
                $line = $';
                $format .= ($line =~ /^\s/) ? $e : $e." ";
            }
            elsif ($line =~ /(.*?)([\x81-\xfe][\x40-\xfe])/) {
                # line starts with non-chinese chars
                my $e = $1;
                # the second match eats one chinese char, so get it back.
                $line = $2.$';
                # add space after english char if it doesn't end with a space.
                $format .= ($e =~ /\s$/) ? $e : $e." ";
            }
            else {
                $format .= $line;
                last LINE;
            }
        }
    }
    else {
        $format .= $line;
    }
}
print $format;
__DATA__
在 2004 年7月的开源大会时能比CPython更快的执行Python的字节码
“Perl现在生机勃勃
Perl 6被建议不应只作为Perl的新的实现
perl 5.8.x,仍然生机勃勃,Jarkko Hietaniemi今年的早些时候把在2003年10月，发布了
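
The two-byte range the script tests for can be probed in isolation. The following is a minimal sketch, not part of the posted script: it assumes Perl 5.8 or later with the core Encode module, assumes the text is CP936 (the post does not name an encoding), and uses a hypothetical helper name, is_double_byte. It shows that the byte pattern accepts an ideograph, rejects plain ASCII, and also accepts a fullwidth comma.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # the sample literals below are written in UTF-8
use Encode qw(encode);

# Hypothetical helper (an assumption, not from the post): true if the byte
# string is exactly one two-byte sequence in the range the script accepts.
sub is_double_byte {
    my ($bytes) = @_;
    return $bytes =~ /\A[\x81-\xfe][\x40-\xfe]\z/ ? 1 : 0;
}

# an ideograph, an ASCII letter, and a fullwidth comma (U+FF0C)
my @samples = ( '中', 'a', '，' );

for my $char (@samples) {
    my $bytes = encode( 'cp936', $char );      # characters -> CP936 bytes
    my $hex   = uc( unpack( 'H*', $bytes ) );  # hex dump of those bytes
    printf "%-4s %s\n", $hex,
        is_double_byte($bytes) ? 'two-byte' : 'single-byte';
}

# prints:
# D6D0 two-byte
# 61   single-byte
# A3AC two-byte
```

Because the fullwidth comma lands in the same byte range as the ideographs, a byte-level test like this cannot tell punctuation from Chinese text without extra rules.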

Re: format text which mixed english and chinese characters. by graff (Chancellor) on Feb 20, 2005 at 21:49 UTC
A few points:

- You don't say so, but your script is hard-coded to handle text that uses CP936 encoding for Chinese. It would probably work with other GB-based encodings as well as Big5, which all use the same basic strategy, but it would go wrong if the input text turned out to be any sort of Unicode.

- All the encodings for Chinese (including Unicode) have a section of code points for "wide" versions of the ASCII characters: in addition to the single-byte ASCII digits, alphabet, punctuation marks and brackets, there are two-byte renderings for these characters also -- but your code treats all 2-byte characters as "Chinese". (It looks like there's a two-byte comma in the last line of your DATA.)

- The code could be written more simply, especially if you have Perl 5.8.x and convert the text to internal utf8 before applying regexes; depending on what version of Perl you're using, the Unicode handling might slow it down noticeably (probably only a problem with 5.8.0 and 5.8.1), but you gain a lot in clarity and maintainability.

Here's how the code could look if the data is converted to utf8 internally -- I'm also using simpler logic: split the input strings into chunks of ideographic and non-ideographic characters, then re-join the chunks, adding spaces where necessary. This will produce slightly different output than the code you posted, especially where the input text contains "fullwidth" (2-byte) versions of ASCII characters, but it might be easier to tweak in order to make the spacing come out the way you want.
#!/usr/bin/perl -w
use strict;

# NOTE: use a pipe or redirection to feed input data to this script
binmode( STDIN, ":encoding(cp936)" );
binmode( STDOUT, ":encoding(cp936)" );
# (you could add a command-line option to select
# a different input/output character encoding)

while (<>) {
    # first, convert any "fullwidth" ascii characters to normal ascii
    # (ff01-ff5e is the unicode range for "fullwidth ascii", and it
    # can be transferred directly to the ascii range 0x21-0x7e):
    tr/\x{ff01}-\x{ff5e}/!-~/;

    # now split into chunks: ideographic vs. non-ideographic
    # (note that we put capturing parens around the split regex):
    my @chunks = split /(\p{Ideographic}+)/;

    # put the chunks back together, adding spaces to non-ideographics as needed
    my $out = '';
    if ( @chunks == 1 ) {
        $out = shift @chunks;
    }
    else {
        for ( my $i=0; $i <= $#chunks; $i++ ) {
            $chunks[$i] =~ s/([!-~])$/$1 / unless $i == $#chunks;
            $chunks[$i] =~ s/^([!-~])/ $1/ unless $i == 0;
            $out .= $chunks[$i];
        }
    }
    print $out;
}
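
As one possible tweak, the same spacing rule can be expressed without splitting into chunks. This is only a sketch of a variant, not the code above: it swaps the split/re-join loop for two lookaround substitutions, assumes the string has already been decoded to Perl characters, and wraps the logic in a hypothetical sub named space_mixed so it can be tried on single strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # the sample literal below is written in UTF-8

# Hypothetical helper (an assumption, not from the thread): takes an
# already-decoded string and returns it with one space inserted wherever
# a printable ASCII character directly touches an ideograph.
sub space_mixed {
    my ($text) = @_;

    # fold "fullwidth ascii" (U+FF01-U+FF5E) down to plain ascii (0x21-0x7e)
    $text =~ tr/\x{ff01}-\x{ff5e}/!-~/;

    # ascii immediately followed by an ideograph: insert a space between them
    $text =~ s/(?<=[!-~])(?=\p{Ideographic})/ /g;

    # ideograph immediately followed by ascii: insert a space between them
    $text =~ s/(?<=\p{Ideographic})(?=[!-~])/ /g;

    return $text;
}

binmode( STDOUT, ":encoding(UTF-8)" );
print space_mixed("Perl 6被建议不应只作为Perl的新的实现\n");
# prints: Perl 6 被建议不应只作为 Perl 的新的实现
```

Existing whitespace is left alone, because a space is neither in the [!-~] range nor ideographic, so neither substitution fires next to it. Like the chunk-based version, this puts a space on both sides of a folded fullwidth comma, which may or may not be the spacing you want.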
Re^2: format text which mixed english and chinese characters. by Qiang (Friar) on Feb 21, 2005 at 03:57 UTC
I have been wanting to read up on Unicode, as I don't have any knowledge of it. The regex I used came from the web. I don't have Perl 5.8 to test the code (does it require Encode?), but I am sure I can use it later. Thanks again!
