ISO 8859-1 characters and \w \b etc.

Melroch has asked for the wisdom of the Perl Monks concerning the following question:

Re: ISO 8859-1 characters and \w \b etc. by graff (Chancellor) on Jun 28, 2004 at 02:01 UTC
If you have Perl 5.8, you could convert the input 8859-1 data to utf8, do the regex matching, and convert back to 8859-1 for output (assuming you don't want to just switch everything over to utf8 globally). Something like this would work:

`#!/usr/bin/perl
use strict;
use Encode;

while (<>) {
    # decode the Latin-1 bytes into characters so \w sees accented letters
    my $utf8 = decode( 'iso8859-1', $_ );
    my @words = ( $utf8 =~ /\b(\w+)\b/g );
    # encode each word back to Latin-1 for output
    print join "\n", map { encode( 'iso8859-1', $_ ) } @words;
    print "\n";
}`

The output is one "word" per line, treating accented letters as "\w", and things such as currency symbols, quotes, the inverted question mark, the non-breaking space, etc., as things that trigger "\b". Assuming the 8859-1 text is in a file, the example above can be run as follows (let's call the script "latin1-tokenizer"):

`latin1-tokenizer < latin1.txt > latin1.tkns`

That example could also be written without the encode/decode calls, using PerlIO layers instead:

`#!/usr/bin/perl
use strict;

# the input layer decodes Latin-1 on read; the output layer encodes it on write
open( my $in, "<:encoding(iso8859-1)", $ARGV[0] )
    or die "couldn't read $ARGV[0]: $!";
binmode STDOUT, ":encoding(iso8859-1)";

while (<$in>) {
    my @words = ( /\b(\w+)\b/g );
    print join "\n", @words;
    print "\n";
}

# run it like this: tokenizer latin1.txt > latin1.tkns`

(I'm unsure about posting test data with actual latin1 characters, so I leave it to you to try it on your own data.)
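Since the post above has no sample data, here is a minimal sketch of the decode/match/encode round trip on an in-memory string. The Latin-1 bytes are written as hex escapes so the script itself stays pure ASCII. The sample phrase and the variable names are made up for illustration and are not from graff's post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: "deja vu, que?" with accents, as raw ISO 8859-1 bytes.
# \xE9 = e-acute, \xE0 = a-grave, \xBF = inverted question mark.
my $latin1 = "d\xE9j\xE0 vu \xBFqu\xE9?";

# Decode the bytes into Perl characters, where \w follows Unicode rules.
my $chars = decode( 'iso-8859-1', $latin1 );
my @words = ( $chars =~ /\b(\w+)\b/g );

# Encode each word back to ISO 8859-1 bytes for output.
my @out = map { encode( 'iso-8859-1', $_ ) } @words;

binmode STDOUT;    # write the bytes through untouched
print "$_\n" for @out;
```

Redirect the output to a file and view it as Latin-1. The accented letters should stay inside their words, while the inverted question mark acts as a word boundary.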
Re: ISO 8859-1 characters and \w \b etc. by Joost (Canon) on Jun 27, 2004 at 17:42 UTC
This depends on the language, and apparently can conflict with unicode, but take a look at perldoc perllocale (specifically, the section on LC_CTYPE). I haven't used locales at all, so I can't really help you any further than that.

Edit: fixed url.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^2: ISO 8859-1 characters and \w \b etc. by dpavlin (Friar) on Jun 27, 2004 at 22:20 UTC
I know that this node somewhat duplicates things said above, but as an ISO-8859-2 user, I would like to emphasize that you will need to set up LC_CTYPE and use locale. You should also consider setting up LC_COLLATE so that sort also uses the locale.

You will need to have the locale installed on your system. Try setting the environment variables and running perl -v to see if perl picks up the locale (it will complain if it doesn't).

Having said that, locale setup is done per language and country (that's why the locale for Croatia is hr_HR and for the USA en_US). You might also use locale aliases (defined in /usr/lib/X11/locale/locale.alias).

It might be enough just to add use locale; in your code. If you need more, an example follows (for Croatia with its funny accented characters; we use ISO-8859-2, but the principle is the same).

`#!/usr/bin/perl -w
use strict;
use locale;
use POSIX qw(locale_h);

# character classes (\w, \W) and sort order both follow the Croatian locale
setlocale(LC_CTYPE, 'hr_HR');
setlocale(LC_COLLATE, 'hr_HR');

my $text = "foo čevapčić bar";
print join(", ",sort split(/\W/,$text)),"\n";`

If you don't mind changing the system-wide locale, you can also set the environment variables in your /etc/profile and Apache's httpd.conf and drop setlocale from the code.

2share!2flame...
Re^2: ISO 8859-1 characters and \w \b etc. by Melroch (Acolyte) on Jun 27, 2004 at 17:57 UTC
Thanks. Truth to say I have looked at `perldoc perllocale` several times and not got any wiser, I'm afraid. I guess what I'm really looking for is a plain English description of how to get and set locales. The workaround of using numerals instead of letters only gets you so far... /Melroch
Re^3: ISO 8859-1 characters and \w \b etc. by Joost (Canon) on Jun 27, 2004 at 18:24 UTC
See the ENVIRONMENT section in perllocale, and maybe your local manpage for "locale". You might have to install the extra locales you want to use (my system only has the "C" and "POSIX" locales, apparently).

Basically, you can set a couple of environment variables, and they will determine the locale your perl program runs under. Which locales are supported is system-dependent; I can list mine using "locale -a".

Hope this helps, Joost.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
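As a plain illustration of "getting and setting" a locale from inside Perl, here is a small sketch using POSIX::setlocale. The helper name first_available_locale and the candidate locale names are assumptions for the example; the names that actually exist on your system are whatever "locale -a" reports.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: try each candidate name in turn and return the
# first one the system accepts, or nothing if none is accepted.
sub first_available_locale {
    my @candidates = @_;
    for my $name (@candidates) {
        # setlocale returns undef when it cannot set the requested locale
        my $got = setlocale( LC_CTYPE, $name );
        return $got if defined $got;
    }
    return;
}

# An empty name means "use whatever the environment says"
# (the LC_ALL, LC_CTYPE and LANG variables).
my $from_env = setlocale( LC_CTYPE, "" );
print "LC_CTYPE from environment: ",
    ( defined $from_env ? $from_env : "(not available)" ), "\n";

# Example names only; the "C" locale is the portable fallback.
my $chosen = first_available_locale( "en_US.ISO8859-1", "en_US.ISO-8859-1", "C" );
print "Using LC_CTYPE locale: $chosen\n";
```

Note that setting the locale this way only affects \w and friends in code that also says use locale;, as described in perllocale.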
Re: ISO 8859-1 characters and \w \b etc. by theorbtwo (Prior) on Jun 28, 2004 at 01:44 UTC
Here's the answer, and it's a bit confusing. A Perl string has a magic bit attached to it, the UTF8 flag. If it is off, your string is assumed to be in latin1. That's fairly clear. What's not clear is that when the string is a bunch of utf8 chars, ö is considered a letter (for example), but when it's latin1 characters, ö is not a letter (unless you are using `locale`). The solution is to make your strings utf8 strings, by using Encode.

Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).
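The effect of the UTF8 flag described above can be seen with a small sketch. It uses the built-in utf8::upgrade and utf8::is_utf8 functions rather than Encode, purely to keep the demonstration short. The variable names are made up, and it assumes neither use locale nor the unicode_strings feature is in effect.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One character, o-umlaut (code point 0xF6), stored as a single Latin-1 byte.
my $byte_str = "\xF6";

# A copy holding the same character, upgraded to Perl's internal UTF-8 form.
my $char_str = $byte_str;
utf8::upgrade($char_str);

# The two strings compare equal; only the internal flag differs.
for my $s ( $byte_str, $char_str ) {
    printf "UTF8 flag %s: \\w %s\n",
        ( utf8::is_utf8($s) ? "on " : "off" ),
        ( $s =~ /\w/ ? "matches" : "does not match" );
}
```

Under those assumptions, the flagged copy should match \w and the plain byte string should not, even though eq treats them as the same string.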