Polish Characters

ReiVo has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all,
I am having a lot of trouble analyzing a piece of text in
Polish. I want to read all the words of a text and count them.
OK, what you normally do is set up pattern matching
using something like \w. That does not work: it leaves out the
special Polish letters, such as the crossed l (ł) and friends.
Next approach,
use POSIX qw(locale_h) ;
setlocale(LC_ALL,"Polish_Poland") or die "Could not set locale";
That runs and does not complain, but has the same effect as before.
I am running the ActiveState distro on WinXP.
Thanks a lot for every hint
Reinhard

Re: Polish Characters by Corion (Patriarch) on May 20, 2010 at 14:42 UTC
You will have to think about what encoding your source code is in, what encoding your target data is in, and what encoding your output file should be in. Then you will need to use Encode (decode for input, encode for output) to transform your data between the wanted encodings, and the utf8 pragma (or encoding) to tell Perl about the encoding of your source code.
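A minimal sketch of that decode/encode round trip, using the core Encode module. The thread never says which encoding the input file uses, so CP1250 (the usual Windows code page for Polish) is only an assumption here, as are the sample bytes:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they might arrive from a CP1250 file (assumed input encoding).
# In CP1250, \xB3 is "ł", \xF3 is "ó" and \x9F is "ź", so this spells "łódź".
my $bytes = "\xB3\xF3d\x9F";

# Decode the bytes into a Perl character string before doing any string work.
my $chars = decode('cp1250', $bytes);

# Encode the characters again for output, here as UTF-8.
my $utf8 = encode('UTF-8', $chars);

printf "%d bytes in, %d characters, %d bytes out\n",
    length($bytes), length($chars), length($utf8);
# prints "4 bytes in, 4 characters, 7 bytes out"
```

Regex matching and counting should happen on `$chars`, the decoded string, not on either of the byte strings.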
Re: Polish Characters by moritz (Cardinal) on May 20, 2010 at 15:15 UTC
I tend to avoid locales, and rely on Unicode semantics for regex matching, because it involves less opaque magic, and also recognizes word characters from other languages (which I consider a feature). As Corion mentioned, you have to find out what encodings your input data and script are in, and decode the data before using string operations on it. See encodings and Unicode in Perl and the Perl Unicode and UTF-8 wikibook for detailed information.
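A small sketch of why decoding matters for `\w`, assuming UTF-8 input and a script that enables neither `use locale` nor the `unicode_strings` feature (the sample word is made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);

# "mąka" as raw UTF-8 bytes; "ą" is the two bytes \xC4\x85.
my $bytes = "m\xC4\x85ka";

# On undecoded bytes, \w only sees the ASCII letters, so the word breaks apart.
my @from_bytes = $bytes =~ /(\w+)/g;

# After decoding, "ą" is a single character and \w follows Unicode rules.
my $chars = decode('UTF-8', $bytes);
my @from_chars = $chars =~ /(\w+)/g;

printf "undecoded: %d fragments, decoded: %d word\n",
    scalar(@from_bytes), scalar(@from_chars);
# prints "undecoded: 2 fragments, decoded: 1 word"
```

This is the same effect the original question describes: the Polish letters drop out of the match until the data has been decoded.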
Re: Polish Characters by almut (Canon) on May 20, 2010 at 15:05 UTC
"pattern matching using something like \w"
In addition to what Corion said, note that there are also various Unicode category properties available via escapes, e.g. `\p{L}` (or long: `\p{Letter}`) for letters, etc., that you can make use of once you've properly decoded your input. See perlunicode for details.
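As a rough sketch, word counting with `\p{L}` could look like the following. The helper name `count_words` and the sample text are hypothetical, and the script is assumed to be saved as UTF-8 so that `use utf8` applies to the literal:

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Hypothetical helper: count how often each word occurs in already-decoded text.
sub count_words {
    my ($text) = @_;
    my %count;
    # \p{L}+ matches a run of Unicode letters; unlike \w it excludes digits and "_".
    $count{ lc $1 }++ while $text =~ /(\p{L}+)/g;
    return \%count;
}

my $counts = count_words("Żółw i żółw, łódź 2010");

binmode(STDOUT, ':encoding(UTF-8)');
printf "%s: %d\n", $_, $counts->{$_} for sort keys %$counts;
```

Because the string is a decoded character string, `lc` folds "Żółw" and "żółw" into the same key, and "2010" is skipped since digits are not letters.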
Re: Polish Characters by zby (Vicar) on May 21, 2010 at 10:13 UTC
Hi there, you posted that to the krakow.pm mailing list and I answered you there. Did you not receive my email? Maybe the list server is not working correctly. Anyway, here is the example that I typed for you:
`use utf8; binmode(STDOUT, ":utf8"); my $string = "azsc"; while ( $string =~ /(\w)/g ) { print $1; } print "\n";`
which prints `azsc`. Replace the 'azsc' with UTF-8 encoded characters and it should work; unfortunately PerlMonks mangles the characters when I try to input them here.
This does not depend on the locale; instead it uses character semantics, which I think is the more modern approach. The important point is that your data needs to be correctly decoded into characters. Here I used the utf8 pragma so I could put the characters into the program text, but if you read the data from outside sources you need to decode it. This is covered in multiple online sources, for example:
`open(my $fh, "<:encoding(UTF-8)", "filename") || die "can't open UTF-8 encoded filename: $!";`
This snippet is part of the documentation for the 'open' Perl function.
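Putting the pieces together, a sketch of counting words read from a file might look like this. It assumes the file is UTF-8; the sample file is written to a temporary location first only so the example is self-contained, and its contents are invented:

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8
use File::Temp qw(tempfile);

# Write a small UTF-8 sample file so the sketch can run on its own.
my ($out, $filename) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)');
print {$out} "Żółw je sałatę.\nŻółw śpi.\n";
close($out) or die "can't close $filename: $!";

# Reading side: the layer decodes the UTF-8 bytes into characters.
open(my $fh, '<:encoding(UTF-8)', $filename)
    or die "can't open UTF-8 encoded $filename: $!";

my $total = 0;
my %seen;
while (my $line = <$fh>) {
    # \w+ now matches whole Polish words because $line holds characters.
    while ($line =~ /(\w+)/g) {
        $total++;
        $seen{ lc $1 }++;
    }
}
close($fh);

binmode(STDOUT, ':encoding(UTF-8)');
printf "%d words, %d distinct\n", $total, scalar keys %seen;
# prints "5 words, 4 distinct"
```

For a file in another encoding, the layer name would change accordingly, e.g. `<:encoding(cp1250)` for a Windows file, if that is indeed what the data uses.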

