Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Identifying scripts (writing systems)

by AppleFritter (Vicar)
on Sep 16, 2014 at 21:32 UTC ( [id://1100846]=CUFP: print w/replies, xml ) Need Help??

Dear monks and nuns, priests and scribes, popes and antipopes, saints and stowaways lurking in the monastery, lend me your ears. (I promise I'll return them.) I'm still hardly an experienced Perl (user|programmer|hacker), but allow me to regale you with a story of how Perl has been helping me Get Things Done™; a Cool Use for Perl, or so I think.

I was recently faced with the problem of producing, given a number of lines each written in a specific script (i.e. writing system; Latin, Katakana, Cyrillic etc.), a breakdown of scripts used and how often they appeared. Exactly the sort of problem Perl was made for - and thanks to regular expressions and Unicode character classes, a breeze, right?

I started by hardcoding a number of scripts to match my snippets of text against:

my %scripts; foreach (@lines) { my $script = m/^\p{Script=Latin}*$/ ? "Latin" : m/^\p{Script=Cyrillic}*$/ ? "Cyrillic" : m/^\p{Script=Han}*$/ ? "Han" : # ... "(unknown)"; $scripts{$script}++; }

Obviously there's a lot of repetition going on there, and though I had a list of scripts for my sample data, I wasn't sure new and uncontemplated scripts wouldn't show up in the future. So why not make a list of all possible scripts, and replace the hard-coded list with a loop?

my %scripts; LINE: foreach my $line (@lines) { foreach my $script (@known_scripts) { next unless $line =~ m/^\p{Script=$script}*$/; $scripts{$script}++; next LINE; } $scripts{'(unknown)'}++; }

So far, so good, but now I needed a list of the scripts that Perl knew about. Not a problem, I thought, I'll just check perluniprops; the list of properties Perl knows about was staggering, but I eventually decided that any property of the form "\p{Script: ...}" would qualify, so long as it had short forms listed (which I took as an indication that that particular property was the "canonical" form for the script in question). After some reading and typing and double-checking, I ended up with a fairly long list:

my @known_scripts = ( "Arabic", "Armenian", "Avestan", "Balinese", "Bamum", "Batak", "Bengali", "Bopomofo", "Brahmi", "Br +aille", "Buginese", "Buhid", "Canadian_Aboriginal", "Carian", "Chakma", "Cham", "Cherokee", "Coptic", "Cuneiform", "Cypriot", "Cyrillic", # ... );

Unfortunately, when I ran the resulting script, Perl complained:

Can't find Unicode property definition "Script=Chakma" at (...) line ( +...)

What had gone wrong? Versions, that's what: I'd looked at the perluniprops page on perl.org, documenting Perl 5.20.0, but this particular Perl was 5.14.2 and didn't know all the scripts that the newer version did, thanks to being built against an older Unicode version. Now, I could've just looked at the locally-installed version of the same perldoc page, but - wouldn't it be nice if the script automatically adapted itself to the Perl version it ran on? I sure reckoned it'd be.

What scripts DID the various Perl versions recognize, anyway? What I ended up doing (perhaps there's an easier way) was to look at lib/unicore/Scripts.txt for versions 5.8, 5.10, ..., 5.20 in the Perl git repo (I skipped 5.6 and earlier, because a) the relevant file didn't exist in the tree yet back then, and b) those versions are ancient, anyway). And by "look at", I mean download (as scripts-58.txt etc.), and then process:

$ for i in 8 10 12 14 16 18 20; do perl scripts.pl scripts-5$i.txt >5$ +i.lst; done $ for i in 8 10 12 14 16 18; do diff --unchanged-line-format= --new-li +ne-format=%L 5$i.lst 5$((i+2)).lst >5$((i+2)).new; done $

scripts.pl was a little helper script to extract script information (apologies for the confusing terminology, BTW):

#!/usr/bin/perl use strict; use warnings; use feature qw/say/; my %scripts; while(<>) { next unless m/; ([A-Za-z_]*) #/; $scripts{$1}++; } $, = "\n"; say sort { $a cmp $b } map { $_ = ucfirst lc; $_ =~ s/(?<=_)(.)/uc $1/ +ge; qq/"$_"/ } keys %scripts;

I admit, I got lazy at this point and manually combined those files (58.lst, as well as 510.new, 512.new etc.) into a hash holding all the information, instead of having a script output it. Nonetheless, once this was done, I could easily load all the right scripts for a given Perl version:

# New Unicode scripts added in Perl 5.xx my %uniscripts = ( '8' => [ "Arabic", "Armenian", "Bengali", "Bopomofo", "Buhid", "Canadian_Aboriginal", "Cherokee", "Cyrillic", "Deseret", "Devanagari", "Ethiopic", "Georgian", "Gothic", "Greek", "Guja +rati", "Gurmukhi", "Han", "Hangul", "Hanunoo", "Hebrew", "Hiragana", "Inherited", "Kannada", "Katakana", "Khmer", "Lao", "Latin", "Malayalam", "Mongolian", "Myanmar", "Ogham", "Old_Italic", "O +riya", "Runic", "Sinhala", "Syriac", "Tagalog", "Tagbanwa", "Tamil", "Telugu", "Thaana", "Thai", "Tibetan", "Yi" ], '10' => [ "Balinese", "Braille", "Buginese", "Common", "Coptic", "Cuneif +orm", "Cypriot", "Glagolitic", "Kharoshthi", "Limbu", "Linear_B", "New_Tai_Lue", "Nko", "Old_Persian", "Osmanya", "Phags_Pa", "Phoenician", "Shavian", "Syloti_Nagri", "Tai_Le", "Tifinagh", "Ugaritic" ], '12' => [ "Avestan", "Bamum", "Carian", "Cham", "Egyptian_Hieroglyphs", "Imperial_Aramaic", "Inscriptional_Pahlavi", "Inscriptional_Parthian", "Javanese", "Kaithi", "Kayah_Li", "Lepcha", "Lisu", "Lycian", "Lydian", "Meetei_Mayek", "Ol_Chik +i", "Old_South_Arabian", "Old_Turkic", "Rejang", "Samaritan", "Saurashtra", "Sundanese", "Tai_Tham", "Tai_Viet", "Vai" ], '14' => [ "Batak", "Brahmi", "Mandaic" ], '16' => [ "Chakma", "Meroitic_Cursive", "Meroitic_Hieroglyphs", "Miao", "Sharada", "Sora_Sompeng", "Takri" ], '18' => [ ], '20' => [ ], ); (my $ver = $^V) =~ s/^v5\.(\d+)\.\d+$/$1/; my @known_scripts; foreach (keys %uniscripts) { next if $ver < $_; push @known_scripts, @{ $uniscripts{$_} }; } print STDERR "Running on Perl $^V, ", scalar @known_scripts, " scripts + known.\n";

The number of scripts Perl supports this way WILL increase again soon, BTW. Perl 5.21.1 bumped the supported Unicode version to 7.0.0, adding another bunch of new scripts as a result:

# tentative! '22' => [ "Bassa_Vah", "Caucasian_Albanian", "Duployan", "Elbasan", "Gra +ntha", "Khojki", "Khudawadi", "Linear_A", "Mahajani", "Manichaean", "Mende_Kikakui", "Modi", "Mro", "Nabataean", "Old_North_Arabia +n", "Old_Permic", "Pahawh_Hmong", "Palmyrene", "Pau_Cin_Hau", "Psalter_Pahlavi", "Siddham", "Tirhuta", "Warang_Citi" ],

But that's still in the future. For now I just tested this on 5.14.2 and 5.20.0 (the two Perls I regularly use); it worked like a charm. All that was left to do was outputting those statistics:

print "Found " . scalar keys(%scripts) . " scripts:\n"; print "\t$_: " , $scripts{$_}, " line(s)\n" foreach(sort { $a cmp $b } + keys %scripts);

(You'll note that in the above two snippets, I'm using print rather than say, BTW. That's intentional: say is only available from Perl 5.10 on, and this script is supposed to be able to run on 5.8 and above.)

Fed some sample data that I'm sure Perlmonks would mangle badly if I tried to post it, this produced the following output:

Running on Perl v5.14.2, 95 scripts known. Found 18 scripts: Arabic: 21 line(s) Bengali: 2 line(s) Cyrillic: 12 line(s) Devanagari: 3 line(s) Georgian: 1 line(s) Greek: 1 line(s) Gujarati: 1 line(s) Gurmukhi: 1 line(s) Han: 29 line(s) Hangul: 3 line(s) Hebrew: 1 line(s) Hiragana: 1 line(s) Katakana: 1 line(s) Latin: 647 line(s) Sinhala: 1 line(s) Tamil: 4 line(s) Telugu: 1 line(s) Thai: 1 line(s)

Problem solved! And not only that, it's futureproof now as well, adapting to additional scripts in my input data, and easily extended when new Perl versions support more scripts, while maintaining backward compatibility.

What could still be done? Several things. First, I should perhaps find out if there's an easy way to get this information from Perl, without actually doing all the above.

Second, while Perl 5.6 and earlier aren't supported right now, they could be. Conveniently, the 3rd edition of Programming Perl documents Perl 5.6; the \p{Script=...} syntax for character classes doesn't exist yet, I think, but one could write \p{In...} instead, e.g. \p{InArabic}, \p{InTamil} and so on. Would this be worth it? Not for me, but the possibility is there if someone else ever had the need to run this on an ancient Perl. (Even more ancient Perls may not have the required level of Unicode support for this, though I wouldn't know for sure.)

Lastly, since the point of this whole exercise was to identify writing systems used for snippets of text, there's room for optimization. Perhaps it would be faster to precompile a regular expression for each script, especially if @lines is very large. Most of the text I'm dealing with is in the Latin script; as such, I should perhaps test for that before anything else, and generally try to prioritize so that lesser-used scripts are pushed further down the list. Since I'm already keeping a running total of how often each script has been seen, this could even be done adaptively, though whether doing so would be worth the overhead in practice is another question, one that could only be answered by measuring.

But neither speed nor support for ancient Perls is crucial to me, so I'm done. This was a fun little problem to work on, and I hope you enjoyed reading about it.

Replies are listed 'Best First'.
Re: Identifying scripts (writing systems)
by Anonymous Monk on Sep 16, 2014 at 22:28 UTC

    Perhaps charscript() and charscripts() from Unicode::UCD could help?

      Ah, I had a hunch there would be another solution, one that I'd overlooked! Indeed, you're right, the following also works (and is faster to boot):

      use Unicode::UCD qw/charscript/; # ... my %scripts; $scripts{charscript ord substr $_, 0, 1}++ foreach (@lines);

      And it appears that Unicode::UCD was added to the Perl core in 5.8, so I can't even claim victory on that front... on the other hand, I enjoyed working on this, so although I could've saved my time, I didn't strictly waste it. :)

Re: Identifying scripts (writing systems)
by RonW (Parson) on Sep 16, 2014 at 21:46 UTC
    Very well done, brother (sister?)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://1100846]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (2)
As of 2024-04-26 05:21 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found