Identifying scripts (writing systems)

Dear monks and nuns, priests and scribes, popes and antipopes, saints and stowaways lurking in the monastery, lend me your ears. (I promise I'll return them.) I'm still hardly an experienced Perl (user|programmer|hacker), but allow me to regale you with a story of how Perl has been helping me Get Things Done™; a Cool Use for Perl, or so I think.

I was recently faced with the problem of producing, given a number of lines each written in a specific script (i.e. writing system; Latin, Katakana, Cyrillic etc.), a breakdown of scripts used and how often they appeared. Exactly the sort of problem Perl was made for - and thanks to regular expressions and Unicode character classes, a breeze, right?

I started by hardcoding a number of scripts to match my snippets of text against:

my %scripts;
foreach (@lines) {
    my $script = 
        m/^\p{Script=Latin}*$/      ? "Latin"       :
        m/^\p{Script=Cyrillic}*$/   ? "Cyrillic"    :
        m/^\p{Script=Han}*$/        ? "Han"         :
        # ...
        "(unknown)";
    
    $scripts{$script}++;
}

Obviously there's a lot of repetition going on there, and though I had a list of scripts for my sample data, I wasn't sure new and uncontemplated scripts wouldn't show up in the future. So why not make a list of all possible scripts, and replace the hard-coded list with a loop?

my %scripts;
LINE: foreach my $line (@lines) {
    foreach my $script (@known_scripts) {
        next unless $line =~ m/^\p{Script=$script}*$/;
        
        $scripts{$script}++;
        next LINE;
    }
    
    $scripts{'(unknown)'}++;
}

So far, so good, but now I needed a list of the scripts that Perl knew about. Not a problem, I thought, I'll just check perluniprops; the list of properties Perl knows about was staggering, but I eventually decided that any property of the form "\p{Script: ...}" would qualify, so long as it had short forms listed (which I took as an indication that that particular property was the "canonical" form for the script in question). After some reading and typing and double-checking, I ended up with a fairly long list:

my @known_scripts = (
    "Arabic", "Armenian", "Avestan",
    "Balinese", "Bamum", "Batak", "Bengali", "Bopomofo", "Brahmi",
    "Braille", "Buginese", "Buhid",
    "Canadian_Aboriginal", "Carian", "Chakma", "Cham", "Cherokee",
    "Coptic", "Cuneiform", "Cypriot", "Cyrillic",
    # ...
);

Unfortunately, when I ran the resulting script, Perl complained:

Can't find Unicode property definition "Script=Chakma" at (...) line (...)

What had gone wrong? Versions, that's what: I'd looked at the perluniprops page on perl.org, documenting Perl 5.20.0, but this particular Perl was 5.14.2 and didn't know all the scripts that the newer version did, thanks to being built against an older Unicode version. Now, I could've just looked at the locally-installed version of the same perldoc page, but - wouldn't it be nice if the script automatically adapted itself to the Perl version it ran on? I sure reckoned it'd be.
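
One way to get that kind of self-adapting behaviour is to ask the running Perl directly whether it accepts a given script name. The following is only a minimal sketch under assumptions: `script_is_known` is a hypothetical helper name, and the candidate list is just an illustration. It relies on the fact that an unknown property makes the match die, which `eval` can trap.

```perl
use strict;
use warnings;

# Hypothetical helper: true if this Perl accepts \p{Script=$name}.
sub script_is_known {
    my ($name) = @_;
    return 0 unless $name =~ /^\w+$/;    # only plain identifiers
    # Match against a real character, so that Perls which resolve the
    # property lazily still have to look it up inside the eval.
    my $ok = eval { my $m = ("a" =~ /^\p{Script=$name}/); 1 };
    return $ok ? 1 : 0;
}

# Illustrative candidates; names unknown to this Perl are dropped.
my @candidates    = qw(Latin Cyrillic Han Chakma);
my @known_scripts = grep { script_is_known($_) } @candidates;
```

On a Perl built against an older Unicode version, "Chakma" would presumably just be filtered out instead of aborting the program.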

What scripts DID the various Perl versions recognize, anyway? What I ended up doing (perhaps there's an easier way) was to look at lib/unicore/Scripts.txt for versions 5.8, 5.10, ..., 5.20 in the Perl git repo (I skipped 5.6 and earlier, because a) the relevant file didn't exist in the tree yet back then, and b) those versions are ancient, anyway). And by "look at", I mean download (as scripts-58.txt etc.), and then process:

$ for i in 8 10 12 14 16 18 20; do perl scripts.pl scripts-5$i.txt >5$i.lst; done
$ for i in 8 10 12 14 16 18; do diff --unchanged-line-format= --new-line-format=%L 5$i.lst 5$((i+2)).lst >5$((i+2)).new; done
$

scripts.pl was a little helper script to extract script information (apologies for the confusing terminology, BTW):

#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;

my %scripts;
while(<>) {
    next unless m/; ([A-Za-z_]*) #/;
    $scripts{$1}++;
}

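# Normalize capitalization to Title_Case, wrap each name in double quotes,
# and print the sorted names one per line.
$, = "\n";
say sort { $a cmp $b }
    map { $_ = ucfirst lc; $_ =~ s/(?<=_)(.)/uc $1/ge; qq/"$_"/ }
    keys %scripts;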

I admit, I got lazy at this point and manually combined those files (58.lst, as well as 510.new, 512.new etc.) into a hash holding all the information, instead of having a script output it. Nonetheless, once this was done, I could easily load all the right scripts for a given Perl version:

# New Unicode scripts added in Perl 5.xx
my %uniscripts = (
    '8' => [
        "Arabic", "Armenian", "Bengali", "Bopomofo", "Buhid",
        "Canadian_Aboriginal", "Cherokee", "Cyrillic", "Deseret",
        "Devanagari", "Ethiopic", "Georgian", "Gothic", "Greek", "Gujarati",
        "Gurmukhi", "Han", "Hangul", "Hanunoo", "Hebrew", "Hiragana",
        "Inherited", "Kannada", "Katakana", "Khmer", "Lao", "Latin",
        "Malayalam", "Mongolian", "Myanmar", "Ogham", "Old_Italic", "Oriya",
        "Runic", "Sinhala", "Syriac", "Tagalog", "Tagbanwa", "Tamil",
        "Telugu", "Thaana", "Thai", "Tibetan", "Yi"
    ],
    '10' => [
        "Balinese", "Braille", "Buginese", "Common", "Coptic", "Cuneiform",
        "Cypriot", "Glagolitic", "Kharoshthi", "Limbu", "Linear_B",
        "New_Tai_Lue", "Nko", "Old_Persian", "Osmanya", "Phags_Pa",
        "Phoenician", "Shavian", "Syloti_Nagri", "Tai_Le", "Tifinagh",
        "Ugaritic"
    ],
    '12' => [
        "Avestan", "Bamum", "Carian", "Cham", "Egyptian_Hieroglyphs",
        "Imperial_Aramaic", "Inscriptional_Pahlavi",
        "Inscriptional_Parthian", "Javanese", "Kaithi", "Kayah_Li",
        "Lepcha", "Lisu", "Lycian", "Lydian", "Meetei_Mayek", "Ol_Chiki",
        "Old_South_Arabian", "Old_Turkic", "Rejang", "Samaritan",
        "Saurashtra", "Sundanese", "Tai_Tham", "Tai_Viet", "Vai"
    ],
    '14' => [
        "Batak", "Brahmi", "Mandaic"
    ],
    '16' => [
        "Chakma", "Meroitic_Cursive", "Meroitic_Hieroglyphs", "Miao",
        "Sharada", "Sora_Sompeng", "Takri"
    ],
    '18' => [
    ],
    '20' => [
    ],
);

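# Extract the minor version (e.g. 14 for v5.14.2). Going through sprintf's
# %vd works whether $^V is a version object (5.10+) or a v-string (5.8).
my ($ver) = sprintf("%vd", $^V) =~ /^5\.(\d+)/;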

my @known_scripts;
foreach (keys %uniscripts) {
    next if $ver < $_;
    push @known_scripts, @{ $uniscripts{$_} };
}

printf STDERR "Running on Perl v%vd, %d scripts known.\n", $^V, scalar @known_scripts;
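
The manual merge of the .lst and .new files could also be scripted. What follows is a rough sketch and not what I actually ran: `scripts_added_per_version` is a hypothetical helper that takes the complete list of names per minor version and works out which names first appear in each one, which is the shape the hash above has.

```perl
use strict;
use warnings;

# Hypothetical helper. Input:  { minor_version => [all names known to it] }
#                      Output: { minor_version => [names first seen there] }
sub scripts_added_per_version {
    my ($all) = @_;
    my (%seen, %added);
    # Walk the versions in numeric order so "first seen" is meaningful.
    for my $ver (sort { $a <=> $b } keys %$all) {
        my @new = grep { !$seen{$_}++ } sort @{ $all->{$ver} };
        $added{$ver} = \@new;
    }
    return \%added;
}

# Tiny illustrative input, not real data.
my $added = scripts_added_per_version({
    8  => [qw(Latin Greek)],
    10 => [qw(Latin Greek Coptic)],
});
```

Names that disappear between versions are not tracked here; only additions are recorded.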

The number of scripts Perl supports this way WILL increase again soon, BTW. Perl 5.21.1 bumped the supported Unicode version to 7.0.0, adding another bunch of new scripts as a result:

# tentative!
    '22' => [
        "Bassa_Vah", "Caucasian_Albanian", "Duployan", "Elbasan", "Grantha",
        "Khojki", "Khudawadi", "Linear_A", "Mahajani", "Manichaean",
        "Mende_Kikakui", "Modi", "Mro", "Nabataean", "Old_North_Arabian",
        "Old_Permic", "Pahawh_Hmong", "Palmyrene", "Pau_Cin_Hau",
        "Psalter_Pahlavi", "Siddham", "Tirhuta", "Warang_Citi"
    ],

But that's still in the future. For now I just tested this on 5.14.2 and 5.20.0 (the two Perls I regularly use); it worked like a charm. All that was left to do was outputting those statistics:

print "Found " . scalar keys(%scripts) . " scripts:\n";
print "\t$_: ", $scripts{$_}, " line(s)\n" foreach sort { $a cmp $b } keys %scripts;

(You'll note that in the above two snippets, I'm using print rather than say, BTW. That's intentional: say is only available from Perl 5.10 on, and this script is supposed to be able to run on 5.8 and above.)

Fed some sample data that I'm sure Perlmonks would mangle badly if I tried to post it, this produced the following output:

Running on Perl v5.14.2, 95 scripts known.
Found 18 scripts:
        Arabic: 21 line(s)
        Bengali: 2 line(s)
        Cyrillic: 12 line(s)
        Devanagari: 3 line(s)
        Georgian: 1 line(s)
        Greek: 1 line(s)
        Gujarati: 1 line(s)
        Gurmukhi: 1 line(s)
        Han: 29 line(s)
        Hangul: 3 line(s)
        Hebrew: 1 line(s)
        Hiragana: 1 line(s)
        Katakana: 1 line(s)
        Latin: 647 line(s)
        Sinhala: 1 line(s)
        Tamil: 4 line(s)
        Telugu: 1 line(s)
        Thai: 1 line(s)

Problem solved! And not only that, it's futureproof now as well, adapting to additional scripts in my input data, and easily extended when new Perl versions support more scripts, while maintaining backward compatibility.

What could still be done? Several things. First, I should perhaps find out if there's an easy way to get this information from Perl, without actually doing all the above.
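
One candidate for that easier way is the core module Unicode::UCD, whose `charscripts()` function returns a hash keyed by script name. The sketch below is untested against my data: `ucd_script_names` is a hypothetical wrapper, and I'm assuming the names it yields are all accepted by \p{Script=...} on the same Perl, which would need checking.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscripts);

# Hypothetical wrapper: names of all scripts in this Perl's Unicode database.
sub ucd_script_names {
    my $ranges = charscripts();    # hashref: script name => code point ranges
    return sort keys %$ranges;
}

my @known_scripts = ucd_script_names();
printf STDERR "%d scripts known to Unicode::UCD.\n", scalar @known_scripts;
```

If the assumption holds, this would replace the per-version table entirely, since the module ships with each Perl and reflects its Unicode version.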

Second, while Perl 5.6 and earlier aren't supported right now, they could be. Conveniently, the 3rd edition of Programming Perl documents Perl 5.6; the \p{Script=...} syntax for character classes doesn't exist yet, I think, but one could write \p{In...} instead, e.g. \p{InArabic}, \p{InTamil} and so on. Would this be worth it? Not for me, but the possibility is there if someone else ever had the need to run this on an ancient Perl. (Even more ancient Perls may not have the required level of Unicode support for this, though I wouldn't know for sure.)

Lastly, since the point of this whole exercise was to identify writing systems used for snippets of text, there's room for optimization. Perhaps it would be faster to precompile a regular expression for each script, especially if @lines is very large. Most of the text I'm dealing with is in the Latin script; as such, I should perhaps test for that before anything else, and generally try to prioritize so that lesser-used scripts are pushed further down the list. Since I'm already keeping a running total of how often each script has been seen, this could even be done adaptively, though whether doing so would be worth the overhead in practice is another question, one that could only be answered by measuring.
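
As a rough illustration of the first two ideas (not something I have benchmarked): `make_classifier` below is a hypothetical helper that compiles one regex per script up front with `qr//` and tries the scripts in descending order of previously seen counts. The counts in the example are taken from the output above; the three script names are just a sample.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure mapping a line to a script name.
# $names  - arrayref of script names to try
# $counts - optional hashref of earlier tallies, used only for ordering
sub make_classifier {
    my ($names, $counts) = @_;
    $counts ||= {};
    # Most frequently seen scripts first; ties broken alphabetically.
    my @ordered = sort {
        ($counts->{$b} || 0) <=> ($counts->{$a} || 0) or $a cmp $b
    } @$names;
    # Compile each pattern once instead of once per line and script.
    my @matchers = map { [ $_, qr/^\p{Script=$_}*$/ ] } @ordered;
    return sub {
        my ($line) = @_;
        for my $m (@matchers) {
            return $m->[0] if $line =~ $m->[1];
        }
        return '(unknown)';
    };
}

my $classify = make_classifier([qw(Cyrillic Han Latin)],
                               { Latin => 647, Han => 29, Cyrillic => 12 });
my $script   = $classify->("abc");
```

As in the original loop, an empty line matches whichever script is tried first, so the ordering affects where empty lines are counted.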

But neither speed nor support for ancient Perls is crucial to me, so I'm done. This was a fun little problem to work on, and I hope you enjoyed reading about it.
