AppleFritter

Howdy, partner! Name's Apple Fritter, pleasure to meet y'all! I use Perl, but I don't know that much about it (yet). I'm trying to change that, so I frequent the Monastery, reading others' answers and code to learn, and providing my own answers and code to hone my skills.

If I come across useful advice, tips, modules, code snippets, articles etc., I usually add them to my home node (which you are reading right now) for future reference. Maybe you'll find them useful, too!

Note: I'm not active on Perlmonks anymore. I may still update my home node when I come across items worth adding.

N.B. when crossposting to several sites, it is considered polite to inform readers of this and provide links to avoid unnecessary/duplicated effort.

Posts by AppleFritter

Safely capturing the output of an external program in Seekers of Perl Wisdom (4 direct replies) by AppleFritter on Mar 08, 2020 at 19:52

Esteemed monks, I'm sure this has been asked (and answered) before, but I can't seem to find said question. I'd like to call an external program from within Perl, passing it some arguments, and capture its output. Usually I'd reach for backticks or the `qx//` operator, but the arguments that need to be passed come from user-supplied data, and while the program being called itself should be safe to invoke, there's the issue of the shell and its shenanigans. To give a bit more context, I'm working with a TeX installation and need to call kpsewhich (a wrapper around the kpathsea library, which will help you locate various files that TeX will make use of). So I'd want to get the output of, say, `kpsewhich cmr10.tfm`; but the name of the file I'm looking up comes from a user-supplied file I have no control over, and I'd rather not feed `kpsewhich cmr10.tfm ; evil_things_go_here` to the shell. (You get the idea.) As far as I'm aware `system` and `exec` have "safe" invocations that will avoid the shell (even on braindead OSes, like Windows). Does `qx//`? Or for that matter, is there another (different, possibly better) way to locate TeX's files? A Perl wrapper for the kpathsea library, perhaps? (This manpage hints that such a thing exists, but it's not on CPAN AFAICT.) Thanks.
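For what it's worth, one common way to get shell-free capture is the list form of a piped `open`, which hands each argument to the program verbatim. The following is only a sketch: `capture_output` is a made-up helper name, and perl itself (`$^X`) stands in for kpsewhich so that the demo runs anywhere. As far as I know, the list form of pipe open only works on Windows from Perl 5.22 on.

```perl
use strict;
use warnings;

# Run a program without involving the shell and return its output lines.
# Caveat: with no extra arguments this falls back to the three-argument
# form, which may use the shell if $program contains metacharacters.
sub capture_output {
    my ($program, @args) = @_;
    open(my $pipe, '-|', $program, @args)
        or die "Cannot run $program: $!";
    my @lines = <$pipe>;
    close($pipe) or warn "$program exited with status $?\n";
    chomp @lines;
    return @lines;
}

# Demo: perl itself ($^X) stands in for kpsewhich and echoes its argument
my $untrusted = 'cmr10.tfm ; evil_things_go_here';
my @out = capture_output($^X, '-e', 'print "$ARGV[0]\n"', $untrusted);
print "$_\n" for @out;
```

The semicolon arrives as part of a single argument instead of being interpreted as a command separator. In the TeX scenario the call would presumably look like `capture_output('kpsewhich', $filename)`.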
Accessing SQLite databases within ZIP files in Seekers of Perl Wisdom (7 direct replies) by AppleFritter on Oct 01, 2017 at 07:22

Dearest life forms lurking in the Monastery! I'm trying to process resource files produced by a third-party application. These resource files are actually ZIP files containing, among other things: an SQLite database; a bunch of binary blobs (stored as file entries in the ZIP archive, rather than as BLOBs in the SQLite DB); and a JSON file mapping resource identifiers used in the DB to the binaries' filenames. I'd like to access all this data. I'd also like to do this in the easiest, DWIMiest, most natural manner possible. The most straightforward way is of course to extract the ZIP file, and then use DBI, JSON::XS and whatever modules are appropriate to handle the binaries (images, sounds, videos etc). But I'd like to avoid this, if possible; I want to be able to point my script at the ZIP file without having to worry about disk space, clean-up, and all that. There's a variety of modules on CPAN for transparently handling ZIP archives (in fact, IO::Uncompress::Unzip is in core). What I have not found is a way of accessing a database without extracting it to disk first. More precisely, what I'd like to do is either: have DBD::SQLite read the DB directly from the ZIP file, using some kind of transparent intermediary layer; or extract the DB into memory (i.e. a Perl scalar), and then have DBD::SQLite read that. I only need to read the DB, BTW, not modify it, so any complications to do with putting modifications back into the ZIP can safely be ignored. So, my question is: is this possible, using only existing CPAN modules? A cursory search didn't reveal anything useful.
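The "extract into a Perl scalar" half of this can be sketched with core modules only. The snippet below covers just the JSON map (the binary blobs would work the same way); it does not solve the DBD::SQLite part of the question. The member name `resources.json` and its contents are invented for the demo, and `slurp_zip_member` is a hypothetical helper name.

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw(zip $ZipError);
use IO::Uncompress::Unzip qw(unzip $UnzipError);
use JSON::PP              qw(decode_json);

# Read one named member of a ZIP archive into a scalar, without touching disk.
# $zip may be a file name or a reference to a scalar holding the ZIP data.
sub slurp_zip_member {
    my ($zip, $member) = @_;
    my $content;
    unzip($zip => \$content, Name => $member)
        or die "Cannot read $member: $UnzipError";
    return $content;
}

# Demo: build a tiny archive in memory (invented member name and data)
my $json = '{"res_0001":"blobs/0001.png"}';
my $archive;
zip(\$json => \$archive, Name => 'resources.json')
    or die "zip failed: $ZipError";

my $map = decode_json(slurp_zip_member(\$archive, 'resources.json'));
print "$_ => $map->{$_}\n" for sort keys %$map;
```

Passing a real file name instead of `\$archive` should read straight from the ZIP file on disk.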
Faster alternative to Math::Combinatorics in Seekers of Perl Wisdom (6 direct replies) by AppleFritter on Sep 01, 2017 at 09:20

Oh monks of the round table, who dance whene'er they're able, who dine well here in Camelot and eat ham and jam and spam a lot! Can someone recommend a faster alternative to Math::Combinatorics, or maybe suggest a better way of doing the following? I'm trying to generate all multisets (bags) of a specific total "weight" (let's call it w), where each element comes from a given list (of numbers, in this case), and each list element may have multiplicity 0..w in each multiset. In other words, what I'm trying to generate is a list of w-tuples of elements of the given list, but unordered tuples rather than ordered ones. An example may be instructive. Let's say w is 4, and the list is (0, 2, 3). Then I'd like to get the following multisets: `0,0,0,0 0,0,0,2 0,0,0,3 0,0,2,2 0,0,2,3 0,0,3,3 0,2,2,2 0,2,2,3 0,2,3,3 0,3,3,3 2,2,2,2 2,2,2,3 2,2,3,3 2,3,3,3 3,3,3,3` (The order in which the multisets themselves are generated isn't important to me either, BTW. I've only listed them in order for the sake of readability.) Not wanting to implement this myself, I turned to CPAN and found Math::Combinatorics. This works, but it's fairly slow. Here's a (slightly simplified) excerpt from my code: `#!/usr/bin/perl use Modern::Perl '2015'; use Math::Combinatorics; my $states = 4; foreach my $count (1, 2, 3, 4, 7, 8) { say "count=$count"; my $iter = Math::Combinatorics->new( count => $count, data => [ grep { $_ != 1 } (0 .. ($states - 1)) ], frequency => [($count) x ($states - 1)] ); while(my @states = $iter->next_multiset) { say join(",", @states); } }` This produces the desired output, but it takes almost 90 seconds to run for `$states = 4`, and much longer for 5 and up: Read more... (2 kB) 90 seconds wouldn't be so bad, since this is part of a larger script to generate datafiles that only really needs to be run once (to generate the file). But I'd rather not spend days waiting for it to finish for higher values of `$states`. Any suggestions?
Like I said, I'd prefer to stick to CPAN, but I'll take what I can get. Thanks!
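For comparison, the multisets described above are combinations with repetition, and those can be generated directly with a short recursion. This is a hand-rolled sketch rather than a CPAN solution; `multisets` is a made-up name, and I haven't measured how it compares to Math::Combinatorics.

```perl
use strict;
use warnings;

# Return all multisets of size $w drawn from @$items, each as an array ref.
# Elements are picked in non-decreasing index order, so no multiset repeats.
sub multisets {
    my ($w, $items, $start) = @_;
    $start = 0 unless defined $start;
    return ([]) if $w == 0;
    my @result;
    for my $i ($start .. $#$items) {
        push @result, map { [ $items->[$i], @$_ ] }
            multisets($w - 1, $items, $i);
    }
    return @result;
}

my @bags = multisets(4, [0, 2, 3]);
print join(",", @$_), "\n" for @bags;
```

For w = 4 and the list (0, 2, 3) this yields the fifteen multisets listed in the example. It builds the whole list in memory, so for large inputs a callback or iterator variant would be the more sensible shape.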
Size-limited, fitness-based lists in Cool Uses for Perl (3 direct replies) by AppleFritter on Aug 08, 2015 at 19:05

Monks and monkettes! I recently found myself wondering, what are the longest words in the dictionary (`/usr/share/dict`, anyway)? This is easily found out, but it's natural to be interested not just in the longest word but (say) the top ten. And when your dictionary contains (say) eight words of length fifteen and six words of length fourteen, it's also natural to not want to arbitrarily select two of the latter, but list them all. I quickly decided I needed a type of list that would have a concept of the fitness of an item (not necessarily the length of a word), and try not to exceed a maximum size if possible (while retaining some flexibility). My CPAN search-fu is non-existent, but since it sounded like fun, I just rolled my own. Here's the first stab at what is right now called `List::LimitedSize::Fitness` (if anyone's got a better idea for a name, please let me know): Read more... (8 kB) This features both "flexible" and "strict" policies. With the former, fitness classes are guaranteed to never lose items, but the list as a whole might grow beyond the specified maximum size. With the latter, the list is guaranteed to never grow beyond the specified maximum size, but fitness classes might lose items. (Obviously you cannot have it both ways, not in general.) Here's an example of the whole thing in action: Read more... (1091 Bytes) This might output (depending on your dictionary): `$ perl longestwords.pl wordsEn.txt .......... length 21 antienvironmentalists antiinstitutionalists counterclassification electroencephalograms electroencephalograph electrotheraputically gastroenterologically internationalizations mechanotheraputically microminiaturizations microradiographically length 22 counterclassifications counterrevolutionaries electroencephalographs electroencephalography length 23 disestablismentarianism electroencephalographic length 25 antidisestablishmentarian length 28 antidisestablishmentarianism 19 words total (10 requested).`
If you've got any thoughts, tips, comments, rotten tomatoes etc., send them my way! (...actually, forget about the rotten tomatoes.) Also, does anyone think this module would be useful to have on CPAN, in principle if not in its current state?
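To make the "flexible" policy concrete, here is a small batch-style sketch. It is not the code of `List::LimitedSize::Fitness` (which isn't shown here); `fittest_flexible` is a made-up function, and it assumes numeric fitness values where higher means fitter.

```perl
use strict;
use warnings;

# Keep whole fitness classes, fittest first, until at least $max items are
# held ("flexible" policy: classes stay intact, the total may exceed $max).
sub fittest_flexible {
    my ($max, $fitness, @items) = @_;
    my %class;
    push @{ $class{ $fitness->($_) } }, $_ for @items;
    my @kept;
    for my $f (sort { $b <=> $a } keys %class) {
        last if @kept >= $max;
        push @kept, sort @{ $class{$f} };
    }
    return @kept;
}

my @words = qw(a bb cc ddd eee fff gggg);
my @top   = fittest_flexible(3, sub { length $_[0] }, @words);
print "@top\n";
```

With three items requested, the single four-letter word is kept and then the whole class of three-letter words is added, giving four items in total. This is the same effect as the 19 words for 10 requested in the sample output above. A "strict" variant would instead truncate the last class so the total never exceeds the maximum.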
Resetting a flip-flop operator in Seekers of Perl Wisdom (1 direct reply) by AppleFritter on Aug 06, 2015 at 06:52

Greetings, esteemed monks! Allow this humble pony to drink the sweet nectar of knowledge from the font of your collective wisdom. (Or alternatively, how 'bout some hard cider?) I need to read a number of files. In each file, each line holds a piece of data, or a marker indicating the beginning or end of a section; I'm interested only in data in a specific section. Normally, I'd do something like this: `foreach my $HANDLE (@HANDLES) { while(<$HANDLE>) { chomp; next unless /^PP_START$/ .. /^PP_END$/; # process line } }` However, it turns out that in these log files, the section end marker may be omitted if there is no following section: the end of the file itself indicates the end of the section then. This wreaks havoc with the above logic, as the flip-flop operator, not having seen the marker, still evaluates to true when the outer loop moves on to the next file, and wrongly causes lines before the start marker in that file to be processed. Of course it would be trivial to add a flag indicating whether I'm in the right section, and reset that for each file. But doing that would essentially manually emulate the flip-flop operator, which strikes me as less than elegant. So I'm wondering -- is there a way to "reset" the flip-flop operator, as it were, so that it starts returning false again at the beginning of each new file? Read more... (2 kB)
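One way to get this effect without a separate flag is to let end-of-file count as an end marker, so the range always closes on the last line of each file. The sketch below assumes that approach; `section_lines` is a made-up name, and in-memory filehandles stand in for the real log files.

```perl
use strict;
use warnings;

# Collect the lines between PP_START and PP_END (markers included) from
# several handles, treating end-of-file as an implicit PP_END.
sub section_lines {
    my @handles = @_;
    my @found;
    for my $handle (@handles) {
        while (my $line = <$handle>) {
            chomp $line;
            # eof($handle) closes the range on the last line of each file
            next unless $line =~ /^PP_START$/
                     .. ($line =~ /^PP_END$/ || eof($handle));
            push @found, $line;
        }
    }
    return @found;
}

my $first  = "junk\nPP_START\na\nb\n";            # end marker omitted
my $second = "x\nPP_START\nc\nPP_END\ny\n";
open(my $fh1, '<', \$first)  or die $!;
open(my $fh2, '<', \$second) or die $!;
my @found = section_lines($fh1, $fh2);
print join(",", @found), "\n";
```

The line `x` at the top of the second file is skipped even though the first file never closed its section. The right-hand operand is only evaluated while the range is open, so the extra `eof` test costs nothing outside the section.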
"Unrecognized character" while use utf8 is in effect in Seekers of Perl Wisdom 2 direct replies — Read more / Contribute	by AppleFritter on Apr 17, 2015 at 06:03

Oh monks most tawny and tangy, whose wisdom and knowledge of all things Perl is unalienable and indefeasible, help me out, for I'm very much missing the obvious. As you will well know, Perl allows Unicode characters in variable names, so long as `use utf8;` is in effect. So the following snippet works as expected (apologies for the unresolved HTML entities, Perlmonks itself does not handle Unicode properly): `my $人 = "World"; say "Hello, $人";` However, the following does not: `my $&#1F310; = "World"; say "Hello, $&#1F310;";` Perl 5.20.0 complains about this, saying: `Unrecognized character \x{1f310}; marked by <-- HERE after my $<-- HERE near column 5 at 1123740.pl line 9.` This is even though the character is in Unicode 6.3.0, which Perl 5.20.0 supports. So why isn't it working? Help me out, fellow monks.
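As far as I understand perldata, being in Unicode isn't enough: under `use utf8` an identifier has to start with an underscore or a character that is both a word character and `XID_Start`, and symbols such as U+1F310 are neither. The following rough check illustrates the difference; `can_start_identifier` is a made-up helper, and the rule is paraphrased from the documentation, not taken from the Perl source.

```perl
use strict;
use warnings;

# Rough check: can this character start a Perl identifier under "use utf8"?
# (Paraphrase of perldata: underscore, or a \w character with XID_Start.)
sub can_start_identifier {
    my ($char) = @_;
    return 1 if $char eq '_';
    return ($char =~ /\A\p{XID_Start}\z/ && $char =~ /\A\w\z/) ? 1 : 0;
}

# U+4EBA is the ideograph from the working snippet, U+1F310 the globe symbol
for my $code (0x4EBA, 0x1F310) {
    printf "U+%04X: %s\n", $code,
        can_start_identifier(chr $code) ? "allowed" : "not allowed";
}
```

So the ideograph passes because it is a letter, while the globe is a symbol and gets rejected by the tokenizer.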
perl 5.21.10 released in Perl News (1 direct reply) by AppleFritter on Mar 20, 2015 at 17:21

Perl 5.21.10, another development release, came out on March 20th (that's today!). Get it on CPAN or on metaCPAN while it's hot! And here's the perldelta as well: Read more... (17 kB) (This is my first time posting a piece of Perl news. If I broke anything, e.g. a link, please `/msg` me and I'll fix it.)
Identifying scripts (writing systems) in Cool Uses for Perl (2 direct replies) by AppleFritter on Sep 16, 2014 at 17:32

Dear monks and nuns, priests and scribes, popes and antipopes, saints and stowaways lurking in the monastery, lend me your ears. (I promise I'll return them.) I'm still hardly an experienced Perl (user|programmer|hacker), but allow me to regale you with a story of how Perl has been helping me Get Things Done™; a Cool Use for Perl, or so I think. I was recently faced with the problem of producing, given a number of lines each written in a specific script (i.e. writing system; Latin, Katakana, Cyrillic etc.), a breakdown of scripts used and how often they appeared. Exactly the sort of problem Perl was made for - and thanks to regular expressions and Unicode character classes, a breeze, right? I started by hardcoding a number of scripts to match my snippets of text against: `my %scripts; foreach (@lines) { my $script = m/^\p{Script=Latin}+$/ ? "Latin" : m/^\p{Script=Cyrillic}+$/ ? "Cyrillic" : m/^\p{Script=Han}+$/ ? "Han" : # ... "(unknown)"; $scripts{$script}++; }` Obviously there's a lot of repetition going on there, and though I had a list of scripts for my sample data, I wasn't sure new and uncontemplated scripts wouldn't show up in the future. So why not make a list of all possible scripts, and replace the hard-coded list with a loop? `my %scripts; LINE: foreach my $line (@lines) { foreach my $script (@known_scripts) { next unless $line =~ m/^\p{Script=$script}+$/; $scripts{$script}++; next LINE; } $scripts{'(unknown)'}++; }` So far, so good, but now I needed a list of the scripts that Perl knew about. Not a problem, I thought, I'll just check perluniprops; the list of properties Perl knows about was staggering, but I eventually decided that any property of the form "`\p{Script: ...}`" would qualify, so long as it had short forms listed (which I took as an indication that that particular property was the "canonical" form for the script in question).
After some reading and typing and double-checking, I ended up with a fairly long list: `my @known_scripts = ( "Arabic", "Armenian", "Avestan", "Balinese", "Bamum", "Batak", "Bengali", "Bopomofo", "Brahmi", "Braille", "Buginese", "Buhid", "Canadian_Aboriginal", "Carian", "Chakma", "Cham", "Cherokee", "Coptic", "Cuneiform", "Cypriot", "Cyrillic", # ... );` Unfortunately, when I ran the resulting script, Perl complained: `Can't find Unicode property definition "Script=Chakma" at (...) line (...)` What had gone wrong? Versions, that's what: I'd looked at the perluniprops page on perl.org, documenting Perl 5.20.0, but this particular Perl was 5.14.2 and didn't know all the scripts that the newer version did, thanks to being built against an older Unicode version. Now, I could've just looked at the locally-installed version of the same perldoc page, but - wouldn't it be nice if the script automatically adapted itself to the Perl version it ran on? I sure reckoned it'd be. What scripts DID the various Perl versions recognize, anyway? What I ended up doing (perhaps there's an easier way) was to look at `lib/unicore/Scripts.txt` for versions 5.8, 5.10, ..., 5.20 in the Perl git repo (I skipped 5.6 and earlier, because a) the relevant file didn't exist in the tree yet back then, and b) those versions are ancient, anyway).
And by "look at", I mean download (as `scripts-58.txt` etc.), and then process: `$ for i in 8 10 12 14 16 18 20; do perl scripts.pl scripts-5$i.txt >5$i.lst; done $ for i in 8 10 12 14 16 18; do diff --unchanged-line-format= --new-line-format=%L 5$i.lst 5$((i+2)).lst >5$((i+2)).new; done $` `scripts.pl` was a little helper script to extract script information (apologies for the confusing terminology, BTW): `#!/usr/bin/perl use strict; use warnings; use feature qw/say/; my %scripts; while(<>) { next unless m/; ([A-Za-z_]+) #/; $scripts{$1}++; } $, = "\n"; say sort { $a cmp $b } map { $_ = ucfirst lc; $_ =~ s/(?<=_)(.)/uc $1/ge; qq/"$_"/ } keys %scripts;` I admit, I got lazy at this point and manually combined those files (`58.lst`, as well as `510.new`, `512.new` etc.) into a hash holding all the information, instead of having a script output it. Nonetheless, once this was done, I could easily load all the right scripts for a given Perl version: `# New Unicode scripts added in Perl 5.xx my %uniscripts = ( '8' => [ "Arabic", "Armenian", "Bengali", "Bopomofo", "Buhid", "Canadian_Aboriginal", "Cherokee", "Cyrillic", "Deseret", "Devanagari", "Ethiopic", "Georgian", "Gothic", "Greek", "Gujarati", "Gurmukhi", "Han", "Hangul", "Hanunoo", "Hebrew", "Hiragana", "Inherited", "Kannada", "Katakana", "Khmer", "Lao", "Latin", "Malayalam", "Mongolian", "Myanmar", "Ogham", "Old_Italic", "Oriya", "Runic", "Sinhala", "Syriac", "Tagalog", "Tagbanwa", "Tamil", "Telugu", "Thaana", "Thai", "Tibetan", "Yi" ], '10' => [ "Balinese", "Braille", "Buginese", "Common", "Coptic", "Cuneiform", "Cypriot", "Glagolitic", "Kharoshthi", "Limbu", "Linear_B", "New_Tai_Lue", "Nko", "Old_Persian", "Osmanya", "Phags_Pa", "Phoenician", "Shavian", "Syloti_Nagri", "Tai_Le", "Tifinagh", "Ugaritic" ], '12' => [ "Avestan", "Bamum", "Carian", "Cham", "Egyptian_Hieroglyphs", "Imperial_Aramaic", "Inscriptional_Pahlavi", "Inscriptional_Parthian", "Javanese", "Kaithi", "Kayah_Li", "Lepcha", "Lisu", "Lycian", "Lydian", "Meetei_Mayek", "Ol_Chiki", "Old_South_Arabian", "Old_Turkic", "Rejang", "Samaritan", "Saurashtra", "Sundanese", "Tai_Tham", "Tai_Viet", "Vai" ], '14' => [ "Batak", "Brahmi", "Mandaic" ], '16' => [ "Chakma", "Meroitic_Cursive", "Meroitic_Hieroglyphs", "Miao", "Sharada", "Sora_Sompeng", "Takri" ], '18' => [ ], '20' => [ ], ); (my $ver = $^V) =~ s/^v5\.(\d+)\.\d+$/$1/; my @known_scripts; foreach (keys %uniscripts) { next if $ver < $_; push @known_scripts, @{ $uniscripts{$_} }; } print STDERR "Running on Perl $^V, ", scalar @known_scripts, " scripts known.\n";`
The number of scripts Perl supports this way WILL increase again soon, BTW. Perl 5.21.1 bumped the supported Unicode version to 7.0.0, adding another bunch of new scripts as a result: `# tentative! '22' => [ "Bassa_Vah", "Caucasian_Albanian", "Duployan", "Elbasan", "Grantha", "Khojki", "Khudawadi", "Linear_A", "Mahajani", "Manichaean", "Mende_Kikakui", "Modi", "Mro", "Nabataean", "Old_North_Arabian", "Old_Permic", "Pahawh_Hmong", "Palmyrene", "Pau_Cin_Hau", "Psalter_Pahlavi", "Siddham", "Tirhuta", "Warang_Citi" ],` But that's still in the future. For now I just tested this on 5.14.2 and 5.20.0 (the two Perls I regularly use); it worked like a charm. All that was left to do was outputting those statistics: `print "Found " . scalar keys(%scripts) . " scripts:\n"; print "\t$_: " , $scripts{$_}, " line(s)\n" foreach(sort { $a cmp $b } keys %scripts);` (You'll note that in the above two snippets, I'm using `print` rather than `say`, BTW. That's intentional: `say` is only available from Perl 5.10 on, and this script is supposed to be able to run on 5.8 and above.) Fed some sample data that I'm sure Perlmonks would mangle badly if I tried to post it, this produced the following output: `Running on Perl v5.14.2, 95 scripts known.
Found 18 scripts: Arabic: 21 line(s) Bengali: 2 line(s) Cyrillic: 12 line(s) Devanagari: 3 line(s) Georgian: 1 line(s) Greek: 1 line(s) Gujarati: 1 line(s) Gurmukhi: 1 line(s) Han: 29 line(s) Hangul: 3 line(s) Hebrew: 1 line(s) Hiragana: 1 line(s) Katakana: 1 line(s) Latin: 647 line(s) Sinhala: 1 line(s) Tamil: 4 line(s) Telugu: 1 line(s) Thai: 1 line(s)` Problem solved! And not only that, it's futureproof now as well, adapting to additional scripts in my input data, and easily extended when new Perl versions support more scripts, while maintaining backward compatibility. What could still be done? Several things. First, I should perhaps find out if there's an easy way to get this information from Perl, without actually doing all the above. Second, while Perl 5.6 and earlier aren't supported right now, they could be. Conveniently, the 3rd edition of Programming Perl* documents Perl 5.6; the `\p{Script=...}` syntax for character classes doesn't exist yet, I think, but one could write `\p{In...}` instead, e.g. `\p{InArabic}`, `\p{InTamil}` and so on. Would this be worth it? Not for me, but the possibility is there if someone else ever had the need to run this on an ancient Perl. (Even more ancient Perls may not have the required level of Unicode support for this, though I wouldn't know for sure.) Lastly, since the point of this whole exercise was to identify writing systems used for snippets of text, there's room for optimization. Perhaps it would be faster to precompile a regular expression for each script, especially if `@lines` is very large. Most of the text I'm dealing with is in the Latin script; as such, I should perhaps test for that before anything else, and generally try to prioritize so that lesser-used scripts are pushed further down the list.
Since I'm already keeping a running total of how often each script has been seen, this could even be done adaptively, though whether doing so would be worth the overhead in practice is another question, one that could only be answered by measuring. But neither speed nor support for ancient Perls is crucial to me, so I'm done. This was a fun little problem to work on, and I hope you enjoyed reading about it.
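Two of those loose ends can be sketched together: asking the running Perl which script names it accepts, and precompiling one regex per script. This is only an illustration under a few assumptions: `supported_scripts` and `make_classifier` are made-up names, the candidate list is cut down to four entries, and each pattern uses a `+` quantifier so that a line of several characters in one script matches.

```perl
use strict;
use warnings;

# Keep only the candidate names this Perl's Unicode tables know about.
# An unknown \p{Script=...} makes the match die, and eval traps that.
sub supported_scripts {
    my @candidates = @_;
    return grep {
        my $name = $_;
        eval { my $hit = 'a' =~ /\p{Script=$name}/; 1 };
    } @candidates;
}

# Precompile one anchored regex per script, then classify whole lines.
sub make_classifier {
    my @tests = map { my $name = $_; [ $name, qr/\A\p{Script=$name}+\z/ ] } @_;
    return sub {
        my ($line) = @_;
        for my $test (@tests) {
            return $test->[0] if $line =~ $test->[1];
        }
        return '(unknown)';
    };
}

my @known    = supported_scripts(qw(Latin Cyrillic Han No_Such_Script));
my $classify = make_classifier(@known);
print scalar(@known), " scripts known\n";
print $classify->("hello"), "\n";
print $classify->("\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}"), "\n";
```

Putting the most common script first in the candidate list (Latin, in my case) keeps the inner loop short for most lines. Whether precompiling actually helps would still need measuring.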
