PerlMonks  

Meditations


If you've discovered something amazing about Perl that you just need to share with everyone, this is the right place.

This section is also used for non-question discussions about Perl, and for any discussions that are not specifically programming related. For example, if you want to share or discuss opinions on hacker culture, the job market, or Perl 6 development, this is the place. (Note, however, that discussions about the PerlMonks web site belong in PerlMonks Discussion.)

Meditations is sometimes used as a sounding-board — a place to post initial drafts of perl tutorials, code modules, book reviews, articles, quizzes, etc. — so that the author can benefit from the collective insight of the monks before publishing the finished item to its proper place (be it Tutorials, Cool Uses for Perl, Reviews, or whatever). If you do this, it is generally considered appropriate to prefix your node title with "RFC:" (for "request for comments").

User Meditations
A Perl 3 bug in the debugger
2 direct replies — Read more / Contribute
by pemungkah
on Sep 20, 2023 at 17:25
    I recently posted about this on my blog, but it's worth a quick post here too.

    There's a fun little bug in the debugger, which you can see like this. Create a dumb little script. Anything will do.

    use strict;
    use warnings;
    use feature 'say';

    say "we";
    say "just";
    say "need";
    say "something";
    say "to";
    say "list";

    Now let's start up the debugger.

    perl -d zz.pl

    Loading DB routines from perl5db.pl version 1.77
    Editor support available.

    Enter h or 'h h' for help, or 'man perldebug' for more help.

    main::(zz.pl:5):    say "we";
      DB<1>

    All as expected, but now:

      DB<1> l 1.2
    1.2     use strict;
      DB<2> l
    2.2     use warnings;
    3.2     use feature 'say';
    4.2
    5.2:    say "we";
    6.2:    say "just";
    7.2:    say "need";
    8.2:    say "something";
    9.2:    say "to";

    That's kind of unexpected, but it gets better!

      DB<2> l 1.1.3.5
    1.1.3.5 use strict;
      DB<3> l
    2.1     use warnings;
    3.1     use feature 'say';
    4.1
    5.1:    say "we";
    6.1:    say "just";
    7.1:    say "need";
    8.1:    say "something";
    9.1:    say "to";

    Why does this happen? Well, it goes back to commit a687059cbaf, which is the one that moves the debugger into lib in Perl 3. The pattern used to capture the line number specification is [\d\$\.]+, which matches all kinds of things, including floating-point numbers, IPv4 addresses, and other junk. The overall pattern used to parse the l command arguments changes over time, but that basic match to extract a "line number" never does.

    You may be thinking, "yeah, okay, I see that, but why does the debugger show the floating-point line number?" The reason is that the line number spec is captured as a string. The debugger stores the source code of the current file in an array whose name is not a valid Perl variable name, and uses the line number spec captured by the l command to index it.

    When Perl indexes an array, the index value is converted to an integer if possible, because array indexes have to be integers. The line spec we have is captured and stored as a string, so when we try to index the source array with it, "1.2" becomes the integer 1, and we find line 1. The command uses the value of the index variable (remember, that's still a string!) to print the line number, and so we end up with a floating-point line number.

    Now, when we run the bare l command, the string "1.2" is still in the list command's "last line listed" variable, and Perl simply takes that variable and adds 1 to its contents to look for the next line. Since the contents are a string that looks like a floating-point number, Perl converts it to a float, adds 1.0 (so we don't downgrade it from a float), and assigns that back to the current line number, so we get lines 2.2, 3.2, and so on.
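
    The two conversions described above can be reproduced outside the debugger. This is only a minimal sketch: @source is a hypothetical stand-in for the debugger's internal source array, and the regex is a cut-down version of the l command parsing, not the real perl5db.pl code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the debugger's source-line array (index 0 unused).
my @source = (undef, 'use strict;', 'use warnings;', "use feature 'say';");

# The line spec is captured by a regex, so it is a string, not an integer.
my ($spec) = 'l 1.2' =~ /^l\s+([\d\$\.]+)/;

# Indexing truncates the numeric value 1.2 to 1, but the label is the string.
my $first = "$spec\t$source[$spec]";

# A bare "l" adds 1 to the remembered value: "1.2" + 1 is the float 2.2.
my $next_line = $spec + 1;
my $second    = "$next_line\t$source[$next_line]";

print "$first\n$second\n";
```

    The index is truncated only at the moment of the array lookup; the variable itself keeps its fractional value, which is why the odd numbering carries on through later listings.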

    I've submitted a patch to fix this for 5.40, but it's pretty surprising that we've had this bug for 32 years!

A small step...and a giant leap for Bod
2 direct replies — Read more / Contribute
by Bod
on Sep 07, 2023 at 06:05

    My 1000th post!

    It's a little under 3 years since I created an account here on Perl Monks. For many years, I'd occasionally visited The Monastery thanks to Google leading me here when I asked a Perl-related question - which was quite frequently. But, back in November 2020, I had a few new projects going on and thought I needed to "raise my game". I had no concept of what that actually meant.

    My expectation was that I'd learn a few new coding styles, be a bit quicker and, perhaps, a bit clearer.

    What has actually happened, and continues to happen, is that my whole approach to writing code has changed...drastically!

    Allow me to illustrate by way of an example...

    Just this week I was writing some code for the admin part of my partner's website Pawsies. We need to be able to upload pictures of dogs that we look after, and it's helpful if those pictures are square so they display consistently.

    The whole website uses Template - something I discovered in The Monastery. The web scripts are in their own directory and not mixed up with the other site files, again a learning from The Monastery. The upshot is sites that are easier to navigate, easier to link to as everything is not in the cgi-bin and easier to maintain.

    However, it goes much further than that.

    In the past, if I wanted square images, I would have hard-coded the logic to produce them into the script that needed them. But this week, I instead wrote a module to do only that operation. The module is where my thought process started; it was not an afterthought. The design started with deciding exactly what it was supposed to do and by jotting down how I would know if it was successful. The basis of a test!

    I then looked to see if there were any extra generalisations that could be made to make it more useful to other people or when I reuse it elsewhere. So, a resizing parameter was added to change the size of the square image and a position parameter was added to determine whereabouts in the original image the square is taken from.

    Only then was the code written followed by the tests...

    Once the tests all ran fine, it was bundled up and uploaded to CPAN for all to use. Currently as a dev release so I get some test results before the production release. It's all looking good...

    Before joining The Monastery this would have been a bit of messy but functional code locked away somewhere in a difficult-to-maintain script. I considered CPAN modules to be for other, superior "proper" coders...not for me. Now I look for the best way to do it for my needs now, for my needs in future when I come to maintain the code or need similar functionality, and for the needs of the wider Perl community, as The Monastery has shown me that I have something to contribute as well as to learn.

    Watch out for a testing question coming soon - one of the tests for Image::Square required visually inspecting the output image and I don't know how to convert that into a usable test...see, lots more to learn and that's something I fully embrace!

    Thank you to everyone who has helped, inspired, questioned and criticised me over the last 1000 posts - it really is appreciated 👍

[NTF] Nice Perl ideas I have no time for
1 direct reply — Read more / Contribute
by Discipulus
on Sep 07, 2023 at 05:32
    Hello dear community,

    With our venerable halls so quiet nowadays, I propose this meditation as a place to share ideas for programs, modules and anything else that we do not have enough time to develop further.

    In my world ideas have no copyright and should instead circulate freely, as there is the chance they are grasped by an enlightened soul who can squeeze the best from them.

    Even more: there are amateur programmers with nice ideas and professional ones with few. It is not something to complain about; we are different brains with different skills, inclinations and.. free hours :)

    I'd like to see at least some demo code for these ideas, with the goal well explained as well as any possible path of implementation or critical parts, not just: /I'll save the world with a oneliner/.

    We can use a tag for these posts like [NTF] (No Time For) and post them in reply to this post or as a new Meditation.

    I'll start with a first one if you are OK with this ..nice perl idea :)

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Looking over the fence
1 direct reply — Read more / Contribute
by karlgoethebier
on Aug 27, 2023 at 15:27

    Clack

    (defvar *handler*
      (clack:clackup
        (lambda (env)
          (declare (ignore env))
          '(200 (:content-type "text/plain") ("Hello, Clack!")))))

    «The Crux of the Biscuit is the Apostrophe»

New built-in perl5.38 try/catch syntax
No replies — Read more | Post response
by eyepopslikeamosquito
on Aug 23, 2023 at 06:16

    After recently installing perl 5.38, I stumbled upon some cool improvements to Perl's built-in try/catch syntax while watching the excellent What's new in Perl v5.38 YouTube talk, delivered by Paul "LeoNerd" Evans at TPRC 2023 Toronto.

    New perl 5.38 use feature 'try'

    • perl v5.34 added try/catch syntax based on CPAN module Syntax::Keyword::Try
    • perl v5.36 added "finally" blocks to try/catch, also inspired by Syntax::Keyword::Try
    • perl v5.36 added use feature 'defer', allowing you to create defer blocks that run at the time that execution leaves the block it's declared inside (which seems to be inspired by the classic RAII programming idiom)
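    • use v5.38 does not enable use feature 'try' by itself (try only joined the feature bundle in v5.40), so it must still be enabled explicitly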

    To get a feel for how all this works in practice, I created a simple example, consisting of two files in a scratch directory, TestTry.pm and trytest.pl, shown below.

    TestTry.pm

    package TestTry;
    use strict;
    use warnings;

    print "TestTry: module load\n";

    sub life {
        my $n = shift;
        defined($n) or die "error: no argument provided";
        print "TestTry::life n='$n'\n";
        $n =~ /^\d+$/ or die "input error: '$n' must consist of digits only";
        $n == 42      or die "Sadly there is no meaning in your life (n=$n)";
        print "TestTry: congrats, your life has meaning!\n";
        print "TestTry::life end\n";
    }

    1;

    trytest.pl

    # trytest.pl - a simple test of new perl 5.38 try syntax:
    # Put TestTry.pm in same dir as trytest.pl and run with:
    #   perl -I . trytest.pl
    # Note: use v5.38 implies use strict and warnings

    use v5.38;

    # use feature 'try';   # throws 'try/catch is experimental' warnings
    use experimental 'try';

    use TestTry;

    sub do_one {
        my $number = shift;
        try {
            TestTry::life($number);
        }
        catch ($e) {
            chomp $e;
            print "trytest: caught '$e'\n";
        }
        finally {
            print "trytest: in finally block\n";
        }
    }

    print "trytest: start\n";
    do_one("invalid");
    do_one(13);
    do_one(42);
    print "trytest: end\n";

    Example run

    With that done, assuming you have perl 5.38 installed, you can run:

    $ perl -I . trytest.pl
    TestTry: module load
    trytest: start
    TestTry::life n='invalid'
    trytest: caught 'input error: 'invalid' must consist of digits only at TestTry.pm line 11.'
    trytest: in finally block
    TestTry::life n='13'
    trytest: caught 'Sadly there is no meaning in your life (n=13) at TestTry.pm line 12.'
    trytest: in finally block
    TestTry::life n='42'
    TestTry: congrats, your life has meaning!
    TestTry::life end
    trytest: in finally block
    trytest: end

    Summary

    I really like this new try/catch syntax and am looking forward to Perl providing built-in exception handling without having to install CPAN modules, such as Try::Tiny and TryCatch.

    Remembering the smartmatch/Switch debacle, I'm also a fan of this new gentler way of introducing experimental new features into the Perl core.

Handling of Unicode File Names
No replies — Read more | Post response
by NERDVANA
on Aug 22, 2023 at 23:27

    The Problem

    I have long been bothered by the problem where I read a directory name which happens to be a UTF-8 representation of unicode, then append a unicode string to that name, then try writing out to that new filename but get an error that the directory does not exist:

    $ perl -E 'mkdir("\x{100}")'
    $ perl -MB -E 'my @d= <*>; say B::perlstring($_) for @d'
    "\304\200"
    $ perl -E 'my ($d)= <*>; open(my $f, ">", "$d/\x{101}.txt") or die "$!"'
    No such file or directory at -e line 1.

    Why? Because Perl passes the scalar to C library's 'open' and that delivers a UTF-8 encoding of the entire string, and the bytes that came from glob (and were never decoded from UTF-8) get their individual UTF-8 bytes encoded as UTF-8 characters.

    Perl expects the user to keep track of which strings are unicode and which strings are bytes, and never mix the two. In the example above, the real problem/bug is that glob returns bytes, and "$d/\x{101}.txt" is mixing bytes with unicode, producing garbage.
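
    To make the double encoding visible without touching the filesystem, here is a minimal sketch. It assumes the bytes handed to the C library are equivalent to a UTF-8 encoding of the mixed string, which is how the failure above behaves; escapestr is just a display helper.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Render every character outside printable ASCII as \x{NN} for display.
sub escapestr {
    my ($s) = @_;
    return $s =~ s/([^\x20-\x7E])/sprintf('\\x{%02X}', ord $1)/egr;
}

my $dir_bytes = "\xC4\x80";      # what glob returns: the UTF-8 bytes of U+0100
my $file_name = "\x{101}.txt";   # a character (unicode) string

# Mixing: the two bytes are silently treated as characters U+00C4 and U+0080,
# so each one is encoded again on the way out.
my $sent_wrong = encode('UTF-8', "$dir_bytes/$file_name");

# Decoding the directory name first keeps everything in characters.
my $dir_chars  = decode('UTF-8', $dir_bytes);
my $sent_right = encode('UTF-8', "$dir_chars/$file_name");

print "mixed:   ", escapestr($sent_wrong), "\n";
print "decoded: ", escapestr($sent_right), "\n";
```

    The mixed version asks the OS for a four-byte directory name that does not exist, which matches the "No such file or directory" error above.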

    While that answer is technically correct, I'm not satisfied with it, because it results in a sub-optimal user experience. A user *ought* to be able to list a directory, and have Unicode, append unicode to it, and write them back out. This process ought to be easy, instead of splattering the code with calls to encode() and decode(). Why can't we have nice things?

    (The problem is even worse on Windows, where you must configure your program to run with the UTF-8 codepage or else you get even worse garbage, since Perl internally uses the ANSI variants of the Win32 API, which replace unrepresentable characters with placeholders.)

    What Does Python Do

    Python 2 had a system where unicode strings were represented differently from ASCII strings, and so the solution in Python 2 was "unicode in, unicode out". In other words, if you call a directory listing with a unicode directory path, all the results come back as unicode strings. So what happens if you try reading an invalid UTF-8 sequence when you requested Unicode return values? It just returns a non-unicode string in the mix with the unicode ones.

    $ python2.7
    Python 2.7.18 (default, Oct 10 2021, 22:29:32)
    [GCC 11.1.0] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> l=os.listdir(".")
    >>> l
    ['\xc4\x80']
    >>> l=os.listdir(u".")
    >>> l
    [u'\u0100']
    (now write a file alongside it which is one correct UTF8 character and one non-utf8 byte)
    $ perl -MB -E 'open(my $f, ">", "\x{C4}\x{80}\x{A0}.txt") or die "$!"'
    $ python2.7
    Python 2.7.18 (default, Oct 10 2021, 22:29:32)
    [GCC 11.1.0] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> l=os.listdir('.')
    >>> l
    ['\xc4\x80\xa0.txt', '\xc4\x80']
    >>> l=os.listdir(u'.')
    >>> l
    ['\xc4\x80\xa0.txt', u'\u0100']
    So, does this API behavior result in a sensible developer experience?
    >>> open(l[1]+'/'+l[0], 'w')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
    The answer to "what happens when you try combining ascii directory with unicode filename" is "it doesn't let you do that". So, that saves the developer from head-scratching i/o errors, and puts the exception closer to the source of the problem.

    Unfortunately, Perl can't adopt this solution because Perl doesn't have a logical separation between Unicode and Ascii strings. (yes there is Perl's utf8 flag, but that's not a logical difference between contents of scalars. References available upon request.)

    But, in Python 3.0, all strings are unicode! (similar in some ways to perl's stance) So what did they do for this situation?

    $ python3
    Python 3.11.3 (main, Jun 5 2023, 09:32:32) [GCC 13.1.1 20230429] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> l=os.listdir('.')
    >>> l
    ['Ā\udca0.txt', 'Ā']
    >>>
    So, er.... they return an invalid representation of the bytes? That is "\x{100}" followed by "\x{DCA0}" in place of the byte "\x{A0}". What is the Unicode 0xDC00 range? It's called the "Low Surrogate Area", and unicode.org says
    Low Surrogate Area
    Range: DC00-DFFF

    Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range.

    See http://www.unicode.org/charts/ for access to a complete list of the latest character code charts.
    ...
    For a complete understanding of high-surrogate code units, low-surrogate code units, and surrogate pairs used for the UTF-16 encoding form, see the appropriate sections of the Unicode Standard
    So basically, Python 3 encodes stray non-utf8 bytes as values in a reserved-for-other-uses set of codepoints which should never appear in a real unicode string. Does it work correctly for round trips?
    >>> open(l[1]+'/'+l[0], "w")
    <_io.TextIOWrapper name='Ā/Ā\udca0.txt' mode='w' encoding='UTF-8'>
    >>> l=os.listdir('\u0100')
    >>> l
    ['Ā\udca0.txt']
    ^d
    $ perl -E '
        sub escapestr { $_[0] =~ s/([^\x20-\x7E])/sprintf("\\x%02X", ord $1)/egr }
        say escapestr($_) for <\x{100}/*>'
    \xC4\x80/\xC4\x80\xA0.txt

    Sure enough, it round-trips those 0xDC00-0xDCFF codepoints back to the single non-unicode bytes they came from.
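
    For comparison, the same round trip can be sketched in plain Perl. This is an illustration only: fs_decode and fs_encode are hypothetical names, not an existing API, and the sketch assumes the input to fs_decode is a byte string. Valid UTF-8 runs are decoded normally, and each stray byte is mapped to 0xDC00 plus the byte value.

```perl
use strict;
use warnings;

# One well-formed UTF-8 encoded character (no overlongs, no surrogates).
my $UTF8_CHAR = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Bytes -> characters; undecodable bytes become U+DC80..U+DCFF.
sub fs_decode {
    my ($bytes) = @_;
    my $chars = '';
    pos($bytes) = 0;
    while (pos($bytes) < length $bytes) {
        if ($bytes =~ /\G((?:$UTF8_CHAR)+)/gc) {
            my $run = $1;
            utf8::decode($run);              # valid run: decode in place
            $chars .= $run;
        }
        else {
            $bytes =~ /\G(.)/gcs;            # consume exactly one stray byte
            $chars .= chr(0xDC00 + ord $1);
        }
    }
    return $chars;
}

# Characters -> bytes; U+DC80..U+DCFF turn back into the original byte.
sub fs_encode {
    my ($chars) = @_;
    my $bytes = '';
    for my $ch (split //, $chars) {
        my $cp = ord $ch;
        if ($cp >= 0xDC80 && $cp <= 0xDCFF) {
            $bytes .= chr($cp - 0xDC00);
        }
        else {
            my $enc = $ch;
            utf8::encode($enc);
            $bytes .= $enc;
        }
    }
    return $bytes;
}

my $name = fs_decode("\xC4\x80\xA0.txt");
printf "U+%04X ", ord $_ for split //, $name;
print "\n";
```

    Only 0xDC80-0xDCFF is mapped back to bytes, because an ASCII byte is always valid UTF-8 and so is never escaped in the first place.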

    What Can We Do In Perl?

    The python3 +0xDC00 solution could be used in Perl to handle non-utf8 characters in a new unicode-friendly API. But, how does this work out alongside our other APIs?

    Let's suppose we add a new feature "unicodefilenames". (hopefully we wouldn't have to type that much, and could eventually lump it in with "use v5.50")

    use feature 'unicodefilenames';
    my ($d)= <*>;
    open(my $f, ">", "$d/\x{101}.txt") or die "$!";
    This works now. But what happens if we pass these file name to other modules in our program?
    package New;
    use v5.42;
    use feature 'unicodefilenames';
    Old::foo($_) for <*>;

    package Old;
    use v5.38;
    sub foo($fname) {
        open my $fh, "<", $fname;
    }

    Whoops. The new unicode names get passed to a module that expects "a filename", and all filenames were previously strings of bytes, so it will get encoded as plain-old-utf8 which doesn't respect the conversion from "\xDCA0" to "\xA0". So, anyone with a European locale having lots of upper Latin-1 will end up with frequent breakage.

    What if Perl handled the "\xDC00" range specially regardless of the feature bit? This would break any old code that had been writing filenames using those characters. But nobody should ever be writing them... because it would only ever occur in a UTF-16 encoding. So the only reason anyone would legitimately want to write them was if they took a UTF-16 encoded string and then further encoded that as utf-8 and wanted it to be a filename.

    Assuming p5p decided that was an acceptable amount of back-compat breakage, what else could go wrong?

    package New;
    use v5.42;
    use feature 'unicodefilenames';
    Old::foo($_) for <*>;

    package Old;
    use v5.38;
    sub foo($fname) {
        my $dir= "tmp\x{85}";
        mkdir $dir or die "$!";
        system("cp -a $fname $dir/$fname") == 0 or die "$!";
    }

    Whoops, there are two bugs here. First, the Old module doesn't know that it is being given a unicode filename. Then, not anticipating this to be a problem, it combines that string with a non-unicode string, resulting in garbage. Then as a second problem, it shells out to a command, and the Perl interpreter has no way of knowing whether this is a "filename" situation where 0xDC00 should be re-interpreted. Keep in mind that people might have all sorts of reasons for passing invalid unicode (or utf-16 codes) as arguments to external programs. (well, maybe not, but it seems a lot more likely than passing them as filenames to filesystem APIs)

    But wait, what does Python do for passing bytes to external programs if all their strings are unicode?

    $ python3
    Python 3.11.3 (main, Jun 5 2023, 09:32:32) [GCC 13.1.1 20230429] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import subprocess
    (Wrapped for readability)
    >>> subprocess.run([
          'perl','-E',
          'sub escapestr { $_[0] =~ s/([^\x20-\x7E])/sprintf("\\x%02X", ord $1)/egr }
           say escapestr($ARGV[0])',
          "\x80"])
    C280
    >>> subprocess.run([
          'perl','-E',
          'sub escapestr { $_[0] =~ s/([^\x20-\x7E])/sprintf("\\x%02X", ord $1)/egr }
           say escapestr($ARGV[0])',
          "\udc80"])
    80

    Woah! Pretty bold there, Python! If you want to pass the byte 0x80 as a parameter to an external program, you'd need to encode it as "\xDC80" in your always-unicode strings. (Or, use the Python3 "bytes" object instead of trying to carry around raw bytes inside unicode strings, which is what all the tutorials teach) Anyway, interesting and all, but I'm guessing this is a step too far for perl 5.

    So back to filenames. What can we do? It looks like the only way we can prevent bugs from erupting everywhere is to keep using strings of plain bytes, with unicode converted to UTF-8 (or perhaps encoded according to locale, if anyone ever uses non-utf8 locales anymore). But, what if we wrap filenames with objects?

    package New;
    use v5.36;
    use Path::UTiny; # imagine a unicode-aware Path::Tiny

    # Create directory named "\xC4\x80"
    path("\x{100}")->mkdir;
    for (path(".")->children) {
        # compares as unicode
        Old::foo($_) if $_->name eq "\x{100}";
    }

    package Old;
    use v5.36;
    sub foo($dir) {
        # stringify to bytes, creates file "\xC4\x80/\x80.txt"
        open my $f, '>', "$dir/\x80.txt";
    }

    This actually works! To be clear, I'm proposing that the path object would track unicode internally (where it could use Python3's trick of remapping the ambiguous bytes) and any time it was coerced to a string by unsuspecting legacy code, or by PerlIO API calls, it would yield the usual UTF-8 bytes.
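
    The stringify-to-bytes half of that proposal can be sketched with overload. Everything here is hypothetical: Sketch::UPath is a made-up package name standing in for the imagined Path::UTiny, and the 0xDC00 remapping of stray bytes is left out to keep it short.

```perl
package Sketch::UPath;    # hypothetical name, not a real CPAN module
use strict;
use warnings;
use overload
    '""'     => sub { $_[0]->as_bytes },   # legacy code sees UTF-8 bytes
    fallback => 1;

sub new {
    my ($class, $chars) = @_;
    return bless { chars => $chars }, $class;
}

# Unicode-aware callers ask for the character form explicitly.
sub name { return $_[0]{chars} }

sub as_bytes {
    my ($self) = @_;
    my $bytes = $self->{chars};
    utf8::encode($bytes);
    return $bytes;
}

package main;

my $dir = Sketch::UPath->new("\x{100}");

# What unsuspecting legacy code does: interpolate and append raw bytes.
my $legacy_path = "$dir/\x80.txt";

printf "%vX\n", $legacy_path;    # byte values, dot-separated, in hex
```

    Because interpolation goes through the overloaded stringification, the legacy concatenation stays entirely in bytes, and only code that calls name sees characters.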

    The downside is that you still can't write

    $path= path("$path/$unicode")
    because that would still be combining unicode with non-unicode. The ".=" operator could be overloaded to return new Path objects, but that might also surprise users when $x .= "/$y" has different results than $x= "$x/$y" so maybe not.

    Conclusion

    I don't see any practical way for Perl 5 to upgrade to unicode filenames in plain strings and native PerlIO functions. It would create about as many problems as it would solve. But, a new path object library that works with unicode internally but stringifies to bytes would have a chance of being useful for working with unicode without breaking too many common assumptions.

Let's make BBQ a Saint!
5 direct replies — Read more / Contribute
by eyepopslikeamosquito
on Aug 04, 2023 at 06:28
Perl's not dead, and neither is the community
2 direct replies — Read more / Contribute
by talexb
on Jul 21, 2023 at 11:31

    Last week, I hosted The Perl and Raku Conference (TPRC) 2023 in Toronto, Canada. We had under a hundred attendees, and we had a three-day schedule of sessions with three tracks. There was also a hackathon Monday and Friday, and Dave Rolsky put on a one-day course in Go on the Friday.

    I've been going to these conferences on and off for about twenty years (2000, 2001, 2002, 2012, 2019 and 2022), so I had a pretty good idea how they work. Putting on my own conference was eye-opening, but what really moved me was the impressive number of volunteers that helped out. There were people who didn't know much about Perl who came out, and I also had speakers jump in to help with A/V setup and all kinds of other details like making up badges. It was fabulous.

    Our keynote speaker was Curtis Poe (Ovid) who talked about Cor, the new object layer that's an experimental feature in Perl 5.38 (just released). We also had Paul Evans (leonerd, the current pumpking) who gave a talk about what was new in this new version of Perl. The talks, as well as a pile of Lightning Talks, are in the process of being edited together and uploaded to YouTube. And next year's conference is already planned for Las Vegas, Nevada in June 2024.

    Yeah, Perl's an old language. But it's still alive and well. :)

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

EyeBall stumps BodBall (Error Handling)
4 direct replies — Read more / Contribute
by eyepopslikeamosquito
on Jul 06, 2023 at 20:18

    However, I will not call die. I find it frustrating when modules die.

    -- Bod in Re^6: STDERR in Test Results

    While I doubt Bod, hailing from (working-class) Coventry UK, would be permitted to enter the hallowed Long Room at Lords to hurl abuse at the Australian cricket team during the Ashes test match last weekend, I'm sure he won't be stumped by this meditation's title ... unlike monks from non-cricket-playing nations, doubtless unfamiliar with Bazball :).

    Bodball, you may recall I once scolded you for asking "what should I test?" long after you'd released your module. I similarly urge you to get into the habit of thinking long and hard about your module's error handling well before you release it, and for the same reasons. Like TDD, it's crucial to perform this error-handling analysis early because doing so will likely change your module's interface.

    Further to the excellent general advice you've already received from afoken, I'm interested to learn more about the errors you commonly encounter in practice when using your Business::Stripe::Webhook module. I also urge you to add an expanded "Error Handling" section to your module's documentation.

    General Error Handling Advice

    Don't fail silently. Failure is inevitable; failing to report failures is inexcusable. Failing silently causes the following problems:

    • Users wonder whether something has gone wrong. ("Why did my order not go through?")
    • Customer support wonders what caused a problem. ("The log file gave no indication of a problem")

    Embrace your software's fallibility. Assume that humans will make mistakes using your software. Try to minimize ways for people to misuse your software, but assume that you can't completely eliminate misuse. Therefore, plan error messages as you design software.

    -- General error handling advice from Google Error Messages course

    Programming Tips

    What should a function do if it cannot perform its allocated task?

    • return a value indicating failure
    • throw an exception
    • terminate the program

    Return failure when:

    • an error is normal and expected (e.g. opening a file)
    • an immediate caller can reasonably be expected to handle the failure

    Throw an exception when:

    • an error is so rare that the programmer is likely to forget to check for it
    • an error cannot be handled by the immediate caller
    • new kinds of errors are added in lower modules that higher level modules were not written to cope with
    • no suitable return path for error codes is available (e.g. semipredicate problem)
    • return path of a function is made uglier by the need to return an error indicator
    • the function that found the error was a callback
    • an error requires an "undo" action (unlike RAII say)

    This is not a black and white issue. Experience and good taste are required.
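
    As a rough illustration of the first two options, here is a minimal sketch. read_first_line and parse_port are hypothetical helpers invented for the example; it uses plain eval and die so that it runs on older perls as well.

```perl
use strict;
use warnings;

# Expected failure: a missing file is normal, so return undef and let the
# immediate caller decide what to do (the reason is left in $!).
# Note that an empty file also yields undef here.
sub read_first_line {
    my ($path) = @_;
    open(my $fh, '<', $path) or return undef;
    my $line = <$fh>;
    close $fh;
    return $line;
}

# Bad input that a caller could easily forget to check for: throw instead.
sub parse_port {
    my ($text) = @_;
    my $shown = defined $text ? $text : '(undef)';
    die "parse_port: '$shown' is not a whole number\n"
        unless defined $text && $text =~ /\A[0-9]+\z/;
    die "parse_port: $text is out of range 1..65535\n"
        unless $text >= 1 && $text <= 65535;
    return $text + 0;
}

# Return value checked right where the call is made.
my $line = read_first_line('/no/such/dir/app.conf');
print "no config file, using defaults\n" unless defined $line;

# The exception can be caught here or by any caller further up the stack.
my $port = eval { parse_port('80a') };
print "caught: $@" unless defined $port;
```

    The return-value version forces a check at every call site, while the exception version lets intermediate callers ignore the problem and leaves the decision to whoever wraps the call in eval.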

    Business::Stripe::Webhook Error Handling

    Though unfamiliar with your Business::Stripe::Webhook domain, I briefly browsed your module's documentation. Good to see you've already written a short "Errors and Warnings" section in its documentation; I suggest you improve and expand this section for the next release.

    AFAICT, your basic error handling strategy is for your methods to set the error property, for example:

    $vars{'error'} = 'Missing payload data'
    with the module user expected to check this error property after calling each method. Is that right?

    I think a clear statement of your overall error-handling strategy, combined with a couple of real-world examples of handling common errors you've experienced when using your module, would be invaluable to your users ... and may cause you to tweak your module's error-handling code and interface ... which is why this step is ideally performed well before release. :)

    See Also

    Updated: minor changes to wording were made shortly after posting. Added more references to the See Also section.

Solving the Long List is Long challenge, finally?
6 direct replies — Read more / Contribute
by marioroy
on Jul 01, 2023 at 03:59

    Chuma posted an interesting Long list is long challenge last year, in October 2022. eyepopslikeamosquito created a separate thread one month later. Many solutions were provided by several folks.

    Well, I continued working on this on and off, as time permitted. My goal was keeping memory consumption low no matter if running a single thread or 20+ threads. Ideally, running more threads should run faster. It turns out that this is possible. Ditto, zero merge overhead as the keys are unique. Just move the elements from all the sub-maps over to a vector for sorting and output.

    In a nutshell, the following is the strategy used for the hash-map solutions in the latest June 2023 refresh.

    1. create many sub-maps and mutexes
    2. parallel single-file via chunking versus parallel list of files
    3. create a hash-value for the key and store the value with the key
    4. determine the sub-map to use by hash-value MOD number-of-maps
    5. there are total 963 sub-maps to minimize locking contention
    6. randomness kicks in, allowing many threads to run
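
    The partitioning idea in steps 3 to 5 can be sketched in a few lines of single-threaded Perl. This is not the C++ implementation: there are no threads or mutexes here, the byte-sum checksum is a deliberately weak stand-in for a real hash function, and the final sort order (count descending, then key ascending) is an assumption.

```perl
use strict;
use warnings;

my $NUM_MAPS = 963;                  # number of sub-maps, as in step 5
my @submaps;
push @submaps, {} for 1 .. $NUM_MAPS;

# Step 4: pick the sub-map by hash-value MOD number-of-maps.
sub submap_index {
    my ($key) = @_;
    return unpack('%32C*', $key) % $NUM_MAPS;   # 32-bit sum of the bytes
}

# A given key always lands in the same sub-map, so there is only ever one
# copy of it. In a threaded version only this sub-map would need locking.
sub add_count {
    my ($key, $count) = @_;
    $submaps[ submap_index($key) ]{$key} += $count;
    return;
}

# Keys are unique across sub-maps, so "merging" is just moving every
# element into one list and sorting it.
sub sorted_totals {
    my @rows;
    for my $map (@submaps) {
        push @rows, map { [ $_, $map->{$_} ] } keys %$map;
    }
    return sort { $b->[1] <=> $a->[1] or $a->[0] cmp $b->[0] } @rows;
}

add_count(@$_) for [ foo => 1 ], [ bar => 3 ], [ foo => 4 ], [ baz => 3 ];
print "$_->[0]\t$_->[1]\n" for sorted_totals();
```

    With many more sub-maps than threads, two threads rarely need the same sub-map at the same moment, which is the contention argument made in steps 5 and 6.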

    I reached my goal of one copy of a given key in memory while processing "get_attributes" no matter the number of threads. The following is an extract from llil_results.txt. There are two new additions: llil4hmap and llil4emh using phmap::flat_hash_map and emhash7::HashMap, respectively.

    C) 552 files (6 * 92), fixed string length = 12

    llil4vec:   0m32.568s   53.9 GB
    llil4map:   0m35.329s    8.8 GB
    llil4hmap:  0m31.339s    8.3 GB
    llil4emh:   0m26.155s    9.4 GB

    ############################################################
    # C) 552 files (6 * 92), fixed string length = 12          #
    ############################################################

    # Memory 53.9 GB
    $ NUM_THREADS=24 taskset -c 0-31 ./llil4vec \
        big* big* big* big* big* big* | cksum
    llil4vec (fixed string length=12) start
    use OpenMP
    use boost sort
    get properties        11.963 secs
    sort properties       13.466 secs
    vector reduce          2.273 secs
    vector stable sort     1.133 secs
    write stdout           3.731 secs
    total time            32.568 secs
    2511908988 1891299111

    # Memory 8.8 GB
    $ NUM_THREADS=24 taskset -c 0-31 ./llil4map \
        big* big* big* big* big* big* | cksum
    llil4map (fixed string length=12) start
    use OpenMP
    use boost sort
    get properties        29.869 secs
    phmap to vector        0.501 secs
    vector stable sort     1.260 secs
    write stdout           3.697 secs
    total time            35.329 secs
    2511908988 1891299111

    # Memory 8.3 GB
    $ NUM_THREADS=24 taskset -c 0-31 ./llil4hmap \
        big* big* big* big* big* big* | cksum
    llil4hmap (fixed string length=12) start
    use OpenMP
    use boost sort
    get properties        25.893 secs
    hmap to vector         0.448 secs
    vector stable sort     1.276 secs
    write stdout           3.721 secs
    total time            31.339 secs
    2511908988 1891299111

    # Memory 9.4 GB
    $ NUM_THREADS=24 taskset -c 0-31 ./llil4emh \
        big* big* big* big* big* big* | cksum
    llil4emh (fixed string length=12) start
    use OpenMP
    use boost sort
    get properties        20.733 secs
    emhash to vector       0.424 secs
    vector stable sort     1.257 secs
    write stdout           3.653 secs
    total time            26.155 secs
    2511908988 1891299111

    The map solutions continue running faster given more threads. Here, the test machine has 64 logical threads.

    # Memory 9.7 GB
    $ NUM_THREADS=64 ./llil4emh \
        big* big* big* big* big* big* | cksum
    llil4emh (fixed string length=12) start
    use OpenMP
    use boost sort
    get properties        11.563 secs
    emhash to vector       0.463 secs
    vector stable sort     1.259 secs
    write stdout           3.787 secs
    total time            17.074 secs
    2511908988 1891299111

    The "llil4map" variant uses "phmap::parallel_flat_hash_map". In that demonstration, the sub-maps and mutexes are handled by the C++ library. The new "llil4hmap" and "llil4emh" variants handle sub-maps and locking at the application level. This allowed me to try alternative hash-map libraries, but requires thread safety at the application level. Not a problem. It takes just few lines of code to ensure thread-safety.

    Thank you, Gregory Popovitch. He identified the last off-by-one error in my C++ chunking logic and shared a couple of suggestions. See issue 198. Thank you, eyepopslikeamosquito, for introducing me to C++. Thank you, anonymous monk. There, our anon-friend mentioned the word parallel. So, we tried running parallel in C++. Eventually, chunking too. :)

Fishnet is not a color
2 direct replies — Read more / Contribute
by ambrus
on Jun 21, 2023 at 09:38
Please review documentation of my AI::Embedding module
2 direct replies — Read more / Contribute
by Bod
on Jun 02, 2023 at 17:26

    Could you please take a look at the documentation for my new module and let me know if it makes sense? I always find that I am too close to the module and know what everything is supposed to do. In short, I have the Curse of Knowledge!

    Here is the documentation

    Why is it you only find typos after publishing?
    The second raw_embedding method should read test_embedding in both the heading and the sample code. I've corrected this error now.

    Thank you greatly for helping me get this right...

    Edit:

    Changed title from "RFC - Documentation Review" to "Please review documentation of my AI::Embedding module" as considered by erzuuli

Vote For Perl
3 direct replies — Read more / Contribute
by harangzsolt33
on May 29, 2023 at 16:30
CPAN namespace for AI Embedding module
1 direct reply — Read more / Contribute
by Bod
on May 28, 2023 at 17:41

    Wise Monks...

    Having written some code that uses Embeddings to compare pieces of text, I feel this would be useful to others. The Embeddings are generated from the OpenAI API at present.

    I plan to package this up into a module for CPAN and would like some advice on the namespace for this module...

    There is already OpenAI::API::Request::Embedding, which is just a thin wrapper to the API. I don't want to use the OpenAI namespace because my module will probably allow other Embedding providers to be used. For example, Hugging Face provides a cheaper but less precise Embeddings API. This may be better suited to some users.

    As well as providing the connection to the API, my module will also have a method to compare two pieces of text, so it offers more functionality than a thin wrapper.
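
    The post does not say how the comparison is done. A common choice for comparing embeddings is cosine similarity, so here is a minimal sketch under that assumption. The function name cosine_similarity and the three tiny vectors are invented for illustration; they are not the module's API, and real embeddings have far more dimensions.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Cosine similarity between two equal-length, non-empty numeric vectors
# (passed as array references). Values near 1 mean "pointing the same way".
sub cosine_similarity {
    my ( $u, $v ) = @_;
    die "vectors must be non-empty and of equal length\n"
        unless @$u && @$u == @$v;

    my $dot  = sum( map { $u->[$_] * $v->[$_] } 0 .. $#$u );
    my $norm = sqrt( sum( map { $_**2 } @$u ) ) * sqrt( sum( map { $_**2 } @$v ) );
    die "cannot compare a zero vector\n" unless $norm;

    return $dot / $norm;
}

# Made-up vectors standing in for the embeddings of three texts.
my @cat    = ( 1, 2, 0 );
my @kitten = ( 2, 4, 0 );
my @car    = ( 0, 0, 3 );

printf "cat vs kitten: %.3f\n", cosine_similarity( \@cat, \@kitten );
printf "cat vs car:    %.3f\n", cosine_similarity( \@cat, \@car );
```

    In this toy data the first pair points in the same direction and scores 1.000, while the second pair is orthogonal and scores 0.000.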

    I've looked at the AI namespace, e.g. AI::XGBoost. There is also Text::AI::CRM114.

    As my module will connect to several different API providers, I am thinking AI::Embedding might be the right name for it, but I am not convinced. Your opinions and advice would be greatly appreciated.

Risque Romantic Rosetta Roman Race
7 direct replies — Read more / Contribute
by eyepopslikeamosquito
on May 10, 2023 at 03:17

    I've finally got around to extending my long-running Perl vs C++ Performance series by timing some Roman to Decimal Rosetta PGA-TRAM code on Ubuntu.

    Generating the Test Data

    You'll need to install the Roman module from CPAN (or simply copy Roman.pm locally) to generate the test data by running:

    # gen-roman.pl
    use strict;
    use warnings;
    use Roman;
    for my $n (1..1000) {
        for my $i (1..3999) {
            my $r = int(rand(2)) ? uc(roman($i)) : lc(roman($i));
            print "$r\n";
        }
    }
    with:
    perl gen-roman.pl >t1.txt

    which will generate a test file t1.txt containing 3,999,000 Roman Numerals.

    Running the Benchmark

    With that done, you can run rtoa-pgatram.pl below (derived from Rosetta PGA-TRAM) with:

    $ perl rtoa-pgatram.pl t1.txt >pgatram.tmp
    which produced on my laptop:
    rtoa pgatram start
    read_input_files : 1 secs
    roman_to_arabic : 7 secs
    output : 0 secs
    total : 8 secs

    rtoa-pgatram.pl

    # rtoa-pgatram.pl
    # Example run:  perl rtoa-pgatram.pl t1.txt >pgatram.tmp
    #
    # Convert a "modern" Roman Numeral to its arabic (decimal) equivalent.
    # The alphabetic input string may be assumed to always contain a valid Roman Numeral in the range 1-3999.
    # Roman numerals may be upper or lower case.
    # Error handling is not required.
    # For example:
    #   input "XLII" should produce the arabic (decimal) value 42
    #   input "mi"   should produce the arabic (decimal) value 1001

    use 5.010;           # Needed for state
    use strict;
    use warnings;
    use List::Util qw(reduce);

    sub read_input_files {
        my $files = shift;    # in:  reference to a list of files containing Roman Numerals (one per line)
        my @list_ret;         # out: reference to a list of the Roman Numerals in the files
        for my $fname ( @{$files} ) {
            open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
            while (<$fh>) {
                chomp;
                push @list_ret, uc($_);
            }
            close($fh) or die "error: close '$fname': $!";
        }
        return \@list_ret;
    }

    # Function roman_to_arabic
    # Input:  reference to a list of valid Roman Numerals in the range 1..3999
    # Output: reference to a list of their arabic (decimal) values
    sub roman_to_arabic {
        my $list_in = shift;    # in:  reference to a list of valid Roman Numerals
        my @list_ret;           # out: a list of their integer values
        # Note: initialising a state hash from a list needs perl 5.28 or later
        state %rtoa = ( M=>1000, D=>500, C=>100, L=>50, X=>10, V=>5, I=>1 );
        for (@{$list_in}) {
            push @list_ret, reduce { $a+$b-$a%$b*2 } map { $rtoa{$_} } split//, uc($_);
        }
        return \@list_ret;
    }

    @ARGV or die "usage: $0 file...\n";
    my @rtoa_files = @ARGV;
    warn "rtoa pgatram start\n";

    # Time each phase separately (whole seconds, via time)
    my $tstart1 = time;
    my $aref1 = read_input_files( \@rtoa_files );
    my $tend1 = time;
    my $taken1 = $tend1 - $tstart1;
    warn "read_input_files : $taken1 secs\n";

    my $tstart2 = time;
    my $aref2 = roman_to_arabic($aref1);
    my $tend2 = time;
    my $taken2 = $tend2 - $tstart2;
    warn "roman_to_arabic : $taken2 secs\n";

    my $tstart3 = time;
    for my $n ( @{$aref2} ) { print "$n\n" }
    my $tend3 = time;
    my $taken3 = $tend3 - $tstart3;
    my $taken = $taken1 + $taken2 + $taken3;
    warn "output : $taken3 secs\n";
    warn "total : $taken secs\n";
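
    The fold $a+$b-$a%$b*2 is terse, so here is one possible reading of it, offered as an interpretation rather than as the author's own explanation. For valid numerals, the running total modulo the next symbol's value appears to be non-zero only when the previous symbol was a smaller, subtractive one (the X in XL, say). That smaller value has already been added, so it is subtracted twice: once to undo the add and once to apply the subtraction. The helper names roman_fold and roman_loop below are made up for this sketch.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

my %rtoa = ( M => 1000, D => 500, C => 100, L => 50, X => 10, V => 5, I => 1 );

# The fold from roman_to_arabic, applied to a single numeral.
sub roman_fold {
    my ($roman) = @_;
    return reduce { $a + $b - $a % $b * 2 } map { $rtoa{$_} } split //, uc $roman;
}

# The same arithmetic written as an explicit loop (for reading only).
sub roman_loop {
    my ($roman) = @_;
    my $total = 0;
    for my $value ( map { $rtoa{$_} } split //, uc $roman ) {
        my $smaller = $total % $value;       # e.g. 10 when L follows X in "XL"
        $total += $value - 2 * $smaller;     # undo the earlier add, then subtract
    }
    return $total;
}

for my $roman (qw(XLII mi MCMXCIV)) {
    printf "%-8s %4d %4d\n", $roman, roman_fold($roman), roman_loop($roman);
}
```

    Both helpers give 42 for "XLII" and 1001 for "mi", matching the examples in the script's header comment, and 1994 for "MCMXCIV". The sketch assumes valid input only, as the original does.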

    I was relieved that rtoa-pgatram.pl ran a little faster than rtoa-roman.pl, which is just a copy of rtoa-pgatram.pl that uses Roman's arabic function instead of the pgatram algorithm; that is, with:

    push @list_ret, reduce { $a+$b-$a%$b*2 } map { $rtoa{$_} } split//, uc($_);
    above replaced with:
    use Roman;
    ...
    push @list_ret, arabic($_);

    $ perl rtoa-roman.pl t1.txt >roman.tmp
    rtoa roman start
    read_input_files : 1 secs
    roman_to_arabic : 11 secs
    output : 1 secs
    total : 13 secs
    $ diff roman.tmp pgatram.tmp

    Please feel free to reply with alternative Perl roman_to_arabic subroutines, especially if they are faster. Roman to Arabic subroutines in other languages are also welcome.

