PerlMonks  

Regex Substring SORT Conundrum

by Polyglot (Chaplain)
on Mar 05, 2015 at 15:38 UTC ( [id://1118905]=perlquestion )

Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks,

I've looked at some posts here on a substring sort HERE and HERE which seem inapplicable to my case, but which do indeed present some of the difficulty of this sort (no pun intended) of task.

I need to create a particular sort order for Biblical citations, based on the chronological order in which they would normally occur in the Bible. This order has no relationship to either numerical or alphabetical values. Each reference must be sorted principally by the Book Title, but the sort must also take into account the chapter(s) and verse(s) (if present) which follow the book.

Here is a sample of the references needing to be sorted, and how I would hope it might look after being sorted.

Biblical References
Unsorted Input:
    Acts 4:29-31
    Numbers 14:10
    Luke 1:20
    John 16:15
    Acts 2:4
    1 Peter 1:22
    Psalm 56:3
    2 Corinthians 12:2, 4, 1, 11
    Ephesians 3:18, 19
    Ephesians 4:14, 13, 17, 18; 5:15, 16
    Matthew 24:24
    Colossians 2:6-8
    Hebrews 10:35-39
    Hebrews 4:10-12
    Philippians 1:6, 27-29
    Matthew 7:6-12, 15
    Philippians 2:13-15

Sorted Output:
    Numbers 14:10
    Psalm 56:3
    Matthew 7:6-12, 15
    Matthew 24:24
    Luke 1:20
    John 16:15
    Acts 2:4
    Acts 4:29-31
    2 Corinthians 12:2, 4, 1, 11
    Ephesians 3:18, 19
    Ephesians 4:14, 13, 17, 18; 5:15, 16
    Philippians 1:6, 27-29
    Philippians 2:13-15
    Colossians 2:6-8
    Hebrews 4:10-12
    Hebrews 10:35-39
    1 Peter 1:22

Ideally, the sort should be done by the book first, and secondarily by its chapter/verse values. To begin with, I created a hash identifying each book and assigning it a numerical value for sort purposes. For now, I'm content with any permutation of the book reference having an equal value. A snippet follows.

sub byBiblicalBookOrder {
    my %sortdata = (
        'Genesis' => '100', 'Gen' => '100', 'Ge' => '100', 'Gn' => '100',
        'GEN' => '100', 'GE' => '100', 'GN' => '100', 'GENESIS' => '100',
        'Exodus' => '200', 'Exo' => '200', 'Ex' => '200', 'Exod' => '200',
        'EXO' => '200', 'EX' => '200', 'EXOD' => '200', 'EXODUS' => '200',
        'Leviticus' => '300', 'Lev' => '300', 'Le' => '300', 'Lv' => '300',
        'LEV' => '300', 'LE' => '300', 'LV' => '300', 'LEVITICUS' => '300',
        ...
        'Revelation' => '9000', 'Rev' => '9000', 'Re' => '9000',
        'The Revelation' => '9000', 'Apocalypse' => '9000', 'Apoc' => '9000',
        'REVELATION' => '9000', 'REV' => '9000', 'RE' => '9000',
        'THE REVELATION' => '9000', 'APOCALYPSE' => '9000', 'APOC' => '9000'
    );
    if ($sortdata{$a} > $sortdata{$b}) { return 1 }
    elsif ($sortdata{$b} > $sortdata{$a}) { return -1 }
    else { return 0 }
} # END SUB byBiblicalBookOrder

But the problem is that perl's $a and $b also include extraneous book and chapter numbers for each reference, and therefore are not matched anywhere in the hash. To try to sort by the book names only, I tried calling the script like this:

print sort byBiblicalBookOrder map{/^(.*)(?>(?:\s.{1,5}(:\d+)*.*))/} @references;

...which didn't work. So far, anything I've tried either yields an alphabetical sort, a truncated sort (no chap/verse), or an unsorted result.

I wish this did not challenge me so much. It shows I still have sooo much to learn before I can feel I've really come close to proper utilization of the Perl language.

Your wisdom is much appreciated!

Blessings,

~Polyglot~

Replies are listed 'Best First'.
Re: Regex Substring SORT Conundrum
by choroba (Cardinal) on Mar 05, 2015 at 15:59 UTC
    Creating the hash with "weights" for books is definitely a good idea. Now, you just have to parse each citation:
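The parsing step could look something like the sketch below. It is not the code from the original reply. The abbreviated `%order` table, the `parse_citation` helper, and the regex are all assumptions made for illustration, and only the first chapter:verse pair of each citation is considered.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated rank table; a real one would list every book
# (and every spelling variant) that can appear in the input.
my %order = (
    'Numbers' => 4,  'Psalm' => 19,   'Matthew' => 40,
    'Acts'    => 44, '1 Peter' => 60,
);

# Split a citation into (book, chapter, verse).
# The verse defaults to 0 when the citation has none.
sub parse_citation {
    my ($citation) = @_;
    my ($book, $chapter, $verse) =
        $citation =~ /^\s*(.+?)\s+(\d+)(?::(\d+))?/
        or return ('', 0, 0);
    return ($book, $chapter, $verse // 0);
}

# Comparator: book rank first, then chapter, then first verse.
# Unknown books get rank 0 and therefore sort to the front.
sub biblically {
    my ($book_a, $chap_a, $verse_a) = parse_citation($a);
    my ($book_b, $chap_b, $verse_b) = parse_citation($b);
    return ($order{$book_a} // 0) <=> ($order{$book_b} // 0)
        || $chap_a  <=> $chap_b
        || $verse_a <=> $verse_b;
}

my @unsorted = ('Acts 4:29-31', 'Numbers 14:10', 'Acts 2:4',
                '1 Peter 1:22', 'Psalm 56:3',    'Matthew 24:24');
my @sorted = sort biblically @unsorted;
print "$_\n" for @sorted;
```

Parsing inside the comparator re-runs the regex on every comparison, which is fine for short lists and is the cost the Schwartzian transform avoids.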

    If the lists are very long, you might preprocess the input by so called Schwartzian transform:

    sub biblically {
        $order{$a->[0]} <=> $order{$b->[0]}
            || $a->[1] <=> $b->[1]
            || $a->[2] <=> $b->[2]
    }

    my @sorted = map $_->[-1],
                 sort biblically
                 map [/(.*?) ([0-9]+):([0-9]+)/, $_],
                 @unsorted;

    You didn't specify how to sort citations like

    Ephesians 3:18, 19
    Ephesians 3:18-21

    To handle all such cases, the biblically subroutine might get even more complicated.
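One possible convention, offered only as a sketch: pull out every number after the book name and compare the two lists element by element, so that `3:18, 19` sorts before `3:18-21`. The helper names `citation_numbers` and `compare_number_lists` are hypothetical, and treating a range such as `18-21` as simply the numbers 18 and 21 is an assumption, not something the thread settled.

```perl
use strict;
use warnings;

# Return every number that follows the book name, in order of appearance.
# Assumes the book is an optional single leading digit plus non-digit text.
sub citation_numbers {
    my ($citation) = @_;
    (my $rest = $citation) =~ s/^\s*(?:\d\s+)?\D+//;   # drop the book name
    return $rest =~ /(\d+)/g;
}

# Compare two number lists element by element.
# When one list is a prefix of the other, the shorter list sorts first.
sub compare_number_lists {
    my ($x, $y) = @_;                                   # two array references
    my $shared = scalar(@$x) < scalar(@$y) ? scalar(@$x) : scalar(@$y);
    for my $i (0 .. $shared - 1) {
        my $cmp = $x->[$i] <=> $y->[$i];
        return $cmp if $cmp;
    }
    return scalar(@$x) <=> scalar(@$y);
}

my @first  = citation_numbers('Ephesians 3:18, 19');    # 3, 18, 19
my @second = citation_numbers('Ephesians 3:18-21');     # 3, 18, 21
my $result = compare_number_lists(\@first, \@second);
print "$result\n";                                      # prints -1
```

The book comparison would still come first. This only replaces the chapter and verse tie-breakers.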

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Thank you so much for your prompt and complete reply, with a working example included. I much appreciate it. I tested your example, and it worked fine. However, when I tried to adapt my code to follow your solution, I was still not seeing the results I had hoped (just couldn't seem to get it to work). I don't understand "map" very well at all, and that may be part of my weakness. Yours seemed like a good solution, and perhaps more proper, but I'll be happy with anything that works. In my case, efficiency is not important, as this is not going to run very often, so anything that works, however kludgy it might be, will suit. (Terrible thing to say, I suppose, but sometimes my time is more valuable than the computer's.)

      Blessings,

      ~Polyglot~

        Some of the results that sorted wrongly for you probably had a format different to what you showed here. Can you check that? Also, map is just a for in disguise: in this particular example, it takes the citations, and maps each onto an array reference, containing the book, chapter, verse, and in case of the ST, the citation itself to be retrieved after sorting.
Re: Regex Substring SORT Conundrum
by BrowserUk (Patriarch) on Mar 05, 2015 at 16:11 UTC

    You didn't post a complete name/number mapping, so you'd need to do that yourself. Switching this line:

    $a[ 0 ] cmp $b[ 0 ]

    To something like:

    $nameToNumber{ $a[ 0 ] } <=> $nameToNumber{ $b[ 0 ] }

    should do it, but this might get you started:

    #! perl -sw
    use strict;

    print for sort {
        my @a = $a =~ m[(\w+)\s+(\d+)(?::(\d+))?];
        my @b = $b =~ m[(\w+)\s+(\d+)(?::(\d+))?];
        $a[ 0 ] cmp $b[ 0 ] or $a[ 1 ] <=> $b[ 1 ] or $2 and $a[ 2 ] <=> $b[ 2 ] or 0;
    } <DATA>;

    __DATA__
    Acts 4:29-31
    Numbers 14:10
    Luke 1:20
    John 16:15
    Acts 2:4
    1 Peter 1:22
    Psalm 56:3
    2 Corinthians 12:2, 4, 1, 11
    Ephesians 3:18, 19
    Ephesians 4:14, 13, 17, 18; 5:15, 16
    Matthew 24:24
    Colossians 2:6-8
    Hebrews 10:35-39
    Hebrews 4:10-12
    Philippians 1:6, 27-29
    Matthew 7:6-12, 15
    Philippians 2:13-15

    Produces:

    C:\test>junk66
    Acts 2:4
    Acts 4:29-31
    Colossians 2:6-8
    2 Corinthians 12:2, 4, 1, 11
    Ephesians 3:18, 19
    Ephesians 4:14, 13, 17, 18; 5:15, 16
    Hebrews 4:10-12
    Hebrews 10:35-39
    John 16:15
    Luke 1:20
    Matthew 7:6-12, 15
    Matthew 24:24
    Numbers 14:10
    1 Peter 1:22
    Philippians 1:6, 27-29
    Philippians 2:13-15
    Psalm 56:3

    Of course, then you can save a (tiny) bit of time switching to a ST or GRT; but there aren't enough books in the Book to worry about that.
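For readers who have not met the GRT (Guttman-Rosler Transform), here is a minimal sketch of the idea: prefix each citation with a fixed-width packed key, sort the strings with the default comparison, then strip the key. The `%rank` table is a hypothetical, abbreviated stand-in for the full book mapping, and packing with `n` assumes every rank, chapter, and verse fits in 16 bits.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated rank table standing in for the full mapping.
my %rank = ('Numbers' => 4, 'Psalm' => 19, 'Luke' => 42, 'Acts' => 44);

my @unsorted = ('Acts 4:29-31', 'Luke 1:20', 'Acts 2:4',
                'Numbers 14:10', 'Psalm 56:3');

# 1. Prefix each citation with a 6-byte key: three big-endian 16-bit
#    integers (book rank, chapter, first verse). Big-endian byte order
#    makes plain string comparison agree with numeric comparison.
my @keyed = map {
    my ($book, $chapter, $verse) = /^(.+?)\s+(\d+)(?::(\d+))?/;
    pack('n3', $rank{$book} // 0, $chapter // 0, $verse // 0) . $_;
} @unsorted;

# 2. Sort with the default string comparison (no sort block needed),
# 3. then strip the 6-byte key to recover the original citations.
my @sorted = map { substr($_, 6) } sort @keyed;

print "$_\n" for @sorted;
```

The gain over a Schwartzian transform comes from letting `sort` run without a Perl-level comparison block. For a few dozen citations the difference is negligible.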


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      I have the full 90 books (including the Apocryphal ones, just in case they're ever needed) in a hash of well over 1000 lines. I didn't figure anyone would wish to peruse it for their reading pleasure, so I shortened it. However, if anyone else wants it, I suppose I could dump the whole thing to a code box here. I guess I'm not sure what the forum does with way-too-long posts. I don't think I'd want it reaped. :) (And I haven't figured out how to make such nice additions to my home node as some posters here seem to do with extra bits like this.)

      Thank you for your suggestions.

      Blessings,

      ~Polyglot~

        a hash of well over 1000 lines.

        NP. I was just explaining why I hadn't dealt with that bit. So long as you understand where and how to use your hash in my snippet, is all that matters.


Re: Regex Substring SORT Conundrum
by Ea (Chaplain) on Mar 05, 2015 at 16:09 UTC
    Those posts really are what you need, the Schwartzian Transform. You've just got a more complicated use case. First you'll want to map all the fields that you want to sort on
    map { /^(.*)(?>(?:\s.{1,5}(:\d+)*.*))/; [$_, $sortdata{$1}, $2, $3] }
    (I haven't checked the correctness of the regex)

    then the sort becomes as easy as

    sort { $a->[1] <=> $b->[1] || $a->[2] <=> $b->[2] || $a->[3] <=> $b->[3] }
    and then remap the values with map { $_->[0] }. This of course is done in reverse order, but right to left is how I think of constructing the Transform.

    Does that make sense?

    Sometimes I can think of 6 impossible LDAP attributes before breakfast.

      I guess map is where my weakness lies. If only I could understand it. There are some things I've never been able to grasp, it seems. Things like map, grep, and, above all, object references would be at the top of the list. I'm thankful that I seem to have had no trouble understanding regex. Thank you for your pointers, though; I really appreciate all the helpfulness here.

      Blessings,

      ~Polyglot~

        map is a transforming filter: it takes the right side (after the code block), transforms it according to the code block, and spits the transformed pieces out the other side. Think of it like an assembly line:

        • a bin of widgets is dumped into the hopper at the beginning of the line
        • A whatzit is added to the widget to make a whatever
        • a bin of whatevers is left at the end of the line.
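As a concrete version of the assembly line, the sketch below turns a bin of citations into a bin of book names, first with `map` and then with the equivalent `for` loop. The variable names and the regex are illustrative only.

```perl
use strict;
use warnings;

# The bin of widgets dumped into the hopper.
my @widgets = ('Acts 2:4', 'Luke 1:20', 'John 16:15');

# map runs the block once per element, with the element in $_, and
# collects whatever the block returns into a new list.
my @whatevers = map {
    my ($book) = /^(.+?)\s+\d/;   # capture the text before the chapter number
    $book;
} @widgets;

# The same transformation written as an explicit loop.
my @by_loop;
for my $widget (@widgets) {
    my ($book) = $widget =~ /^(.+?)\s+\d/;
    push @by_loop, $book;
}

print "@whatevers\n";   # prints: Acts Luke John
```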

        Does that help explain it at all?

        --MidLifeXis

        It's definitely worth the effort. Let me see if I can break down the map I gave (and maybe clean it up a bit). Starting from
        map { /^(.*)(?>(?:\s.{1,5}(:\d+)*.*))/; [$_, $sortdata{$1}, $2, $3] }
        you should read it like
        map { my ($book, $chapter, $verse) = /^(.*)(?>(?:\s.{1,5}(:\d+)*.*))/; [$_, $sortdata{$book}, $chapter, $verse] }
        (I'm taking for granted that your regex works like that) For each element of your list, it
        1. assigns the element to $_
        2. applies the regex and assigns the captures to ($book, $chapter, $verse)
        3. creates an anonymous array [ ] with the original element at index 0 and the other 3 which you'll use in the sort
        4. passes it on to the next process
        so from your first list, you get a list of anonymous arrays for your sort, whose fields you can reference using $_->[2] for the chapter, etc. Also remember that $a->[1] <=> $b->[1] sorts numerically while $a->[1] cmp $b->[1] sorts alphabetically. The map on the other side of the sort takes the list of sorted anonymous arrays and reduces it to the original elements, now sorted.
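The steps above can be sketched as three separate statements, which may be easier to follow than the chained form. This is an illustration under assumptions: the regex is a simpler one than the one quoted above, `%sortdata` holds only the four values shown in the original post, and the sample references are made up for the demonstration.

```perl
use strict;
use warnings;

# Only the ranks quoted in the original post; the real hash is much longer.
my %sortdata = (Genesis => 100, Exodus => 200, Leviticus => 300,
                Revelation => 9000);

# Made-up sample input for this sketch.
my @references = ('Revelation 3:20', 'Exodus 20:3', 'Genesis 1:1',
                  'Exodus 3:14',     'Leviticus 19:18');

# Step 1 (decorate): build [original, book rank, chapter, verse] per element.
my @decorated = map {
    my ($book, $chapter, $verse) = /^(.+?)\s+(\d+):(\d+)/;
    [ $_, $sortdata{$book}, $chapter, $verse ];
} @references;

# Step 2 (sort): compare the precomputed fields numerically.
my @ordered = sort {
       $a->[1] <=> $b->[1]
    || $a->[2] <=> $b->[2]
    || $a->[3] <=> $b->[3]
} @decorated;

# Step 3 (undecorate): keep only the original string from index 0.
my @sorted = map { $_->[0] } @ordered;

print "$_\n" for @sorted;
```

Chaining the three statements right to left (`map`, then `sort`, then `map`) gives the usual one-expression form of the transform.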

        A better explanation of the Transform is in the Modern Perl book

Re: Regex Substring SORT Conundrum
by bitingduck (Chaplain) on Mar 05, 2015 at 16:08 UTC

    You can extract the book name by itself with a regex, and use the captured names as keys for the hash:

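    sub byBiblicalBookOrder {
        # ... %sortdata declared here, as in the original post ...
        $a =~ /^((\d+ )?\w+)/;      # optional leading number, then the book name
        my $abook = $1;
        $b =~ /^((\d+ )?\w+)/;
        my $bbook = $1;
        if ($sortdata{$abook} > $sortdata{$bbook}) { return 1 }
        elsif ($sortdata{$bbook} > $sortdata{$abook}) { return -1 }
        else { return 0 }
    } # END SUB byBiblicalBookOrder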

    This doesn't affect the contents of $a or $b themselves, so you can then do the additional sorting based on regexes that pull out the trailing sets of numbers.

    edit: if you just want the book names and not the numbers, the regex would be something like /^((\d+ )?(\w+))/ and you'd take the value of $3 to get the book name without the leading number.

      This worked for me. It only sorted to the book level, so I still need to play with the numbers afterward. But this is a great start. Thank you so much!

      Blessings,

      ~Polyglot~

Re: Regex Substring SORT Conundrum
by hdb (Monsignor) on Mar 05, 2015 at 22:28 UTC

    Once you have extracted the book and the chapter, you can combine the two into a single number by multiplying the book rank by, say, 10000 and adding the chapter. Now use the Schwartzian transform to sort the citations:

    use strict;
    use warnings;

    my $ctr=0;
    my %books=map{$_=>++$ctr}(
        'Genesis','Exodus','Leviticus','Numbers','Deuteronomy','Joshua',
        'Judges','Ruth','1 Samuel','2 Samuel','1 Kings','2 Kings',
        '1 Chronicles','2 Chronicles','Ezra','Nehemiah','Esther',
        'Job','Psalm','Proverbs','Ecclesiastes','Song of Solomon',
        'Isaiah','Jeremiah','Lamentations','Ezekiel','Daniel',
        'Hosea','Joel','Amos','Obadiah','Jonah','Micah','Nahum',
        'Habakkuk','Zephaniah','Haggai','Zechariah','Malachi',
        'Matthew','Mark','Luke','John','Acts','Romans','1 Corinthians',
        '2 Corinthians','Galatians','Ephesians','Philippians',
        'Colossians','1 Thessalonians','2 Thessalonians','1 Timothy',
        '2 Timothy','Titus','Philemon','Hebrews','James','1 Peter',
        '2 Peter','1 John','2 John','3 John','Jude','Revelation');

    my @sorted=
        map{$_->[0]}
        sort{$a->[1]<=>$b->[1]}
        map{chomp;/^(\d?\s?\w+)\s+(\d+)/;[$_,10000*$books{$1}+$2]}
        <DATA>;

    print "$_\n" for @sorted;

    __DATA__
    Acts 4:29-31
    Numbers 14:10
    Luke 1:20
    John 16:15
    Acts 2:4
    1 Peter 1:22
    Psalm 56:3
    2 Corinthians 12:2, 4, 1, 11
    Ephesians 3:18, 19
    Ephesians 4:14, 13, 17, 18; 5:15, 16
    Matthew 24:24
    Colossians 2:6-8
    Hebrews 10:35-39
    Hebrews 4:10-12
    Philippians 1:6, 27-29
    Matthew 7:6-12, 15
    Philippians 2:13-15
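The key above uses only the book and the chapter, so two citations from the same chapter compare as equal. A possible extension, sketched here under assumptions, folds the first verse into the same number. The `citation_key` helper and the multipliers are hypothetical. They rely on chapters and verses staying below 1000, which holds for the Bible (Psalms has 150 chapters and Psalm 119 has 176 verses).

```perl
use strict;
use warnings;

# Combine book rank, chapter, and verse into one sortable integer:
#   key = rank * 1_000_000 + chapter * 1_000 + verse
# A missing verse counts as 0, so a whole-chapter citation sorts first.
sub citation_key {
    my ($book_rank, $chapter, $verse) = @_;
    return $book_rank * 1_000_000 + $chapter * 1_000 + ($verse // 0);
}

# Acts has rank 44 in the book list above.
my $acts_2_4  = citation_key(44, 2, 4);    # 44 * 1_000_000 + 2_000 + 4
my $acts_4_29 = citation_key(44, 4, 29);   # 44 * 1_000_000 + 4_000 + 29

print "$acts_2_4 $acts_4_29\n";            # prints: 44002004 44004029
```

Inside the decorating `map`, this would replace the `10000*$books{$1}+$2` expression once the regex also captures the verse.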
