http://qs321.pair.com?node_id=1201500

QM has asked for the wisdom of the Perl Monks concerning the following question:

I ran across a blog post about an interesting word characteristic, and wondered if a regex can be written to match this (e.g., with the plan to search a dictionary file). I suspect the answer is no, without invoking the magic "code in a regex".

Unfortunately, I don't have time now to try my hand at it, but I thought I'd post it here for everyone to have a go.

Update: Possible puzzles:

1) Find the longest words where each letter used has a different frequency.
2) Find the longest words where letter frequencies are sequential (e.g., 3,4,5,6).
3) Find the longest words where letter frequencies are sequential starting from 1.

Use any dictionary you like. If you have a mahvelous dictionary, drop a link here.

-QM
--
Quantum Mechanics: The dreams stuff is made of

  • Comment on Puzzle Regex: Letter Frequency Arithmetic Sequence

Replies are listed 'Best First'.
Re: Puzzle Regex: Letter Frequency Arithmetic Sequence
by choroba (Cardinal) on Oct 17, 2017 at 15:31 UTC
    Not sure about regexes, but counting frequencies means hashes.
    #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; my $dict = '/var/lib/dict/words'; # YMMV sub each_char_has_different_freq { my ($r, $f) = @_; keys %$r == keys %$f } sub frequencies_form_a_sequence { my ($r) = @_; keys %$r == grep $r->{$_}, 1 .. keys %$r } my @linenlessnesses; open my $IN, '<', $dict or die $!; while (<$IN>) { chomp; next if /\W/; my %freq; $freq{$_}++ for split //; my %r = reverse %freq; push @linenlessnesses, $_ if each_char_has_different_freq(\%r, \%freq) && frequencies_form_a_sequence(\%r); } say for sort { length $a <=> length $b } @linenlessnesses;

    Interesting output (the longest words) on my machine:

    deadheaded evennesses keennesses peppertree rememberer restresses sanenesses sereneness sleeveless sussararas susurruses

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      "counting frequencies means hashes" - not always :)

      #!/usr/bin/perl # http://perlmonks.org/?node_id=1201500 use strict; use warnings; my $file = '/usr/share/dict/words'; # YMMV open my $fh, '<', $file or die "$! opening $file"; while(<$fh>) { /^[a-z]{6,}$/i or next; my $word = $_; my @counts; $counts[ s/$&//gi ] .= $& while s/.//; grep(!$_ || /../, @counts) or print $word; }

      I played a lot with your solution to understand it well, so first of all - thank you, choroba.

      Just my two cents: your function frequencies_form_a_sequence assumes that the frequencies start with one. The words with letter sequencies 2,3,4 etc. will not be selected then.

      Should it not be in the following form (min, max and all are from List::Util and List::MoreUtils):

      sub frequencies_form_a_sequence { my ($r) = @_; my $min = min keys %$r; return 0 if $min == 1; # Just for test. my $max = max keys %$r; return 0 if $min == $max; # Probably not a sequence. my $bool = all { defined $r->{$_} } $min .. $max; }

      I find the following words then (cannot pretend to know what they mean though :) ):

      addda ajaja alala anana arara cocco essee esses igigi nanna pappa peepe reree susus taata tatta ululu xxiii deedeed sasararas sassarara
        Why not, before the update, the challenge wasn't exactly specified, so I understood it as 1 .. n.

        I'd just use exists instead of defined in the all block.

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Puzzle Regex: Letter Frequency Arithmetic Sequence
by LanX (Saint) on Oct 17, 2017 at 15:42 UTC
    The link you gave shows an example but doesn't phrase criteria, which is unfortunate for a puzzle.

    Please see How do I post a question effectively?

    Supposing the frequency of letters has to be an ascending sequence. ..

    I'm not aware of a possibility to sum the count of matches in pure ² m// regexes so probably it's possible in tr// but those can't backtrack.

    Probably if you code 1..n matches for a fixed n into your regex and combine n look-aheads you can built an and condition. *

    This wouldn't work for arbitrary big n though.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

    *) that is find one letter exactly once and find one letter exactly twice and so on.

    already the first term is too tricky for me...

    ²) no Perl embedded

    update

    First step: What is a regex to find one letter exactly once?

    update

    Waiting for tybalt89 ... tick tock tick tock ... ;)

      A regex to find if a letter occurs exactly $n times.

      Is this what you were looking for as a first step?

      #!/usr/bin/perl # http://perlmonks.org/?node_id=1201500 use strict; use warnings; my @words = glob '{i,x}' x 6; print "@words\n"; my $n = 1; # number of times a letter occurs in a word my @oneletter = grep /^(?| (?=.*?(.)(?!.*\1)) (?: (?:(?!\1).)* \1 ){$n} (?:(?!\1).)* | (?=.*(.)) (?: (?:(?!\1).)* \1 ){$n} (?:(?!\1).)* )$ /x, @words; print "\n@oneletter\n\n", scalar @oneletter, "\n";

      Update: fails for some test cases

        Can you share some interesting test cases?

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

      Too much cheating?

      #!/usr/bin/perl # http://perlmonks.org/?node_id=1201500 use strict; use warnings; my $file = '/usr/share/dict/words'; open my $fh, '<', $file or die "$! opening $file"; chomp( my @words = grep /^[a-z]{5,}$/, <$fh>); my @good = grep frequencysequence($_), @words; print "@good\n\n" . @good . "\n"; sub frequencysequence { for my $n ( 1 .. ((-1 + sqrt 1 + 8 * length) / 2) =~ s/^(\d+)\..*/$1 + + 1/er ) { "@_\n@_" =~ / (.).* \n (?: (?:(?!\1).)* \1 ){$n} (?:(?!\1).)* $ /x or return 0; } return 1; }

      On my system outputs:

      acacia allele assays banana baobab bedded bonobo bowwow cocoon deadhea +ded deemed doodad eddied eerier effete fesses heeded horror hubbub in +ning lessee lollop mammal manana messes needed papaya peeped peeper p +epped pepper pippin powwow reefer revere rococo salsas seeded senses +sereneness settee sleeveless tattoo teeter teethe wedded weeded xxvii +i xxxvii 49
        Nice solution of the "variable length lookbehind" not allowed problem!

        But why \n instead of $ ? *

        edit

        I suppose it's necessary because the following assertion wouldn't be executed?

        update

        *) well, I just noticed the "@_\n@_" part where you are doubling the input, which explains the "\n" and some other confusion about pos.

        I agree, "too much cheating!" ;-p

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

      ...sum the count of matches...

      There's always the brute force method of generating every possible regex, assuming arithmetic sequence starting at 1. For a max frequency of 6, this covers 21 letter words. It makes me wonder what the breaking point is of my machine.

      Note that in (3), only lengths that are triangular will do.

      I'll have to find some time to chase this -- perhaps tomorrow.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        I just remembered that Perl supports recursive regexes and relative back-references, so you might not be limited to a hard limit.

        Unfortunately I don't have the time to play with it.

        FWIW if using s///g was allowed you just needed a pattern which repeatedly deleted one character of each group, iff exactly one of them is unique.

        Your criterion was met, if the string is empty afterwards.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

Re: Puzzle Regex: Letter Frequency Arithmetic Sequence
by kcott (Archbishop) on Oct 18, 2017 at 06:06 UTC

    G'day QM,

    TMTOWTDI. Here's "pm_1201500_dict_char_ordered_count.pl":

    #!/usr/bin/env perl use strict; use warnings; use constant MIN_LENGTH => 10; # 1+2+3+4 RECORD: while (<>) { next RECORD unless MIN_LENGTH < length; next RECORD unless /^([a-z]+)$/; my %char_count_of; $char_count_of{substr $1, $_, 1}++ for 0 .. length($1) - 1; my $check = 1; for (sort { $a <=> $b } values %char_count_of) { next RECORD unless $_ eq $check++; } print; }

    Looking for words with ten characters or more:

    $ pm_1201500_dict_char_ordered_count.pl /usr/share/dict/words beerbibber chachalaca isoosmosis kotukutuku rememberer sereneness sleeveless

    Changing MIN_LENGTH:

    use constant MIN_LENGTH => 6; # 1+2+3

    I get all of those ten-character words, interspersed amongst over a hundred six-character words, starting with

    $ pm_1201500_dict_char_ordered_count.pl /usr/share/dict/words | head - +5 allele amamau ananas ananda annona

    and ending with

    $ pm_1201500_dict_char_ordered_count.pl /usr/share/dict/words | tail - +5 ubussu venene wedded weeded weewow

    — Ken

Re: Puzzle Regex: Letter Frequency Arithmetic Sequence -- oneliner
by Discipulus (Canon) on Oct 18, 2017 at 08:33 UTC
    ThereIsMoreThanOne.. oneliner to do this!

    Me too I'd use hash instead of regexes.

    The problem is that you'll end with deadheaded rococo messes or with at least with an ugly fruit salad of ananas patata papaya batata acacia pepper

    perl -MList::Util="all" -lne "next unless/.../;my(%r,$c);$r{$_}++for(/ +./g);print if all{++$c==$_}sort values %r" linux.words # or minimum 10 char words only perl -MList::Util="all" -lne "next unless/.{10,}/;my(%r,$c);$r{$_}++fo +r(/./g);print if all{++$c==$_}sort values %r" linux.words

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Puzzle Regex: Letter Frequency Arithmetic Sequence
by tybalt89 (Monsignor) on Oct 20, 2017 at 14:26 UTC

    Just a tiny little cheat :)

    #!/usr/bin/perl # http://perlmonks.org/?node_id=1201500 use strict; use warnings; $| = 1; my $file = '/usr/share/dict/words'; open my $fh, '<', $file or die "$! opening $file"; while(<$fh>) { chomp; /[^a-z]/ and next; "$_\n$_" =~ /^ (?= .* (.) .* \n (?: (?:(?!\1).)* \1 ){1} (?:(?!\1).) +* $ ) (?= .* (.) .* \n (?: (?:(?!\2).)* \2 ){2} (?:(?!\2).) +* $ ) (?= .{3,3} \n | .* (.) .* \n (?: (?:(?!\3).)* \3 ){3} (?:(?!\3).) +* $ ) (?= .{3,6} \n | .* (.) .* \n (?: (?:(?!\4).)* \4 ){4} (?:(?!\4).) +* $ ) (?= .{3,10} \n | .* (.) .* \n (?: (?:(?!\5).)* \5 ){5} (?:(?!\5).) +* $ ) (?= .{3,15} \n | .* (.) .* \n (?: (?:(?!\6).)* \6 ){6} (?:(?!\6).) +* $ ) (?= .{3,21} \n | .* (.) .* \n (?: (?:(?!\7).)* \7 ){7} (?:(?!\7).) +* $ ) (?= .{3,28} \n | .* (.) .* \n (?: (?:(?!\8).)* \8 ){8} (?:(?!\8).) +* $ ) (?= .{3,35} \n ) /x and print "$_ "; } print "\n";

        I had done that deliberately because I considered them "uninteresting". Easily fixed.

        #!/usr/bin/perl # http://perlmonks.org/?node_id=1201500 use strict; use warnings; $| = 1; my $file = '/usr/share/dict/words'; open my $fh, '<', $file or die "$! opening $file"; while(<$fh>) { chomp; /[^a-z]/ and next; "$_\n$_" =~ /^ (?= .* (.) .* \n (?: (?:(?!\1).)* \1 ){1} (?:(?!\1).) +* $ ) (?= .{1,1} \n | .* (.) .* \n (?: (?:(?!\2).)* \2 ){2} (?:(?!\2).) +* $ ) (?= .{1,3} \n | .* (.) .* \n (?: (?:(?!\3).)* \3 ){3} (?:(?!\3).) +* $ ) (?= .{1,6} \n | .* (.) .* \n (?: (?:(?!\4).)* \4 ){4} (?:(?!\4).) +* $ ) (?= .{1,10} \n | .* (.) .* \n (?: (?:(?!\5).)* \5 ){5} (?:(?!\5).) +* $ ) (?= .{1,15} \n | .* (.) .* \n (?: (?:(?!\6).)* \6 ){6} (?:(?!\6).) +* $ ) (?= .{1,21} \n | .* (.) .* \n (?: (?:(?!\7).)* \7 ){7} (?:(?!\7).) +* $ ) (?= .{1,28} \n | .* (.) .* \n (?: (?:(?!\8).)* \8 ){8} (?:(?!\8).) +* $ ) (?= .{1,35} \n ) /x and print "$_ "; } print "\n";

        On my machine, this outputs:

        a aah acacia add aha aka all allele app arr ass assays b baa banana ba +obab bbl bedded bee bib bob bonobo boo bowwow brr bub c cocoon coo d +dad dds deadheaded deemed did doodad dud e ebb eddied eek eel eerier +eff effete egg eke ell ere err eve ewe eye f fee fesses foo g gag gee + gig goo h heeded hmm horror hubbub huh i ill inn inning j k l lee le +ssee lii lollop loo m mam mammal manana messes mom moo mum n nee need +ed non nun o odd off oho ooh opp p pap papaya pee peeped peeper pep p +epped pepper pip pippin poo pop powwow ppm ppr pup q r reefer revere +rococo s salsas see seeded senses sereneness settee shh sis sleeveles +s sqq ssh t tat tattoo tee teeter teethe tit too tot tut u usu v vii +w wedded wee weeded woo wow x xii xix xxi xxv xxviii xxxvii y z zoo
Re: Puzzle Regex: Letter Frequency Arithmetic Sequence
by LanX (Saint) on Oct 20, 2017 at 14:38 UTC
    Newer Perls have some fancy features in the regex engine (apart from embedding Perl), so probably it could be solved.

    But be warned, I wouldn't be surprised if a solution to this "puzzle" would proof the engine to be Turing complete.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!