retain longest multi words units from hash

by Anonymous Monk
on Jul 28, 2018 at 10:42 UTC ( [id://1219394]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks

I have a hash containing multi-word units and their frequencies. I would like to retain only the longest multi-word units, discarding the units that are contained in longer ones. So, for example:

$VAR1 = {
  'rendition' => '3',
  'automation' => '2',
  'saturation' => '3',
  'mass creation' => 2,
  'automation technology' => 2
  'automation technology process' => 3
};

I should get

$VAR1 = {
  'rendition' => '3',
  'saturation' => '3',
  'mass creation' => 2,
  'automation technology process' => 3
};

Do you know of any module that can achieve this with quite long hashes?

Replies are listed 'Best First'.
Re: retain longest multi words units from hash
by roboticus (Chancellor) on Jul 28, 2018 at 11:55 UTC

    At first glance, you'd think this would be a trivial problem, with a relatively straightforward solution like:

    Build new empty hash 'output'
    Build a list of keys from longest to shortest
    For each key in the list
        If key not a fragment of a key in the output hash
            Add key to output hash
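
    In Perl, that outline might look roughly like the sketch below. It assumes "fragment of" means a whole-word substring match, which is exactly the kind of assumption questioned next:

    use strict;
    use warnings;
    use Data::Dump 'dd';

    my %input = (
        'rendition'                     => 3,
        'automation'                    => 2,
        'saturation'                    => 3,
        'mass creation'                 => 2,
        'automation technology'         => 2,
        'automation technology process' => 3,
    );

    my %output;
    for my $key (sort { length $b <=> length $a } keys %input) {
        # keep $key only if no already-kept (longer) key contains it as a run of whole words
        my $contained = grep { /(?:^| )\Q$key\E(?: |$)/ } keys %output;
        $output{$key} = $input{$key} unless $contained;
    }

    dd \%output;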

    However, it feels like the problem isn't fully specified. How do you recognize units contained within other units? Is it simply a substring match?

    Often, when I see problems like this, I try to "break" them by contriving corner cases to throw into the mix. When I saw your problem, I came up with:

    my %phrase_counts = (
        'rendition' => '3',
        'automation' => '2',
        'saturation' => '3',
        'mass creation' => 2,
        'automation technology' => 2,
        'automation technology process' => 3,
        'technology process' => 5,
        'automation process' => 2,
    );

    I added two phrases so that we have something like 'A B C', 'A B', 'A C', and 'B C'. Obviously 'A B' and 'B C' are contained in 'A B C' and should be discarded. But what about 'A C', should that be considered to be contained in 'A B C'? What about 'A C B', 'C B'?

    If you don't think about these corner cases, then you can wind up in a frustrating cycle: you come up with an algorithm, code and test it, and submit it. Then the program runs OK for a bit, only to get a bug report against it when they come up with a corner case. You then update your algorithm, code and test it, submit it again. Then another bug report comes in.

    If you try to come up with ugly situations, you can save yourself a lot of time by asking for clarification for those special cases beforehand. Another benefit of coming up with special cases is that it can provide hints towards coming up with better algorithms--which may also prompt you to come up with a few more special cases.

    I can think of several variations of this problem and solutions to the variations, but don't know which one(s) to suggest.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Thank you, roboticus. You are 100% right: the task needs more specification to be well defined. I'll take this suggestion to heart. I'll take a short break to think through the corner cases and then come up with my Perl solution, so I can get your opinion on it.

Re: retain longest multi words units from hash
by tybalt89 (Monsignor) on Jul 28, 2018 at 15:21 UTC
    #!/usr/bin/perl
    # https://perlmonks.org/?node_id=1219394
    use strict;
    use warnings;
    use Data::Dump 'dd';

    my %phrase_counts = (
        'rendition' => '3',
        'automation' => '2',
        'saturation' => '3',
        'mass creation' => 2,
        'automation technology' => 2,
        'automation technology process' => 3,
        'technology process' => 5,
        'automation process' => 2,
        'process automation' => 2,
    );

    dd 'before', \%phrase_counts;

    my $words = '';
    for my $w ( sort { length $b <=> length $a } keys %phrase_counts ) {
        if ( $words =~ join '.*?', map "\\b$_\\b", split ' ', $w ) {
            delete $phrase_counts{$w};
        }
        else {
            $words .= "$w\n";
        }
    }

    dd 'after', \%phrase_counts;
      Greetings King of RegEx-Obfuscation! ;-)

      What would be the average length of $words for 1 million entries? 10 MB?

      Cheers Rolf
      (addicted to the Perl Programming Language :)

      ==== Update

      You can improve performance if you sort by the number of whitespaces first and only add to $words in chunks of strings with the same number of whitespaces.

      There is no point checking whether an n-word string is included in another n-word string.

      In particular, you can stop searching once you reach the one-word strings.
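
      A rough sketch of that chunking idea, reusing tybalt89's %phrase_counts and containment test from above (not benchmarked, so treat the expected speedup as a guess):

      # group the phrases by word count
      my %by_count;
      for my $phrase (keys %phrase_counts) {
          my @w = split ' ', $phrase;
          push @{ $by_count{ scalar @w } }, $phrase;
      }

      my $words = '';
      for my $n (sort { $b <=> $a } keys %by_count) {    # longest word counts first
          # an n-word phrase can only be contained in a phrase with more words,
          # so match only against the chunks accumulated so far
          for my $phrase (@{ $by_count{$n} }) {
              if ($words =~ join '.*?', map "\\b$_\\b", split ' ', $phrase) {
                  delete $phrase_counts{$phrase};
              }
          }
          # append the whole n-word chunk only after it has been checked
          $words .= "$_\n" for grep { exists $phrase_counts{$_} } @{ $by_count{$n} };
      }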

      Your interpretation of the question is different from mine.

      DB<50> p 'a b c' =~ join '.*?', map "\\b$_\\b", split ' ', 'a c'
      1

      You seem to check for ordered subsets (the same words in the same order, with gaps allowed), while I only look for contiguous subsequences like 'a b' and 'b c'.

      Cheers Rolf
      (addicted to the Perl Programming Language :)

        Don't you just love ambiguous specs? :)

Re: retain longest multi words units from hash
by kcott (Archbishop) on Jul 29, 2018 at 10:04 UTC

    Here's a solution that takes your posted input and creates your expected output.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Data::Dump;

    my %data = (
        'rendition' => '3',
        'automation' => '2',
        'saturation' => '3',
        'mass creation' => 2,
        'automation technology' => 2,
        'automation technology process' => 3,
    );

    dd \%data;

    for my $multi_key (grep y/ / / != 0, keys %data) {
        next unless exists $data{$multi_key};
        for my $any_key (keys %data) {
            next if $any_key eq $multi_key;
            delete $data{$any_key} if 0 == index $multi_key, $any_key;
        }
    }

    dd \%data;

    Output:

    { "automation" => 2, "automation technology" => 2, "automation technology process" => 3, "mass creation" => 2, "rendition" => 3, "saturation" => 3, } { "automation technology process" => 3, "mass creation" => 2, "rendition" => 3, "saturation" => 3, }

    I saw your post (in isolation) before I logged in, thought it looked like an interesting problem, and wrote my solution before looking at any other replies. My code doesn't use any regexes or sorting, which may help efficiency (see the final dot-point below for a clarification of that statement); any similarities to components used in other solutions are purely coincidental (although perhaps not surprising, e.g. you'll see delete used quite a bit).

    I had a few issues with your spec, and I now see I'm not alone. Again, some of these points may already have been raised.

    • Your OP wording has "discarding the units that are contained in the longest ones"; however, your example only discards keys (units) that appear at the start of the longest one. How, for instance, would you deal with a key like 'partial saturation', given that it contains, but does not start with, the existing key 'saturation'?
    • You appear to assume that there can only be one longest key. How would you want to deal with, for example, 'automation technology special' (given the existing key 'automation technology process', which has the same length)?
    • I interpret the spec as meaning that the two keys 'rendition' and 'renditions' would both be kept. Is that indeed what you want?
    • You wrote "quite long hashes"; unfortunately, that's rather vague. Descriptors such as "quite long", "fairly short" and so on, are highly subjective: often relative to the problem domain and the writer's experience. Actual numbers are much better; including other numbers such as available memory, disk size, etc. is better still (assuming they're relevant).
    • There's another issue with "quite long hashes". Does this refer to the number of key-value pairs or the actual length of the keys? Your examples suggest the values are all just small integers so that's not a size issue: is that correct?
    • Another piece of information, that would be useful to know, is the percentage of "multi word" keys. The single word keys are not actually part of the processing (beyond possibly being deleted). If your hash contains a million keys, do we need to process 10, 1,000, 100,000 or 999,999 multi word keys? This could make a difference to how we formulate a solution. I wrote (above) "... doesn't use any ... sorting which may help efficiency": the information about how many multi word keys there are could help to determine if adding sorting would, in fact, result in better efficiency; it could also help in deciding what type of sorting would be best and where such sorting might occur.

    Update (typo): s/beyong/beyond/

    — Ken

Re: retain longest multi words units from hash
by tobyink (Canon) on Jul 29, 2018 at 16:52 UTC

    People have posted other solutions. This is probably no better and no worse…

    use strict;
    use warnings;
    use Data::Dump qw/dd/;

    my %phrases = (
        'rendition' => '3',
        'automation' => '2',
        'saturation' => '3',
        'mass creation' => 2,
        'automation technology' => 2,
        'automation technology process' => 3,
        'technology process' => 5,
        'automation process' => 2,
    );

    sub filter_wordlist_thing {
        my %output = %{+shift};
        for my $key (grep / /, keys %phrases) {
            my @words = split / /, $key;
            my @word_combos =
                grep $_ ne $key,
                map join(" ", @words[$_->[0] .. $_->[1]]),
                map { my $start = $_; map [$start, $_], $start .. $#words } 0 .. $#words;
            delete @output{@word_combos};
        }
        \%output;
    }

    dd filter_wordlist_thing(\%phrases);

        Yeah, pretty similar, but doesn't impose a predetermined maximum word length on the hash keys. I also optimize by skipping single-word hash keys.

Re: retain longest multi words units from hash
by LanX (Saint) on Jul 28, 2018 at 11:32 UTC
    why not:
    • loop over the keys
    • split multi-words into a list
    • if that list contains more than one word:
      delete single words in that list,
    Hint: you can also delete hash slices.
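
    For example, a tiny illustration of that hint (a made-up %hash, just to show the slice syntax):

    my %hash = ( 'automation' => 2, 'technology' => 5, 'automation technology' => 2 );
    delete @hash{ 'automation', 'technology' };   # one statement removes both keys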

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

    update

    OK, I see the problem; this only partially solves your case:

    use strict;
    use warnings;
    use Data::Dump qw/pp dd/;

    my %hash = (
        'rendition' => '3',
        'automation' => '2',
        'saturation' => '3',
        'mass creation' => 2,
        'automation technology' => 2,   # why did you post broken code with a missing comma?
        'automation technology process' => 3
    );

    delete @hash{ map { split / / } grep { / / } keys %hash };

    print pp \%hash;
    { "automation technology" => 2, "automation technology process" => 3, "mass creation" => 2, "rendition" => 3, "saturation" => 3, }

      Another hint - sort the keys in order of ascending length, so you have all the short words ready for deletion in the hash when the longer words get examined.

        The other way round, yes. (delete short words)

        But it's more complicated: he wants to delete 2-word groups if they are contained in a (2+n)-word group.

        Cheers Rolf
        (addicted to the Perl Programming Language :)

      OK, supposing you have at most 4 words in a group (otherwise expand $max_del):

      use strict;
      use warnings;
      use Data::Dump qw/pp dd/;
      use feature 'say';

      my %hash = (
          'rendition' => '3',
          'automation' => '2',
          'saturation' => '3',
          'mass creation' => 2,
          'automation technology' => 2,
          'automation technology process' => 3
      );

      # prepare sub-slices
      my @slices;
      my $max_del = 3;

      for my $del ( 0 .. $max_del ) {
          for my $i ( 0 .. $del ) {
              for my $l ( $i .. $del ) {
                  push @{ $slices[$del] }, [ $i .. $l ]
                      unless $i == 0 and $l == $del;    # exclude identity
              }
          }
      }

      sub partitions {
          my $group = shift;
          my @words = split / /, $group;
          return map { join " ", @words[@$_] } @{ $slices[$#words] };
      }

      delete @hash{ partitions($_) } for keys %hash;

      say "Result: ", pp \%hash;

      say "# --- tests";
      say '@slices = ', pp \@slices;
      say "partitions('automation technology process limit') = ",
          pp partitions('automation technology process limit');

      Result: {
        "automation technology process" => 3,
        "mass creation" => 2,
        "rendition" => 3,
        "saturation" => 3,
      }
      # --- tests
      @slices = [
        [],
        [[0], [1]],
        [[0], [0, 1], [1], [1, 2], [2]],
        [[0], [0, 1], [0, 1, 2], [1], [1, 2], [1, 2, 3], [2], [2, 3], [3]],
      ]
      partitions('automation technology process limit') = (
        "automation",
        "automation technology",
        "automation technology process",
        "technology",
        "technology process",
        "technology process limit",
        "process",
        "process limit",
        "limit",
      )

      Cheers Rolf
      (addicted to the Perl Programming Language :)

      Thank you for the suggestion (PS: sorry for the missing comma!). This approach is interesting, but it still doesn't solve the problem of recursion, i.e. deleting a 2-word unit contained in a 3-word unit or, for that matter, a 3-word unit contained in a 4-word unit.

      Probably:

      • sort keys small to big
      • loop over the keys
      • check whether each key is contained in any other (longer) key; if yes, delete it, if not, retain it

      This approach should work, but I am not sure whether it makes sense performance-wise with big hashes.
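
      A quick sketch of those three steps, assuming whole-word containment (so 'automation tech' is not deleted because of 'automation technology') and a %phrases hash like the ones posted above:

      my @short_to_long = sort { length $a <=> length $b } keys %phrases;

      KEY: for my $i ( 0 .. $#short_to_long ) {
          my $short = $short_to_long[$i];
          for my $j ( $i + 1 .. $#short_to_long ) {
              # only longer (or equal-length) keys can contain $short
              if ( $short_to_long[$j] =~ /(?:^| )\Q$short\E(?: |$)/ ) {
                  delete $phrases{$short};
                  next KEY;
              }
          }
      }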

        > This approach should work, but I am not sure whether it makes sense performance-wise with big hashes

        I had the same idea; performance depends on the nature of your data. After sorting you'd have O(n²), i.e. n*(n+1)/2 == (n**2 + n)/2 comparisons of smaller keys against bigger keys.*

        NB: you need to specify what "words" are, and be careful not to delete "automation tech" because of "automation technology".

        In my approach you'd calculate all subgroups of a key (5 for "automation technology process") and delete them.

        Performance depends then on the number of long groups you have.

        As already mentioned, I don't know the nature of your data²...

        Cheers Rolf
        (addicted to the Perl Programming Language :)

        *) This can be improved by preparing a lookup table for single words: "automation technology" would be listed under "automation" and "technology", so as soon as you see "automation technology process" you only need to check the entries in the lookup table (a rough sketch follows at the end of this reply).

        ²) When looking for algorithms, check out the mathematics of posets.

        UPDATE

        I think we had the same question here already, better check the archives.
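
        A rough sketch of the lookup table from footnote *), with %phrases just assumed from the snippets above:

        # map every single word to the multi-word keys it occurs in
        my %contains;
        for my $key ( grep { / / } keys %phrases ) {
            push @{ $contains{$_} }, $key for split ' ', $key;
        }

        # a candidate key now only has to be compared against the keys
        # listed under its own words, not against the whole hash
        my $candidate = 'automation technology process';
        my %seen;
        my @to_check = grep { !$seen{$_}++ and $_ ne $candidate }
                       map  { @{ $contains{$_} || [] } }
                       split ' ', $candidate;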

Re: retain longest multi words units from hash
by shmem (Chancellor) on Jul 29, 2018 at 09:56 UTC
    Do you know of any module that can achieve this with quite long hashes?

    update 2: the solution below the 1st update doesn't weed out the key 'technology', which is found as a partial key in 'automation technology' and 'automation technology process'. B_TREE partial key matching isn't useful here, since the match starts at the beginning of the key. DB_File is good for storing huge simple hashes and is quite fast, so that statement remains.

    Here's a solution:

    use strict;
    use warnings;
    use Data::Dump;

    my %h = (
        'rendition' => '3',
        'automation' => '2',
        'saturation' => '3',
        'mass creation' => 2,
        'technology' => 5,
        'technology process' => 5,
        'process' => 6,
        'creation' => 7,
        'saturation process' => 9,
        'automation technology' => 2,
        'automation technology process' => 3,
    );

    dd 'before', \%h;

    for (keys %h) {
        if (/\s/) {
            my @l = split ' ';
            for my $k ( map { my $i = $_; map { join " ", @l[$i..$_] } $i .. $#l } 0 .. $#l ) {
                delete $h{$k} unless $k eq $_;
            }
        }
    }

    dd 'after', \%h;

    __END__
    (
      "before",
      {
        "automation" => 2,
        "automation technology" => 2,
        "automation technology process" => 3,
        "creation" => 7,
        "mass creation" => 2,
        "process" => 6,
        "rendition" => 3,
        "saturation" => 3,
        "saturation process" => 9,
        "technology" => 5,
        "technology process" => 5,
      },
    )
    (
      "after",
      {
        "automation technology process" => 3,
        "mass creation" => 2,
        "rendition" => 3,
        "saturation process" => 9,
      },
    )

    Skip the rest of this reply.

    update: wrong, scratch that, we need to use B_TREE with partial match; wait for next update...

    DB_File, using the DB_BTREE format, is handy for that. Since keys are stored in lexical order, it suffices to iterate over the keys and delete the previous key if it matches the current one:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;
    use File::Temp qw(:POSIX);
    use Data::Dumper;
    $Data::Dumper::Indent = 1;

    my $filename = tmpnam();    # temporary file
    my %h;                      # tied DB_File hash

    tie %h, "DB_File", $filename, O_RDWR|O_CREAT, 0666, $DB_BTREE
        or die "Cannot open $filename: $!\n";

    %h = (
        'rendition' => '3',
        'automation' => '2',
        'saturation' => '3',
        'mass creation' => 2,
        'automation technology' => 2,
        'automation technology process' => 3,
    );

    my $prev;
    for (keys %h) {
        delete $h{$prev} if $prev && /^$prev/;
        $prev = $_;
    }

    print Dumper(\%h);

    # cleanup
    untie %h;
    unlink $filename;

    __END__
    $VAR1 = {
      'automation technology process' => '3',
      'mass creation' => '2',
      'rendition' => '3',
      'saturation' => '3'
    };

    DB_File uses a file on disk, so quite long hashes are possible. Since keys %h in a for loop just iterates over the keys, there's no need to retrieve all the keys and sort them: the BTREE has them sorted already.

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: retain longest multi words units from hash
by atcroft (Abbot) on Jul 30, 2018 at 01:37 UTC

    Here is my attempt at a solution (also trying my hand at POD (which I badly need to learn!)). Comments and corrections appreciated, as always.

    Solution overview: This particular solution attempt takes all loaded entries, processing them in key length (smallest first) then string order. Each key is split on non-word characters. If no non-word characters are present, the entry is copied from the originating hash to the working hash. If one or more non-word characters are present, then the possible partial keys are generated. For each partial key, if an entry for the partial key exists in the working hash, it is deleted. Once the partial keys have been processed, the current entry is copied to the working hash.
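
    That overview might translate into something like the sketch below (this is not the posted code; it assumes a %phrases hash like the ones in the other replies and keys that split cleanly on single spaces):

    my %work;
    for my $key ( sort { length $a <=> length $b || $a cmp $b } keys %phrases ) {
        my @parts = split /\W+/, $key;
        if ( @parts > 1 ) {
            # generate every contiguous partial key and drop it from the working hash
            for my $i ( 0 .. $#parts ) {
                for my $j ( $i .. $#parts ) {
                    my $partial = join ' ', @parts[ $i .. $j ];
                    delete $work{$partial} if $partial ne $key;
                }
            }
        }
        $work{$key} = $phrases{$key};
    }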

    Test results:

    Code:

    Hope it helps.

Re: retain longest multi words units from hash
by Anonymous Monk on Jul 28, 2018 at 15:02 UTC

    This is what I managed to do after reading the various answers. I consider a multi-word unit (MWU) to be contained in a bigger one only if the sequence of words is the same (so 'technology process' is contained in 'automation technology process', but 'automation process' is not). I cannot think of any other constraints.

    Be aware, my code is very verbose....

    use strict;
    use warnings;
    use Data::Dump qw/pp dd/;

    my %output;
    my %phrases = (
        'rendition' => '3',
        'automation' => '2',
        'saturation' => '3',
        'mass creation' => 2,
        'automation technology' => 2,
        'automation technology process' => 3,
        'technology process' => 5,
        'automation process' => 2,
    );

    foreach my $keyPhrases ( sort hashKeyLength ( keys(%phrases) ) ) {
        my $TermToBeRetained = 1;
        print "\nWorking on * $keyPhrases *\n";
        foreach my $keyOutput ( keys %output ) {
            print "\tChecking if * $keyPhrases * is contained in * $keyOutput *: ";
            if ( $keyOutput =~ /$keyPhrases/ ) {
                print "Yes\n";
                $TermToBeRetained = 0;
                last;
            }
            else {
                print "No!\n";
                $TermToBeRetained = 1;
            }
        }
        if ( $TermToBeRetained eq 1 ) {
            print "Retaining $keyPhrases\n";
            $output{$keyPhrases} = $phrases{$keyPhrases};
        }
        else {
            print "Skipping $keyPhrases\n";
        }
    }

    print "MY RESULT\n";
    print pp \%output;

    sub hashKeyLength {
        count_units_in_phrase($b) <=> count_units_in_phrase($a) || $b cmp $a
    }

    sub count_units_in_phrase {
        my $string = shift;
        my $num;
        $num++ while $string =~ /\S+/g;
        return $num;
    }

    What do you think of it?

    Discipulus added code tags 28 July 2018

      Put code tags around it.

        Unfortunately I can not edit my post

Re: retain longest multi words units from hash
by PatrickofJohn (Novice) on Jul 29, 2018 at 20:52 UTC
    Clearly an XY problem as the sample data makes no sense.

      How the heck did your sample hash get populated with a larger count for the longest string ('automation technology process') than for its substrings ('automation' and 'automation technology')?

    What are you really asking for here?

      The example was obviously handmade and not copied, hence the missing comma.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
