PerlMonks  

Re: String Comparison & Equivalence Challenge

by bliako (Monsignor)
on Mar 14, 2021 at 09:50 UTC ( [id://11129607] )


in reply to String Comparison & Equivalence Challenge

Regarding preserving the similarities, a sparse (2D) matrix can be used: either take one from a CPAN module, e.g. Math::SparseMatrix and others, or simply emulate one using a 2D hash.
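A minimal sketch of the 2D-hash option, assuming made-up phrases and scores. Only the pairs that were actually compared take up storage, and looking up a pair that was never stored returns undef. The helpers set_sim and get_sim are hypothetical names for illustration, not part of any module.

```perl
use strict;
use warnings;

# sparse "matrix": only the cells that were set exist in the hash
my %S;

# hypothetical helper: store a similarity score for the pair (A, B)
sub set_sim {
    my ($A, $B, $value) = @_;
    $S{$A}{$B} = $value;
}

# hypothetical helper: fetch a score, or undef if the pair was never stored;
# checking the outer key first avoids autovivifying an empty row
sub get_sim {
    my ($A, $B) = @_;
    return unless exists $S{$A} && exists $S{$A}{$B};
    return $S{$A}{$B};
}

set_sim('in the beginning', 'at the beginning', 0.8);
set_sim('in the beginning', 'in the end',       0.5);

my $hit  = get_sim('in the beginning', 'at the beginning');
my $miss = get_sim('let there be light', 'in the end');

print "hit: $hit\n";
print "miss: ", (defined $miss ? $miss : 'not stored'), "\n";
print "rows stored: ", scalar(keys %S), "\n";
```

Testing the outer key before the inner one matters: a bare exists $S{$A}{$B} would quietly create an empty $S{$A} row for every phrase that is merely looked up.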

Similarities can be at different levels with different metrics: exact phrase, re-arranged phrase, similar words, similar sentiment. But why select one of these when you can use them all in a multi-dimensional similarity index? Something like this (totally untested):

    use List::Util qw(reduce);

    # store similarities as a sparse matrix, emulated with a 2-level hash
    my $S = {};
    # metric weights, all 1's means not weighted, usually sum-of-weights=1
    my $W = {'metric1' => 1, 'metric2' => 1, 'metric3' => 1};

    # get a list of similarity values as a hash, keyed on metric names
    my $sims = similarity($phrase1, $phrase2);
    # get the most similar to phrase1
    my $most = most_similar($phrase1);
    print "most similar to '$phrase1' is ".$most->{'phrase'}."\n";

    # main entry to finding similarity between phrases A and B
    sub similarity {
      my ($A, $B) = @_;
      # useless negation to satisfy certain monks' pet peeve
      if( ! exists($S->{$A}) || ! exists($S->{$A}->{$B}) ){
        # the keys here must match the keys of $W for the weighted sum below
        $S->{$A}->{$B} = {
          'metric1' => metric1($A,$B),
          'metric2' => metric2($A,$B),
          'metric3' => metric3($A,$B),
        };
        # this is a weighted similarity, it's a rough 1D metric based
        # on all other metrics.
        my $weighted = 0;
        $weighted += $W->{$_} * $S->{$A}->{$B}->{$_} for keys %$W;
        $S->{$A}->{$B}->{'weighted'} = $weighted;
      }
      return $S->{$A}->{$B}
    }

    # calculate similarity between phrases A and B using metric1
    # (metric2 and metric3 follow the same pattern)
    sub metric1 {
      my ($A,$B) = @_;
      ...; # placeholder: return a real number, e.g. 3.5
    }

    # find the phrase already compared with A that scores highest on
    # the given metric (defaults to the weighted one)
    sub most_similar {
      my ($A, $metric_name) = @_;
      if( ! defined($metric_name) || ! exists($W->{$metric_name}) ){ $metric_name = 'weighted' }
      my $w = $S->{$A};
      my $max_sim_phrase = List::Util::reduce {
        $w->{$b}->{$metric_name} > $w->{$a}->{$metric_name} ? $b : $a
      } keys %$w;
      my $max_sim_value = $w->{$max_sim_phrase}->{$metric_name};
      return {
        'phrase' => $max_sim_phrase,
        'value' => $max_sim_value
      }
    }
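To make the sketch above concrete, here is one way it could be filled in so that it actually runs. The two metrics are stand-ins picked purely for illustration (a case-insensitive exact match and a Jaccard overlap of word sets), and the weights and phrases are made up. None of this is a method proposed in the thread, just an assumption to get real numbers flowing through the index.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

my %S;                                    # sparse 2-level hash of results
my %W = (exact => 0.5, jaccard => 0.5);   # assumed weights, summing to 1

# stand-in metric: 1 if the phrases are identical ignoring case, else 0
sub exact { my ($A, $B) = @_; return lc($A) eq lc($B) ? 1 : 0 }

# stand-in metric: Jaccard overlap of the two sets of lowercased words
sub jaccard {
    my ($A, $B) = @_;
    my %in_a = map { $_ => 1 } split ' ', lc $A;
    my %in_b = map { $_ => 1 } split ' ', lc $B;
    my $common = grep { $in_b{$_} } keys %in_a;
    my $union  = keys(%in_a) + keys(%in_b) - $common;
    return $union ? $common / $union : 0;
}

# metric name => code ref; the names must match the keys of %W
my %metric = (exact => \&exact, jaccard => \&jaccard);

# compute (once) and cache every metric plus the weighted sum
sub similarity {
    my ($A, $B) = @_;
    unless (exists $S{$A} && exists $S{$A}{$B}) {
        my %sim = map { $_ => $metric{$_}->($A, $B) } keys %metric;
        my $weighted = 0;
        $weighted += $W{$_} * $sim{$_} for keys %W;
        $S{$A}{$B} = { %sim, weighted => $weighted };
    }
    return $S{$A}{$B};
}

# phrase already compared with $A that scores highest on the given metric
sub most_similar {
    my ($A, $name) = @_;
    $name = 'weighted' unless defined $name && exists $W{$name};
    my $row  = $S{$A} or return;          # nothing compared with $A yet
    my $best = reduce { $row->{$b}{$name} > $row->{$a}{$name} ? $b : $a } keys %$row;
    return { phrase => $best, value => $row->{$best}{$name} };
}

my $phrase = 'in the beginning God created';
similarity($phrase, $_) for 'In the beginning God made', 'and the earth was void';
my $most = most_similar($phrase);
print "most similar to '$phrase' is '$most->{phrase}' ($most->{value})\n";
```

Note that most_similar only sees phrases that have already gone through similarity(), since it reads from the cache rather than from a corpus.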

Edit: P.S. Stemming this ancient form of English can be a challenge, as stemming relies on pre-trained models. Using the ancient greek bible text could be even more challenging finding models.

bw, bliako

Re^2: String Comparison & Equivalence Challenge
by LanX (Saint) on Mar 14, 2021 at 17:52 UTC
    > Using the ancient greek bible text could be even more challenging finding models.

    Only the New Testament is originally in Greek; the Old one is in Hebrew and Aramaic, AFAIK.

    Please correct me.

    Cheers Rolf
    (addicted to the Perl Programming Language :)

      I had no idea, but it's reasonable (edit: what you say).
