The first thing you need to do is define how you, as a human being, would judge the similarity of the sets.
For example, you start with a set (A), and you make an exact copy (B). You will (presumably) judge these as very similar.
- What if you remove one of the phrases? Are they still similar?
If the original set contains 100 phrases, and you remove phrases one at a time from the duplicate, does the similarity drop linearly?
- What if you reversed the words in all of the phrases in the second set? Is it still very similar, or completely dissimilar?
Is the ordering of the words within a phrase important?
- How about if you removed one word from each phrase in the second set?
Do the phrases need to be exactly the same to be counted as similar?
- How about if you looked up each word in a thesaurus and substituted the nearest alternative word? Similar? Dissimilar?
Are you looking for semantic similarity?
- How about if you misspelled every word by one character -- an omission, an insertion, or a transposition? Similar? Dissimilar?
Can typos occur? Is it possible for you to correct them?
- How about if you reversed the ordering of the phrases in the second set? Similar? Dissimilar?
Are the sets ordered or unordered?
- What if one set consists entirely of "large blue woolen jumper" and the other of "Angora sweater, navy, XL"? Similar? Dissimilar?
Semantics again.
Once you've decided how you would make the judgement, then you stand some chance of being able to lay out a set of rules. And once you have that, you can start to look for a good way to implement them.
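To make that concrete, here is a minimal sketch (in Python, purely for illustration) of how two different answers to the questions above turn into two different measures. It assumes the sets are unordered and uses Jaccard similarity, which is only one possible choice. The names `jaccard`, `exact`, `bag_of_words` and `similarity` are made up for this example.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as unordered sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def exact(phrase):
    # Rule: phrases must match character for character.
    return phrase


def bag_of_words(phrase):
    # Rule: case and word order within a phrase do not matter.
    return frozenset(phrase.lower().split())


def similarity(set_a, set_b, normalize=exact):
    # Apply the chosen rule to every phrase, then compare the results.
    return jaccard(map(normalize, set_a), map(normalize, set_b))


original = ["large blue woolen jumper", "small red cotton shirt"]
reversed_words = [" ".join(reversed(p.split())) for p in original]

print(similarity(original, original))                      # 1.0
print(similarity(original, reversed_words))                # 0.0
print(similarity(original, reversed_words, bag_of_words))  # 1.0
```

The same data scores 0.0 or 1.0 depending only on which rule you picked, which is why the rules have to come first. Under this particular measure, removing phrases one at a time from an exact copy of a 100-phrase set does drop the score linearly (0.99, 0.98, ...). Typos and synonyms are not handled at all; those would need a different normalizer or a different measure.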