Re^2: Finding Nearly Identical Sets

BrowserUk,
I am going to answer some of your questions out of order (easiest to answer to hardest).

You already have a method of finding & extracting the number from the message; even if it is 8 or 10 digits.

Correct

What I'm getting at here is the insertions or deletions are easily detected, because the numbers are the wrong length. That only really leaves transpositions to detect?

Incorrect. While it is trivial to determine that a mistake has been made, it isn't necessarily trivial to determine what the mistake was (which set the number should have been).

You already have the set of known 9-digits numbers to compare against?

No. Think of it this way. The set represents what should be a unique identifier for the message sender. The first time I receive a message with a valid set (meets my aforementioned criterion) that doesn't appear to be similar to any other previously seen set - I will use that set to identify that sender. Not because it is correct but because it was first. A subsequent message from that sender may be objectively correct but for my purposes I don't know - I only know that it is nearly identical to what I have already seen.
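To make the "first seen wins" rule concrete, here is a minimal sketch. It is not code from this thread; `is_near`, `SenderRegistry` and `identify` are hypothetical names, a valid set is assumed to be nine decimal digits, and "nearly identical" is assumed to mean at most one substitution, insertion, deletion or adjacent transposition.

```python
def is_near(a: str, b: str) -> bool:
    """True if a and b are equal or one simple edit apart (assumed definition)."""
    if a == b:
        return True
    la, lb = len(a), len(b)
    if abs(la - lb) > 1:
        return False
    if la == lb:
        diffs = [i for i in range(la) if a[i] != b[i]]
        if len(diffs) == 1:                      # one substitution
            return True
        return (len(diffs) == 2                  # one adjacent transposition
                and diffs[1] == diffs[0] + 1
                and a[diffs[0]] == b[diffs[1]]
                and a[diffs[1]] == b[diffs[0]])
    # Lengths differ by one: one insertion or deletion.
    short, long_ = (a, b) if la < lb else (b, a)
    i = 0
    while i < len(short) and short[i] == long_[i]:
        i += 1
    return short[i:] == long_[i + 1:]


class SenderRegistry:
    """Maps each received set to the first-seen set it resembles."""

    def __init__(self):
        self.canonical = []  # first-seen sets, in arrival order

    def identify(self, received: str):
        for known in self.canonical:
            if is_near(received, known):
                return known          # earlier arrival wins, right or wrong
        if len(received) == 9 and received.isdigit():
            self.canonical.append(received)
            return received           # new sender: this set becomes the identifier
        return None                   # malformed and unlike anything seen so far
```

The linear scan over `canonical` is only there to keep the sketch short; with hundreds of thousands of live sets it would need to be replaced by some kind of indexed lookup.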

If so, how are they currently stored?

I'm still figuring that out, but because there will be parallel processing going on, I am assuming it will be in a database and that finding good candidates will require one or more index-based queries.
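One possible shape for such an index-based query is sketched below. It is an assumption, not something settled in this thread: each first-seen set is stored under itself and under every string formed by deleting one character, so that any set one substitution, insertion, deletion or adjacent transposition away shares at least one key with it. SQLite stands in for whatever database is eventually chosen, and the table and function names are hypothetical.

```python
import sqlite3


def variants(s):
    """The string itself plus every string formed by deleting one character."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variant (key TEXT NOT NULL, canonical TEXT NOT NULL)")
conn.execute("CREATE INDEX variant_key ON variant (key)")


def register(conn, canonical):
    """Store a first-seen set under each of its deletion variants."""
    conn.executemany(
        "INSERT INTO variant (key, canonical) VALUES (?, ?)",
        [(v, canonical) for v in variants(canonical)],
    )


def candidates(conn, received):
    """Return first-seen sets that share at least one variant key with `received`."""
    keys = sorted(variants(received))
    placeholders = ",".join("?" * len(keys))
    rows = conn.execute(
        f"SELECT DISTINCT canonical FROM variant WHERE key IN ({placeholders})",
        keys,
    )
    return sorted(row[0] for row in rows)
```

Sharing a key is necessary but not sufficient: two sets that differ by one deletion plus one insertion elsewhere also collide, so each candidate still needs a direct comparison before it is accepted. The cost is roughly ten rows per stored set.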

Hopefully I have answered all of your questions satisfactorily enough to advance the problem but my cold medicine is really doing a number on my cognitive abilities.

Cheers - L~R

Re^3: Finding Nearly Identical Sets by BrowserUk (Patriarch) on Sep 28, 2016 at 19:28 UTC
a valid set

When you say "set", you're referring to one group of nine, single digits? (I'm wondering why "set" rather than just 'number'?)

With nine digits, 1 .. 9, there are 888,888,888 possibilities; how many of those do you expect to be "receiving"? Over what time period? Do you reset at any point?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^4: Finding Nearly Identical Sets by Limbic~Region (Chancellor) on Sep 29, 2016 at 11:45 UTC
BrowserUk,
I was trying to reduce the problem down to only the parts that matter. Throughout the message, there are 10 markers - I am extracting and only interested in 9 of them. It is weird and highly domain specific (which I can't discuss).

This "set" of numbers for a given message is valid over a few days (let's say 10 days tops). The reset happens when the sender's right to send expires and they are removed, so it is possible for the same numbers to be re-used at some point in the future, but they would be considered "new" then. At any one time, I think I would have around 500K of the nearly 900 million possible.

Cheers - L~R
Re^5: Finding Nearly Identical Sets by BrowserUk (Patriarch) on Sep 29, 2016 at 12:04 UTC
Throughout the message, there are 10 markers - I am extracting and only interested in 9 of them.

Well, if the markers are encoded as digits, the code I posted elsewhere will still work. It would reduce memory a lot (and make the random generation for testing a bit easier and quicker) if they were encoded as 0..8 rather than 1..9.

At any one time, I think I would have around 500K of the nearly 900 million possible.

With the method I've demonstrated, that becomes a moot point, as it will continue to work -- with the same memory requirement -- right up to the point of having all 900M in play.

The real problem comes with the compounding effect of the edits. If you have 500k 9-digit knowns, then there are 67,108,864,000,000 possible 1-digit substitutions for them. Of course, there are still only 900M possible 9-digit sets, so each of those can be a substitute for ~75,000 actuals. Your problem will be to determine which of the actuals it is a match for.

The code I posted elsewhere is flawed. I completely forgot to test for the substitutions in that version. I've corrected that and am currently running a test to check I didn't screw something else up. When it's done, and assuming I didn't, I'll post the new version as a reply to the existing one. (This is just a heads up so you don't waste any of your time on that version.)
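As an illustration of the 0..8 encoding suggested above, the sketch below packs a nine-digit set into a single base-9 integer. It is not the code referred to in this reply; `pack` and `unpack` are hypothetical names, and it assumes every marker is a single digit in the range 1..9.

```python
def pack(digits: str) -> int:
    """Map a string of digits 1..9 to a dense base-9 index (each digit stored as 0..8)."""
    index = 0
    for ch in digits:
        d = int(ch)
        if not 1 <= d <= 9:
            raise ValueError(f"digit out of range 1..9: {ch!r}")
        index = index * 9 + (d - 1)
    return index


def unpack(index: int, width: int = 9) -> str:
    """Inverse of pack for a set of `width` digits."""
    out = []
    for _ in range(width):
        index, remainder = divmod(index, 9)
        out.append(str(remainder + 1))
    return "".join(reversed(out))
```

Under that assumption the indices are contiguous, starting at 0 for 111111111 and ending at 9**9 - 1 for 999999999, so a bit vector or array keyed on `pack(...)` has no unused slots for values containing a 0.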

