Re^4: Remove duplicate from the same line..

by hdb (Monsignor)
on Jun 02, 2013 at 06:12 UTC ( [id://1036550]=note )


in reply to Re^3: Remove duplicate from the same line..
in thread Remove duplicate from the same line..

This is a very good objection to the whole exercise. There is really no way to know whether "Smith Smith, Inc" is a duplication or a valid company name; without some real-world knowledge I cannot see how to distinguish between the two. As a mitigation, one could write all replacements to a log file for review and build up a list of exceptions.
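
A minimal sketch of that log-and-exceptions idea, under stated assumptions: the exception list, the helper name dedupe_line, and the rule "collapse an immediately repeated word" are all hypothetical choices made for illustration, not something fixed by this thread.

```perl
use strict;
use warnings;

# Hypothetical exception list: names where the repetition is legitimate.
my %is_exception = map { $_ => 1 } ('Smith Smith, Inc');

# Collapse an immediately repeated word ("Acme Acme" -> "Acme") and
# return the cleaned line plus a log entry (undef if nothing changed).
sub dedupe_line {
    my ($line) = @_;
    return ($line, undef) if $is_exception{$line};
    (my $clean = $line) =~ s/\b(\w+)(?:\s+\1\b)+/$1/g;
    my $log = $clean eq $line ? undef : "$line => $clean";
    return ($clean, $log);
}

for my $line ('Smith Smith, Inc', 'Acme Acme Corp') {
    my ($clean, $log) = dedupe_line($line);
    print "$clean\n";
    # In practice this would go to a log file that a human reviews.
    print STDERR "REPLACED: $log\n" if defined $log;
}
```

Anything in the log that turns out to be a valid name gets added to the exception list, so the same line is left alone on the next run.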

Re^5: Remove duplicate from the same line..
by sundialsvc4 (Abbot) on Jun 02, 2013 at 17:05 UTC

    By suggesting a “separate file with a list of replacements,” I think that you just hit the nail on the head. This is obviously a human-generated list, with variations in names that humans know refer to the same legal entity. It would be quite difficult to write a completely satisfactory algorithm to decide that some particular replacement should be done. But if you could provide a human-generated and human-maintained list of the replacements, then you could not only sanitize the list effectively, but also control and guide the process.

    For example, let’s say that you have a data-file containing records such as:

    Goldman Sachs, LLC => Goldman Sachs

    A Perl program could now read that file, split() each record on /\s*=>\s*/, and thereby obtain a hash that maps each “string to be substituted” to its “substitution string.” An input record is interesting if it contains any string that calls for substitution, and also if it contains more than one occurrence of an interesting string (which is taken to mean that the subsequent occurrences should be removed). The algorithm can be adjusted as needed ... it is now human-controlled.
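
    A rough sketch of the reading and substituting steps, under a few assumptions: the function names load_substitutions and sanitize are hypothetical, the substitutions file is simulated with an in-memory filehandle, and "remove subsequent occurrences" is interpreted as keeping only the first copy of each substitution string on a line.

```perl
use strict;
use warnings;

# Parse "from => to" records from a filehandle into a hash reference.
sub load_substitutions {
    my ($fh) = @_;
    my %subst;
    while (my $rec = <$fh>) {
        chomp $rec;
        next unless $rec =~ /\S/;                     # skip blank lines
        my ($from, $to) = split /\s*=>\s*/, $rec, 2;
        next unless defined $to;                      # skip malformed records
        $from =~ s/^\s+//;
        $to   =~ s/\s+$//;
        $subst{$from} = $to;
    }
    return \%subst;
}

# Apply every substitution, then keep only the first occurrence of
# each substitution string on the line.
sub sanitize {
    my ($line, $subst) = @_;
    # Longest keys first, so a longer name wins over its own prefix.
    for my $from (sort { length $b <=> length $a } keys %$subst) {
        my $to = $subst->{$from};
        $line =~ s/\Q$from\E/$to/g;
        my $seen = 0;
        $line =~ s/(\s*\Q$to\E)/$seen++ ? '' : $1/ge;
    }
    return $line;
}

# Stand-in for the human-maintained substitutions file.
my $data = "Goldman Sachs, LLC => Goldman Sachs\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $subst = load_substitutions($fh);
close $fh;

print sanitize('Goldman Sachs, LLC Goldman Sachs', $subst), "\n";
# prints "Goldman Sachs"
```

    The \Q...\E quoting matters here because company names routinely contain regex metacharacters such as "." and "&". Punctuation left behind between removed duplicates (a stray comma, say) is not handled in this sketch.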

    Finally, a filter program could be constructed which scans the file for lines that contain more than one occurrence of the same alphanumeric token, e.g. Goldman. A human would eyeball that list and add to the substitutions file as he or she sees fit.
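
    Such a filter might look roughly like this. The function name repeated_tokens and the sample records are hypothetical, and the comparison is assumed to be case-sensitive; a real run would read the data file instead of a hard-coded list.

```perl
use strict;
use warnings;

# Return the alphanumeric tokens that occur more than once in a line,
# in order of first appearance. Comparison is case-sensitive.
sub repeated_tokens {
    my ($line) = @_;
    my (%count, @order);
    for my $tok ($line =~ /[A-Za-z0-9]+/g) {
        push @order, $tok unless $count{$tok}++;
    }
    return grep { $count{$_} > 1 } @order;
}

# Hypothetical sample records standing in for the real input file.
my @records = (
    'Goldman Sachs, LLC Goldman Sachs',
    'Acme Corp',
);

# Report only the suspicious lines, with the repeated tokens appended.
for my $line (@records) {
    my @dups = repeated_tokens($line);
    print "$line\t[@dups]\n" if @dups;
}
```

    The output is only a list of candidates for review: nothing is changed automatically, which keeps the decision about names like "Smith Smith, Inc" with the human who maintains the substitutions file.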
