Re^5: Remove duplicate from the same line..

by sundialsvc4 (Abbot)
on Jun 02, 2013 at 17:05 UTC (#1036601)

in reply to Re^4: Remove duplicate from the same line..
in thread Remove duplicate from the same line..

By suggesting a “separate file with a list of replacements,” I think you just hit the nail on the head. This is obviously a human-generated list, with variations in names that humans know refer to the same legal entity. It would be quite difficult to write an algorithm that reliably “concludes” on its own that some particular replacement should be made. But if you provide a human-generated and human-maintained list of the replacements, then you can not only sanitize the list effectively, but also control and guide how the sanitizing operates.

For example, let’s say that you have a data-file containing records such as:

Goldman Sachs, LLC => Goldman Sachs

A Perl program could now read that file, split()ting each line on /\s*=>\s*/, and thereby obtain a hash that maps each “string to be substituted” to its “substitution string.” An input record is interesting if it contains any string that calls for substitution, and also if it contains more than one occurrence of an interesting string (which is taken to mean that the subsequent occurrences should be removed). The algorithm can be tweaked as needed ... it is now human-controlled.
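To make that concrete, here is a minimal sketch of such a program. It is only one way to read the idea, not a finished tool. The sub names load_substitutions and sanitize_record are made up for illustration, and the code assumes that the names within an input record are separated by semicolons, which may not match your real data.

```perl
use strict;
use warnings;

# Read "variant => canonical" lines into a hash keyed by variant.
sub load_substitutions {
    my ($fh) = @_;
    my %canonical_for;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;                  # skip blank lines
        my ($variant, $canonical) = split /\s*=>\s*/, $line, 2;
        next unless defined $canonical;             # ignore malformed lines
        s/^\s+|\s+$//g for $variant, $canonical;    # trim both sides
        $canonical_for{$variant} = $canonical;
    }
    return \%canonical_for;
}

# Replace every known variant with its canonical name, then keep only
# the first occurrence of each name in the record.
sub sanitize_record {
    my ($record, $canonical_for) = @_;
    return $record unless %$canonical_for;

    # Longest variants first, so a long name wins over any shorter
    # variant that it happens to contain.
    my $pattern = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %$canonical_for;

    # A single pass, so a replacement is never itself replaced again.
    $record =~ s/($pattern)/$canonical_for->{$1}/g;

    # ASSUMPTION: names in a record are separated by semicolons.
    $record =~ s/^\s+|\s+$//g;
    my %seen;
    my @names = grep { length($_) && !$seen{$_}++ } split /\s*;\s*/, $record;
    return join '; ', @names;
}

# Demo with an in-memory substitutions file.
my $subs_text = "Goldman Sachs, LLC => Goldman Sachs\n";
open my $subs_fh, '<', \$subs_text or die "cannot open in-memory file: $!";
my $map = load_substitutions($subs_fh);
close $subs_fh;

print sanitize_record('Goldman Sachs, LLC; Goldman Sachs; Morgan Stanley', $map), "\n";
# prints: Goldman Sachs; Morgan Stanley
```

Doing all the substitutions in one regex pass is a design choice: if a canonical name also appears as a variant elsewhere in the file, it will not be rewritten a second time by accident.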

Finally, a filter program could be constructed which scans the file for lines that contain more than one occurrence of the same alphanumeric token, e.g. Goldman. A human would eyeball that list and add to the substitutions file as he or she sees fit.
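The filter could look something like the following sketch. The sub names repeated_tokens and report_suspects are hypothetical, and the comparison is case-sensitive, which is an assumption on my part (lowercase the tokens first if “GOLDMAN” and “Goldman” should count as the same thing).

```perl
use strict;
use warnings;

# Return the alphanumeric tokens that occur more than once in a line,
# in order of first appearance.
sub repeated_tokens {
    my ($line) = @_;
    my (%count, @order);
    for my $token ($line =~ /[A-Za-z0-9]+/g) {
        push @order, $token unless $count{$token}++;
    }
    return grep { $count{$_} > 1 } @order;
}

# Copy to $out only the lines that contain a repeated token, followed
# by a tab and the repeated tokens in square brackets.
sub report_suspects {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;
        my @repeats = repeated_tokens($line);
        print {$out} "$line\t[@repeats]\n" if @repeats;
    }
    return;
}

# Demo with in-memory input; only the first line is reported.
my $input = "Goldman Sachs, LLC; Goldman Sachs\nMorgan Stanley\n";
open my $in, '<', \$input or die "cannot open in-memory file: $!";
report_suspects($in, \*STDOUT);
close $in;
```

For real use you would pass a handle on the data file instead of the in-memory string. The output is only a list of candidates: plenty of legitimate records repeat a token, so a human still decides what goes into the substitutions file.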
