This is a very good objection to the whole exercise. There is really no way to know whether “Smith Smith, Inc” is a duplication or a valid company name; without some real-world knowledge I cannot see a way to distinguish between the two. As a remediation, one could write all replacements to a log file for review and build a list of exceptions.
By suggesting a “separate file with a list of replacements,” I think that you just hit the nail on the head. This is obviously a human-generated list, with variations in names that (humans know ...) refer to the same legal entity. It would be quite difficult to write a completely satisfactory algorithm to “conclude that” some particular replacement should be done. But, if you could provide a (human-generated and human-maintained) list of the replacements, then you could not only sanitize the list effectively, but you could also control and guide its operation.
For example, let’s say that you have a data-file containing records such as:
Goldman Sachs, LLC => Goldman Sachs
A Perl program could now read that file, split()ting each line, of course, on /\s*\=\>\s*/, and thereby obtain a hash that maps each “string to be substituted” to its “substitution string.” An input record is interesting if it contains any string that calls for substitution, and also if it contains more than one occurrence of an interesting string (which is taken to mean that the subsequent occurrences should be removed). The algorithm can be diddled as needed ... it is now human-controlled.
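A minimal sketch of that idea follows. The names parse_rules and apply_rules are made up for illustration, and the rules are passed in as a string here; a real program would read them from the substitutions file. The "drop the later occurrences" step is a crude heuristic and will need tuning against real data.

```perl
use strict;
use warnings;

# Parse "from => to" records into a hash reference (from => to).
sub parse_rules {
    my %rule;
    for my $line (@_) {
        next unless $line =~ /\S/;                      # skip blank lines
        my ($from, $to) = split /\s*=>\s*/, $line, 2;
        next unless defined $to;                        # skip malformed lines
        s/^\s+|\s+$//g for $from, $to;                  # trim both sides
        next unless length($from);                      # never allow an empty key
        $rule{$from} = $to;
    }
    return \%rule;
}

# Apply every rule to one record, then drop repeated replacement strings.
sub apply_rules {
    my ($record, $rule) = @_;
    return $record unless %$rule;

    # Longest keys first, so a long key wins over a shorter one it contains.
    my $alt = join '|', map { quotemeta($_) }
              sort { length($b) <=> length($a) || $a cmp $b } keys %$rule;
    $record =~ s/($alt)/$rule->{$1}/g;

    # Keep the first occurrence of each replacement string, remove the rest
    # (assumption: a repeat is always an unwanted duplicate).
    my %done;
    for my $to (grep { length($_) && !$done{$_}++ } values %$rule) {
        my $seen = 0;
        $record =~ s/(\s*\Q$to\E)/$seen++ ? '' : $1/ge;
    }
    return $record;
}

my $rules = parse_rules("Goldman Sachs, LLC => Goldman Sachs\n");
print apply_rules('Goldman Sachs, LLC Goldman Sachs', $rules), "\n";
```

Building one alternation and substituting in a single pass means a replacement is never rescanned by another rule, which keeps the behaviour predictable as the human-maintained list grows.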
Finally, a filter program could be constructed which scans the file for records that contain more than one occurrence of the same alphanumeric token, e.g. Goldman. A human would eyeball that list and add to the substitutions file as he or she deems fit.
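Such a filter might look roughly like this. The function name repeated_tokens and the sample records are invented for the example; the comparison is case-sensitive, which may or may not suit the real data.

```perl
use strict;
use warnings;

# Return the alphanumeric tokens that occur more than once in a record,
# in order of first appearance.
sub repeated_tokens {
    my ($record) = @_;
    my (%count, @order);
    for my $tok ($record =~ /[A-Za-z0-9]+/g) {
        push @order, $tok unless $count{$tok}++;
    }
    return grep { $count{$_} > 1 } @order;
}

# Hypothetical sample records; a real filter would read them from the data file.
for my $record ('Goldman Sachs, LLC Goldman Sachs', 'Smith Smith, Inc', 'Acme Corp') {
    my @dups = repeated_tokens($record);
    print "$record\t[@dups]\n" if @dups;    # candidates for human review
}
```

Note that it only reports candidates. “Smith Smith, Inc” is flagged too, and whether that becomes a substitution or an exception stays a human decision.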