By suggesting a “separate file with a list of replacements,” I think you just hit the nail on the head. This is obviously a human-generated list, with variations in names that (as humans know ...) refer to the same legal entity. It would be quite difficult to write a completely satisfactory algorithm that could “conclude that” some particular replacement should be made. But if you provide a (human-generated and human-maintained) list of the replacements, then you can not only sanitize the list effectively, but also control and guide how the sanitizing operates.
For example, let’s say that you have a data-file containing records such as:
Goldman Sachs, LLC => Goldman Sachs
A Perl program could now read that file, split()ting each line on /\s*=>\s*/, and thereby obtain a hash that maps each “string to be substituted” to its “substitution string.” An input record is interesting if it contains any string that calls for substitution, and also if it contains more than one occurrence of an interesting string (which is taken to mean that the subsequent occurrences should be removed). The algorithm can be diddled as needed ... it is now human-controlled.
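Here is a minimal sketch of that idea, not a finished tool. The sub names load_replacements and sanitize are made up for illustration, as is the sample record containing “Acme Corp.” It assumes one “from => to” pair per line and applies longer variants first, so that a short variant does not clobber part of a longer one. Removing repeated occurrences is left out, because that depends on how your records are delimited.

```perl
use strict;
use warnings;

# Parse "from => to" lines into a hash (hypothetical helper).
# Blank lines and lines without "=>" are skipped.
sub load_replacements {
    my (@lines) = @_;
    my %map;
    for my $line (@lines) {
        chomp $line;
        next unless $line =~ /\S/;
        my ($from, $to) = split /\s*=>\s*/, $line, 2;
        next unless defined $to;
        $from =~ s/^\s+//;    # trim leading space on the key
        $to   =~ s/\s+$//;    # trim trailing space on the value
        $map{$from} = $to;
    }
    return \%map;
}

# Replace every listed variant with its canonical form (hypothetical helper).
# Longest variants go first; \Q...\E makes the variant match literally.
sub sanitize {
    my ($record, $map) = @_;
    for my $from (sort { length($b) <=> length($a) || $a cmp $b } keys %$map) {
        $record =~ s/\Q$from\E/$map->{$from}/g;
    }
    return $record;
}

# In real use the lines would come from the human-maintained file.
my $map = load_replacements("Goldman Sachs, LLC => Goldman Sachs\n");
print sanitize("Acme Corp; Goldman Sachs, LLC", $map), "\n";
```

Note that substitutions are applied one after another, so a replacement string that itself contains another listed variant would be rewritten again. Whether that is wanted is, once more, something the human maintaining the list gets to decide.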
Finally, a filter-program could be constructed which scans the input file for records that contain more than one occurrence of the same alphanumeric token, e.g. Goldman. A human would eyeball that list and add to the substitutions-file as he or she deems fit.
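Such a filter might look roughly like this. Again, this is only a sketch: repeated_tokens is a made-up name, the sample records are invented, and “alphanumeric token” is taken to mean a run of ASCII letters and digits, compared case-sensitively.

```perl
use strict;
use warnings;

# Return the alphanumeric tokens that occur more than once in a record,
# in order of first appearance (hypothetical helper).
sub repeated_tokens {
    my ($record) = @_;
    my (%count, @order);
    for my $token ($record =~ /[A-Za-z0-9]+/g) {
        # Remember each token the first time it is seen, count every time.
        push @order, $token unless $count{$token}++;
    }
    return grep { $count{$_} > 1 } @order;
}

# Invented sample records; real ones would be read from the input file.
my @records = (
    "Goldman Sachs, LLC; Goldman Sachs",
    "Acme Corp",
);

# Print only the suspicious records, with the repeated tokens alongside.
for my $record (@records) {
    my @dups = repeated_tokens($record);
    print "$record\t[@dups]\n" if @dups;
}
```

Only the first sample record gets reported, since both Goldman and Sachs appear twice in it. The output is meant for a person to read, not for feeding straight back into the substitutions-file.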