PerlMonks
Re: Will it work?
by CountZero (Bishop)
on Mar 15, 2016 at 19:09 UTC (#1157840)
Perl is certainly the best choice for such a problem, but it is not a magic bullet. Perl excels at extracting data from many types of files, but whether there is actually a solution for your problem will depend less on the programming language than on the data you are given.

If the data are in a more or less standard format (for instance, the recommendation is always the last sentence or paragraph of the file), then you have a fighting chance to succeed. But if the data are essentially free format, then you will first have to solve the problem of natural language parsing and understanding, and that is quite a different task!

That being said, I once had to extract, from a database with several hundred thousand claim descriptions, those records which concerned temperature damage to temperature-controlled cargo in containers. I let Perl choose about 500 records at random and marked each of them by hand as a "hit" or a "miss". These records and their "hit or miss" marks were then given to a second Perl script that did a Bayesian analysis (there are modules on CPAN that provide all the basic infrastructure for you) and built a corpus of "hit" and "miss" words. With this corpus and the Bayesian analysis modules, the whole database was analyzed and the "hits" identified.

A final script extracted a random sample from these results, which was checked by hand to see how accurate the process was and to give some statistically founded levels of confidence. If I remember correctly, about 5% of the records were wrongly categorized. Not a perfect result, but "good enough" for my purpose then, and besides, I only had one day to deliver a result.

Update: added description of a real use case.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity."
- The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics
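
For anyone who wants to try the "hit or miss" approach described above, here is a minimal sketch of the training and classification steps. It is not the original script and it does not use any particular CPAN module: it is a hand-rolled naive Bayes classifier with add-one smoothing, written with core Perl only. The subroutine names (`tokenize`, `train`, `classify`) and the four sample claim descriptions are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a description into lowercase word tokens.
sub tokenize {
    my ($text) = @_;
    return grep { length } split /[^a-z0-9]+/, lc $text;
}

# Build per-label word counts from hand-marked records.
# Each record is an array ref: [ $label, $text ].
sub train {
    my (@labelled) = @_;
    my %model = ( docs => {}, words => {}, total => {}, vocab => {} );
    for my $rec (@labelled) {
        my ( $label, $text ) = @$rec;
        $model{docs}{$label}++;
        for my $word ( tokenize($text) ) {
            $model{words}{$label}{$word}++;
            $model{total}{$label}++;
            $model{vocab}{$word} = 1;
        }
    }
    return \%model;
}

# Return the most probable label for a text. Scores are summed log
# probabilities with add-one (Laplace) smoothing; words never seen
# during training are ignored.
sub classify {
    my ( $model, $text ) = @_;
    my $n_docs = 0;
    $n_docs += $_ for values %{ $model->{docs} };
    my $vocab_size = keys %{ $model->{vocab} };

    my ( $best_label, $best_score );
    for my $label ( sort keys %{ $model->{docs} } ) {
        my $score = log( $model->{docs}{$label} / $n_docs );
        my $total = $model->{total}{$label} // 0;
        for my $word ( tokenize($text) ) {
            next unless $model->{vocab}{$word};
            my $count = $model->{words}{$label}{$word} // 0;
            $score += log( ( $count + 1 ) / ( $total + $vocab_size ) );
        }
        ( $best_label, $best_score ) = ( $label, $score )
            if !defined $best_score || $score > $best_score;
    }
    return $best_label;
}

# Hypothetical hand-marked sample (the real one had about 500 records).
my @labelled = (
    [ hit  => 'reefer container temperature too high, frozen cargo thawed' ],
    [ hit  => 'cooling unit failure, temperature damage to chilled fruit' ],
    [ miss => 'container dropped during loading, cargo crushed' ],
    [ miss => 'water damage to cartons after heavy rain' ],
);

my $model = train(@labelled);
print classify( $model, 'frozen fish thawed after reefer unit failure' ), "\n";
print classify( $model, 'cartons crushed during loading' ), "\n";
```

With this toy data the two test descriptions come out as `hit` and `miss` respectively. On real data you would train on the hand-marked sample, run `classify` over every record, and then check a fresh random sample of the results by hand to estimate the error rate, as described in the post.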