Re: Will it work?

by CountZero (Bishop)
in reply to Perl Possibilities

Perl is certainly the best choice for such a problem, but it is not a magic bullet.

Perl excels at extracting data from many types of files, but whether there is actually a solution for your problem will depend less on the programming language than on the data you are given. If the data are in a more or less standard format (for instance, the recommendation is always the last sentence or paragraph of the file), then you have a fighting chance of succeeding. But if the data are essentially free format, you will first have to solve the problem of natural language parsing and understanding, and that is quite a different task!
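
To illustrate the easy case, here is a minimal sketch that pulls the last paragraph out of a text, assuming paragraphs are separated by blank lines. The sub name last_paragraph and the sample report are made up for this example; your files may well need a different rule.

```perl
use strict;
use warnings;

# Return the last non-empty paragraph of a text, where paragraphs are
# assumed to be separated by one or more blank lines.
sub last_paragraph {
    my ($text) = @_;
    my @paragraphs = grep { /\S/ } split /\n\s*\n/, $text;
    return unless @paragraphs;      # no paragraph found
    my $last = $paragraphs[-1];
    $last =~ s/^\s+|\s+$//g;        # trim surrounding whitespace
    return $last;
}

# Invented sample report, for illustration only.
my $report = <<'END';
Inspection of container on arrival.

Cargo found thawed; reefer unit was switched off.

Recommendation: reject the claim.
END

print last_paragraph($report), "\n";
```

If the recommendation does not reliably sit in a fixed place like this, a rule of this kind will not be enough.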

That being said, I once had to extract, from a database with several hundred thousand claim descriptions, those records which concerned temperature damage to temperature-controlled cargo in containers. I let Perl pick about 500 records at random and marked each of them by hand as a "hit" or a "miss". These records and their "hit or miss" markings were then given to a second Perl script that did a Bayesian analysis (there are modules on CPAN that provide all the basic infrastructure for you) and built a corpus of "hit" and "miss" words. With this corpus and the Bayesian analysis modules, the whole database was analyzed and the "hits" identified. A final script extracted a random sample from these results, which was checked by hand to see how accurate the process was and to give some statistically founded levels of confidence. If I remember correctly, about 5% of the records were wrongly categorized. Not a perfect result, but "good enough" for my purpose at the time, and besides, I only had one day to deliver a result.
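
The train-then-classify idea can be sketched roughly as follows. This is not the original script and it does not use the CPAN modules mentioned above: it is a small hand-rolled naive Bayes classifier with add-one smoothing, and the sub names (tokens, train, classify) and the four training records are invented for illustration.

```perl
use strict;
use warnings;

# Split a description into lowercase word tokens.
sub tokens {
    my ($text) = @_;
    return grep { length } split /\W+/, lc $text;
}

# Build per-label word counts from hand-marked records.
# Each record is an array ref: [ $label, $text ].
sub train {
    my (@records) = @_;
    my %model = (docs => {}, words => {}, total => {}, vocab => {});
    for my $rec (@records) {
        my ($label, $text) = @$rec;
        $model{docs}{$label}++;
        for my $w (tokens($text)) {
            $model{words}{$label}{$w}++;
            $model{total}{$label}++;
            $model{vocab}{$w} = 1;
        }
    }
    return \%model;
}

# Return the label with the highest log-probability for a text.
sub classify {
    my ($model, $text) = @_;
    my $n_docs = 0;
    $n_docs += $_ for values %{ $model->{docs} };
    my $vocab_size = keys %{ $model->{vocab} };
    my ($best, $best_score);
    for my $label (sort keys %{ $model->{docs} }) {
        my $score = log($model->{docs}{$label} / $n_docs);   # prior
        my $total = $model->{total}{$label} // 0;
        for my $w (tokens($text)) {
            my $count = $model->{words}{$label}{$w} // 0;
            # Add-one smoothing: unseen words must not zero the score.
            $score += log(($count + 1) / ($total + $vocab_size));
        }
        ($best, $best_score) = ($label, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

# Invented training records, standing in for the hand-marked sample.
my $model = train(
    [ hit  => 'frozen fish thawed after reefer unit failure' ],
    [ hit  => 'temperature too high in reefer container, cargo spoiled' ],
    [ miss => 'container dropped by crane, cartons crushed' ],
    [ miss => 'cargo stolen from container at terminal' ],
);

print classify($model, 'reefer temperature alarm, fish spoiled'), "\n";  # prints "hit"
```

With only four toy records this proves nothing about accuracy, of course. The point is the shape of the workflow: mark a sample by hand, train on it, classify everything else, then check a random sample of the output by hand.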

