
Re: Will it work?

by CountZero (Bishop)
on Mar 15, 2016 at 19:09 UTC (#1157840)

in reply to Perl Possibilities

Perl is certainly the best choice for such a problem, but it is not a magic bullet.

Perl excels at extracting data from many types of files, but whether your problem can actually be solved depends less on the programming language than on the data you are given. If the data are in a more or less standard format (for instance, the recommendation is always the last sentence or paragraph of the file), then you have a fighting chance of succeeding. But if the data are essentially free-format, you will first have to solve the problem of natural language parsing and understanding, and that is quite a different task!
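To illustrate the "standard format" case: if the recommendation really is always the last paragraph, something like the following would probably be enough. This is only a minimal sketch. The sub name last_paragraph and the sample report text are made up for the example, and I assume paragraphs are separated by blank lines.

```perl
use strict;
use warnings;

# Return the last non-empty paragraph of a text, assuming paragraphs
# are separated by one or more blank lines (an assumption about the data).
sub last_paragraph {
    my ($text) = @_;

    # Split on blank lines and drop chunks that contain only whitespace.
    my @paragraphs = grep { /\S/ } split /\n\s*\n/, $text;
    return unless @paragraphs;

    # Trim leading and trailing whitespace from the final paragraph.
    my $last = $paragraphs[-1];
    $last =~ s/^\s+|\s+$//g;
    return $last;
}

# Hypothetical report text, invented for this example.
my $report = "Claim 123\n\nCargo arrived thawed.\n\nRecommendation: reject the claim.\n";
print last_paragraph($report), "\n";
```

If the real files deviate from that layout even occasionally, the sub will silently return the wrong paragraph, so it pays to spot-check the output by hand.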

That being said, I once had to extract, from a database with several hundred thousand claim descriptions, those records that concerned temperature damage to temperature-controlled cargo in containers. I let Perl pick about 500 records at random and marked each one by hand as a "hit" or a "miss". These records and their "hit or miss" labels were then given to a second Perl script that did a Bayesian analysis (there are modules on CPAN that provide all the basic infrastructure for you) and built a corpus of "hit" and "miss" words. With this corpus and the Bayesian analysis modules, the whole database was analyzed and the "hits" identified. A final script extracted a random sample from these results, which was checked by hand to see how accurate the process was and to give some statistically founded levels of confidence. If I remember correctly, about 5% of the records were wrongly categorized. Not a perfect result, but "good enough" for my purpose at the time, and besides, I only had one day to deliver a result.
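For those who want to see the idea behind the "hit or miss" corpus, here is a rough sketch of a naive Bayes word-count classifier in plain Perl. This is not the script I used back then, and it does not use any CPAN module: the subs tokenize, train and classify, as well as the four sample records, are invented purely for illustration. It assumes Perl 5.10 or later (for the // operator) and uses simple add-one smoothing.

```perl
use strict;
use warnings;

# Split a description into lowercase word tokens.
sub tokenize {
    my ($text) = @_;
    return grep { length } split /[^a-z0-9]+/, lc $text;
}

# Build the word "corpus" from hand-labelled records.
# $labelled is an array ref of [ $label, $text ] pairs.
sub train {
    my ($labelled) = @_;
    my %model = ( docs => {}, words => {}, total => {}, vocab => {} );
    for my $rec (@$labelled) {
        my ( $label, $text ) = @$rec;
        $model{docs}{$label}++;                 # records seen per label
        for my $word ( tokenize($text) ) {
            $model{words}{$label}{$word}++;     # word count per label
            $model{total}{$label}++;            # all words per label
            $model{vocab}{$word} = 1;           # overall vocabulary
        }
    }
    return \%model;
}

# Return the label with the highest log-probability for $text.
sub classify {
    my ( $model, $text ) = @_;
    my $n_docs = 0;
    $n_docs += $_ for values %{ $model->{docs} };
    my $vocab_size = keys %{ $model->{vocab} };

    my ( $best, $best_score );
    for my $label ( sort keys %{ $model->{docs} } ) {
        # Start from the prior: the share of training records with this label.
        my $score = log( $model->{docs}{$label} / $n_docs );
        my $total = $model->{total}{$label} // 0;
        for my $word ( tokenize($text) ) {
            my $count = $model->{words}{$label}{$word} // 0;
            # Add-one smoothing so unseen words do not zero out the score.
            $score += log( ( $count + 1 ) / ( $total + $vocab_size ) );
        }
        ( $best, $best_score ) = ( $label, $score )
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

# Hypothetical hand-labelled sample (the real one had about 500 records).
my @sample = (
    [ hit  => 'reefer container temperature too high, frozen fish thawed' ],
    [ hit  => 'cooling unit failure, cargo of fruit spoiled by temperature' ],
    [ miss => 'container dropped in port, machinery dented' ],
    [ miss => 'water damage to cartons of paper after heavy rain' ],
);
my $model = train( \@sample );
print classify( $model, 'frozen meat thawed after reefer failure' ), "\n";
```

With only four training records this is a toy, of course. With a few hundred labelled records, and a hand-checked random sample of the results afterwards, the same approach is probably the quickest way to find out whether it is "good enough" for your data.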

Update: added description of a real use case.


"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics
