Perl Possibilities

Gideau has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Will it work? by Corion (Patriarch) on Mar 15, 2016 at 10:40 UTC
I think this will highly depend on how machine-readable your recommendations are, and how much prose there is. Perl can read various file formats, such as Excel, MS Word, text files and some versions of PDF formats, to extract the information stored in them. One large issue will be extracting the recommendation once you have the raw text. This will border on "sentiment analysis" unless you can come up with a sure-fire list of phrases that indicate a strong recommendation. If you have never programmed before, I recommend that you use a program which your faculty is familiar with.
Re: Will it work? by choroba (Cardinal) on Mar 15, 2016 at 10:40 UTC
It's definitely possible, but the level of complexity depends on many aspects you didn't mention. What format are the filings in? Plain text, Word documents (which version), PDF, other? How large are the filings? What languages were used to write them? How standardised are they? See for example the Lingua and Treex namespaces for modules that could help you process natural language. ($q=q:Sq=~/;[c](.)(.)/;chr(-\|\|-\|5+lengthSq)`"S\|oS2"`map{chr \|+ord }map{substrSq`S_+\|`\|}3E\|-\|`7**2-3:)=~y+S\|`+$1,++print+eval$q,q,a,
Re: Perl Possibilities by ww (Archbishop) on Mar 15, 2016 at 11:45 UTC
The master's candidate (OP) formatted this better than s/he did with Reaped: Will it work? but the original has two excellent replies (as of 0730 EDT 2016/03/15). If this thread is retained (the original had been considered for reaping before I saw it) the replies should be reparented here. Now, FWIW, a brief endorsement of those replies: source format(s) are important; degree of standardization is probably even more so, unless OP is willing to create an exhaustive list of equivalents and close alternates to "recommendation" and to do additional parsing to assure that the extracted information isn't polluted by comments such as "cannot in good conscience recommend (proposal A || proposal B)". If the source data is a set of Y/N answers to questions, that's not going to be difficult, but if it's (conversational or formal) prose, then going thru mere thousands of filings is apt to look quite easy. And what sort of sample data (other than the actual source data) would provide a thorough test of the code?
`++$anecdote ne $data`
Re: Perl Possibilities by Ea (Chaplain) on Mar 15, 2016 at 15:33 UTC
Yes, Perl is one of many languages that can help you with Text Mining and Information Retrieval. It certainly helped with my thesis, but like all tools, it can take some time to learn to use. What you haven't clearly said is whether you only want to do a bit of parsing/extracting or go full on into Natural Language Processing, which is why you were asked for a small sample to give us an idea what you're after. As mentioned above, your baby steps will likely involve regular expressions, which are actually a small language all on their own, but IMHO, Perl has the best regex implementation. You can find a short introduction in Modern Perl. I liked it so much, I bought a paper copy. There are a lot of third-party libraries available from CPAN that go beyond what's easily achievable with the basic language. Investigate the Lingua::EN modules for some ideas and try not to re-invent the wheel without a good reason. Best of luck,
Sometimes I can think of 6 impossible LDAP attributes before breakfast.
Re: Will it work? by CountZero (Bishop) on Mar 15, 2016 at 19:09 UTC
Perl is certainly the best choice for such a problem, but it is not a magic bullet. Perl excels in extracting data from many types of files, but whether there is actually a solution for your problem will depend less on the programming language than on the data you are given. If the data are in a more or less standard format, for instance, the recommendation is always the last sentence or paragraph of the file, then you have a fighting chance to succeed. But if the data is essentially free format then you will first have to solve the problem of natural language parsing and understanding, and that is quite a different task!

That being said, I once had to extract, from a database with several hundred thousand descriptions of claims, those records which concerned temperature damage to temperature-controlled cargo in containers. I let Perl randomly choose about 500 records and marked these by hand as "hit" or "miss". Then these records and "hit or miss" indications were given to a second Perl script that did a Bayesian analysis (there are modules on CPAN that provide all the basic infrastructure for you) and built a corpus of "hit" and "miss" words. With this corpus and the Bayesian analysis modules the whole database was analyzed and the "hits" identified. A final script extracted a random sample from these results that was checked by hand to see how accurate the process was and to give some statistically founded levels of confidence. If I remember well, it had about 5% wrongly categorized records. Not a perfect result, but "good enough" for my purpose then, and besides I only had one day to deliver a result.

Update: added description of a real use case.

CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
My blog: Imperial Deltronics
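The workflow described in this reply (hand-label a random sample, train on it, classify the rest, spot-check the result) can be sketched in plain Perl. This is only an illustration: the training sentences are invented, the helper names `tokens`, `train` and `classify` are hypothetical, and a real project would more likely use one of the Bayesian modules on CPAN than this hand-rolled naive Bayes with add-one smoothing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split text into lowercase word tokens.
sub tokens { return map { lc } $_[0] =~ /([A-Za-z]+)/g }

# Build per-label word counts from hand-marked records.
# Each record is [ $label, $text ].
sub train {
    my @records = @_;
    my %model = ( docs => {}, words => {}, total => {}, vocab => {} );
    for my $rec (@records) {
        my ( $label, $text ) = @$rec;
        $model{docs}{$label}++;
        for my $w ( tokens($text) ) {
            $model{words}{$label}{$w}++;
            $model{total}{$label}++;
            $model{vocab}{$w} = 1;
        }
    }
    return \%model;
}

# Return the label with the highest log-probability for $text.
sub classify {
    my ( $model, $text ) = @_;
    my $n_docs = 0;
    $n_docs += $_ for values %{ $model->{docs} };
    my $vocab_size = keys %{ $model->{vocab} };
    my ( $best, $best_score );
    for my $label ( sort keys %{ $model->{docs} } ) {
        my $score = log( $model->{docs}{$label} / $n_docs );
        my $total = $model->{total}{$label} || 0;
        for my $w ( tokens($text) ) {
            my $count = $model->{words}{$label}{$w} || 0;
            # Add-one smoothing so unseen words do not zero out the score.
            $score += log( ( $count + 1 ) / ( $total + $vocab_size ) );
        }
        ( $best, $best_score ) = ( $label, $score )
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

# Invented stand-ins for the hand-marked sample.
my $model = train(
    [ hit  => 'reefer container temperature too high, frozen cargo thawed' ],
    [ hit  => 'cooling unit failed, temperature damage to chilled fruit' ],
    [ miss => 'container dropped by crane, cartons crushed' ],
    [ miss => 'cargo stolen from warehouse, seal broken' ],
);

print classify( $model, 'frozen fish thawed after cooling unit failed' ), "\n";
print classify( $model, 'cartons crushed when crane dropped the box' ),  "\n";
```

With only four training records the result means little; the point is the shape of the process. The final step from the reply, checking a random sample of the output by hand, is still what tells you the error rate.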
Re: Perl Possibilities by Gideau (Novice) on Mar 15, 2016 at 12:16 UTC
So in response to the comments on this and also my badly formatted one: The files containing the information are quite standardized and in .txt format, written in English. Every filing contains the phrase "the board of directors recommends to vote for/against...". Basically I want to match the outcome of the votes (which I already have in a database), being PASS or FAIL, with a simple recommendation being YES or NO, which I hope to obtain from the filings with Perl. Also I'm wondering what the final output of the Perl program could look like. This is because I will have to merge the already existing vote outcome database with the corresponding recommendations. Is this also something that can be done? Thanks for the quick replies!
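One possible shape for the output is a tab-separated file keyed on whatever identifier the vote database already uses. The sketch below is only an assumption about how that might look: the filing IDs, the sample sentences and the helper name `recommendation` are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map the standard phrase to YES / NO; return undef when it is absent.
sub recommendation {
    my ($text) = @_;
    return unless $text =~ /recommends\s+to\s+vote\s+(for|against)\b/i;
    return lc($1) eq 'for' ? 'YES' : 'NO';
}

# Hypothetical vote outcomes, as they might come out of the existing database.
my %outcome = ( F001 => 'PASS', F002 => 'FAIL', F003 => 'PASS' );

# Hypothetical filing texts keyed by the same identifier.
my %filing = (
    F001 => 'The board of directors recommends to vote FOR the proposal.',
    F002 => "The board of directors recommends\nto vote against this proposal.",
    F003 => 'No recommendation is made.',
);

# Emit one merged, tab-separated row per filing.
print join( "\t", qw(id outcome recommendation) ), "\n";
for my $id ( sort keys %outcome ) {
    my $rec = recommendation( $filing{$id} // '' ) // 'NO_MATCH';
    print join( "\t", $id, $outcome{$id}, $rec ), "\n";
}
```

A tab-separated file like this can be imported into a spreadsheet or database table and joined on the `id` column. Rows marked `NO_MATCH` are the ones to inspect by hand.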
Re^2: Perl Possibilities by LanX (Saint) on Mar 15, 2016 at 12:32 UTC
Please show example .txt ... complicated cases please

> Is this also something that can be done?

Yes No Depends ;)

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!
Re^3: Perl Possibilities by Gideau (Novice) on Mar 15, 2016 at 12:42 UTC
How exactly do you want me to do that? Not sure if I can upload something here?
Re^4: Perl Possibilities by AnomalousMonk (Archbishop) on Mar 15, 2016 at 15:04 UTC
Re^2: Perl Possibilities by graff (Chancellor) on Mar 16, 2016 at 05:37 UTC
Based on your description, something like this might be a good start. It assumes you can run a pipeline command that feeds the list of file names to the perl script's STDIN, e.g. using a bash shell with standard GNU/Linux utils, a command line like this:

```
cd {top_level_directory_where_all_files_are_located}
find * -type f | search_script.pl > hit_list.txt
```

And "search_script.pl" would be something like this:

```
#!/usr/bin/perl
use strict;
use warnings;

while (<STDIN>) {
    chomp;
    next unless -f $_;    # skip anything that is not a plain file

    my $text;
    if ( open( my $fh, '<', $_ ) ) {
        local $/;         # slurp mode: read the whole file at once
        $text = <$fh>;
    }
    else {
        warn "Unable to read $_\n";
        $text = '';       # falls through to NO_MATCH below
    }

    if ( $text =~ /\s(recommend\S*\s+to\s+vote\s+\S+)/ ) {
        ( my $hit = $1 ) =~ s/\s+/ /g;    # normalize white-space
        print "$_: $hit\n";
    }
    else {
        print "$_: NO_MATCH\n";
    }
}
```

Note that in bash you can redirect STDERR as well:

```
find -type f | search_script.pl > hit_list.txt 2> search.errlog
```

The output to STDOUT will tell you which files have the sought-for text, and what the text was. It also lists the files that failed to match, so you can take a closer look at those, and tweak the regex as needed. The regex proposed above will match all the inflections on "recommend" (-ed, -ing, -s, -ation), and will capture the matched phrase only up to the word that follows "vote". (You can extend the capture to include more words before and/or after, if you like, by adding more `\s+\S+\s+` elements inside the parens.) When there's a match, all kinds of white-space between words are allowed, and it's all normalized to a single space before output, to ensure one line of output per file.
Re^3: Perl Possibilities by Gideau (Novice) on Mar 16, 2016 at 12:49 UTC
This actually really looks like something I can use! Thank you very much! I will have to add something in the code that would differentiate the different proposals from each other and match the recommendations and proposals, since there could be multiple proposals per filing, and thus multiple recommendations. This, however, is a very very nice start for me :D
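If a filing carries several proposals, one way to start is to walk through the text with a `/g` match and pair each recommendation with the nearest preceding proposal heading. The sketch below assumes headings of the form "Proposal N", which is a guess about the filings rather than something stated in this thread; `recommendations_by_proposal` is a hypothetical helper name and the sample text is invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a list of [ proposal_number, 'YES'|'NO' ] pairs in document order.
sub recommendations_by_proposal {
    my ($text) = @_;
    my ( @pairs, $current );

    # Each iteration finds either a proposal heading or a recommendation phrase.
    while ( $text =~ /\bproposal\s+(\d+)\b|recommends\s+to\s+vote\s+(for|against)\b/gi ) {
        if ( defined $1 ) {
            $current = $1;    # remember the latest heading seen
        }
        else {
            # '?' marks a recommendation seen before any heading.
            push @pairs, [ $current // '?', lc($2) eq 'for' ? 'YES' : 'NO' ];
        }
    }
    return @pairs;
}

my $sample = <<'END';
Proposal 1: Election of directors.
The board of directors recommends to vote FOR each nominee.
Proposal 2: Shareholder proposal on lobbying disclosure.
The board of directors recommends to vote AGAINST
this proposal.
END

printf "proposal %s: %s\n", @$_ for recommendations_by_proposal($sample);
```

Pairing by position is fragile: a sentence such as "recommends to vote for proposal 3" would update the heading only after the verdict has been recorded, so real filings will probably need a stricter heading pattern.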
Re^4: Perl Possibilities by graff (Chancellor) on Mar 18, 2016 at 22:05 UTC
Re: Perl Possibilities by Anonymous Monk on Mar 15, 2016 at 11:42 UTC
Yes, it's possible! The basic approach would be to use one or more regular expressions to extract the information from the source text. However, the best way to go about it depends on the format of the source files. Could you tell us that? Is it possible for you to provide an example? If you haven't yet, you should probably start with perlintro and/or Getting Started with Perl to get an overview of how to work with Perl.
Re: Will it work? by Marshall (Canon) on Mar 15, 2016 at 21:04 UTC
Without an example, I have trouble answering your question. However, if the situation is one where a very detail-oriented person who knows minimal English could sit there and look at 1,000 papers and summarize the results by extracting certain key phrases, even without knowing exactly what they mean, then the probability is high that a program can be written to do that. Programs don't work well with "sort of" or "interpret what you think about this...". "Recommend: Yes/No" is something that a program can detect. "I'm leaning towards voting Yes, but at this time, I am unsure" is something that a program has close to zero chance of figuring out. To have a chance at this, you need to identify some key phrases and a syntax that a very, very literal detailed person could use to extract your info. This very, very literal detailed person (the program) will do its job flawlessly, but only within very strict rules. You could wind up in a situation where the program can do 900 of 1,000 files with a clear result, but yet you wind up with 100 to do manually. This has to do with the "rules" and whether the detailed savant (the program) can tell if it got a valid result or not. I've worked with situations where the program can get to 99.5% with certainty, but for the other 0.5%, it knows that it is not certain. Update: 0.5% may not seem like a lot, but if there are 350,000 records, this is a big deal. Try to find some simple rules where you are absolutely certain that the correct result has been found. Then see what that percentage is. If that is 90%, then you are probably in pretty good shape as the program did 90% of the work! To get something like this completely automated, the program may need to start applying some ad-hoc rules that involve some uncertainty, and that means that the program will guess "wrong" some of the time. You have to decide whether that matters or not.
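The idea of strict rules plus an explicit "not certain" bucket could look something like the following. The rule used here (the standard phrase must yield exactly one distinct for/against verdict to count as certain) is an assumption made for illustration, as are the sample texts and the helper name `verdict`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 'YES' or 'NO' only when the text is unambiguous, else 'REVIEW'.
sub verdict {
    my ($text) = @_;
    my %seen;
    $seen{ lc $1 }++ while $text =~ /recommends\s+to\s+vote\s+(for|against)\b/gi;
    return 'REVIEW' if keys(%seen) != 1;    # nothing found, or conflicting hits
    return exists $seen{for} ? 'YES' : 'NO';
}

# Invented sample texts: two clear cases, one conflicting, one vague.
my @texts = (
    'The board of directors recommends to vote for the proposal.',
    'The board of directors recommends to vote against the proposal.',
    'The board recommends to vote for item 1 and recommends to vote against item 2.',
    'The board is leaning towards yes but is unsure at this time.',
);

my %count;
$count{ verdict($_) }++ for @texts;

# Report how much was settled automatically and how much is left for a human.
my $certain = ( $count{YES} // 0 ) + ( $count{NO} // 0 );
printf "certain: %d of %d (%.0f%%), manual review: %d\n",
    $certain, scalar @texts, 100 * $certain / @texts, $count{REVIEW} // 0;
```

The program never guesses: anything that does not fit the rule lands in the manual pile, and the reported percentage shows whether the rule is good enough or needs loosening.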
Re: Perl Possibilities by Gideau (Novice) on Mar 16, 2016 at 12:39 UTC
Hey guys! Thanks so much already for all the helpful comments you've given me. I feel that I'm already making progress on tackling my research! So as some of you asked, I will post a small subset of text that contains the information I need. To give some context: it's from an SEC filing that companies have to make when there are any proposals made by shareholders. As said before, I'm looking for the recommendation of the Board of Directors on these proposals. An example of a filing that I will be using can be found here: Filing Example. The filings differ between companies in terms of proposals etc. However, in each filing there is this recommendation that I'm looking for, which is (almost) always stated in the same manner. I hope this gives some more clarity! Thanks again so much for your help. I really appreciate it!
Re^2: Perl Possibilities by Corion (Patriarch) on Mar 16, 2016 at 12:43 UTC
As the data already is in a fairly tabular format, and in HTML, I would use HTML::TableExtract to get at the table data. With the data in hand, it should be easy to extract the vote recommendations by checking whether `FOR` or `AGAINST` is contained in the relevant column.
Re^3: Perl Possibilities by Gideau (Novice) on Mar 16, 2016 at 12:57 UTC
You're right about that indeed. However, the problem is that, as far as I know, very few companies use a table like the one in the example, where they clearly state the proposals and their recommendations. Furthermore, I've already downloaded quite a few filings for testing purposes, and they end up as .txt files that are still formatted in HTML (so you see all the HTML code in the .txt surrounding the actual text). Would you say it's smarter to keep the .txt or convert back to .html before I write the extraction scripts?
Re^4: Perl Possibilities by Corion (Patriarch) on Mar 16, 2016 at 15:56 UTC
Re^4: Perl Possibilities by ww (Archbishop) on Mar 16, 2016 at 16:10 UTC
Re^5: Perl Possibilities by Anonymous Monk on Mar 16, 2016 at 16:18 UTC
Some notes below your chosen depth have not been shown here
Re: Perl Possibilities by perlfan (Vicar) on Mar 16, 2016 at 16:44 UTC
protip: Some people say "Perl" refers to the language and "perl" refers to the binary interpreter. I am in that camp. But everyone agrees that one should never write "PERL". In fact, if you see that on a job posting or resume, run. Run far away.

