PerlMonks  

Perl Possibilities

by Gideau (Novice)
on Mar 15, 2016 at 10:34 UTC

Gideau has asked for the wisdom of the Perl Monks concerning the following question:

Dearest Perl Monks,

I have a question regarding the possibilities of perl that I'm desperately seeking to answer... The situation is as follows:

Currently I'm writing my Master's thesis, in which I'm planning to use Perl to analyze certain filings and extract information from them. We're talking about a couple of thousand filings, so doing it by hand is not feasible. Every one of these filings gives a recommendation on a proposal (as in: he/she recommends voting for/against the proposal). This recommendation is the piece of information I need, so I want to extract it from the filings with Perl and combine it with the proposal it refers to, in order to continue my research. My questions are: is it possible to do this with Perl? And, if so, how difficult would it be to write code that does exactly that? Basically I need to extract the recommendation and the proposal it concerns in a form that I can use for further statistical analysis in e.g. Stata.

I wrote it as concisely as possible, but I hope it's complete enough :).

I hope you splendid monks have the answer to all my questions! Thanks in advance:).

UPDATE:

Here is a link to an example filing: Filing example

The filings are originally in .html; however, I downloaded a couple of thousand of them with a script and ended up with .txt files that still contain all the HTML code around the text, which makes them less readable, at least to the human eye.

Replies are listed 'Best First'.
Re: Will it work?
by Corion (Patriarch) on Mar 15, 2016 at 10:40 UTC

    I think this will depend heavily on how machine-readable your recommendations are, and how much prose there is. Perl can read various file formats, such as Excel, MS Word, text files and some versions of PDF, to extract the information stored in them.

    One large issue will be extracting the recommendation once you have the raw text. This will border on "sentiment analysis" unless you can come up with a sure-fire list of phrases that indicate a strong recommendation.

    If you have never programmed before, I recommend that you use a program which your faculty is familiar with.

Re: Will it work?
by choroba (Cardinal) on Mar 15, 2016 at 10:40 UTC
    It's definitely possible, but the level of complexity depends on many aspects you didn't mention.

    What format are the filings in? Plain text, Word documents (which version), PDF, other? How large are the filings? What languages were used to write them? How standardised are they?

    See for example the Lingua and Treex namespaces for modules that could help you process natural language.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Perl Possibilities
by ww (Archbishop) on Mar 15, 2016 at 11:45 UTC

    The master's candidate (OP) formatted this better than s/he did with Reaped: Will it work? but the original has two excellent replies (as of 0730 EDT 2016/03/15). If this thread is retained (the original had been considered for reaping before I saw it) the replies should be reparented here.

    Now, FWIW, a brief endorsement of those replies: source format(s) are important; degree of standardization is probably even more so, unless OP is willing to create an exhaustive list of equivalents and close alternates to "recommendation" and to do additional parsing to assure that the extracted information isn't polluted by comments such as "cannot in good conscience recommend (proposal A || proposal B)". If the source data is a set of Y/N answers to questions, that's not going to be difficult, but if it's (conversational or formal) prose, then going thru mere thousands of filings by hand is apt to look quite easy by comparison.

    And what sort of sample data (other than the actual source data) would provide a thorough test of the code?


    ++$anecdote ne $data

Re: Perl Possibilities
by Ea (Chaplain) on Mar 15, 2016 at 15:33 UTC
    Yes, Perl is one of many languages that can help you with Text Mining and Information Retrieval. It certainly helped with my thesis, but like all tools, it can take some time to learn to use. What you haven't clearly said is whether you only want to do a bit of parsing/extracting or go full on into Natural Language Processing, which is why you were asked for a small sample to give us an idea of what you're after. As mentioned above, your baby steps will likely involve regular expressions, which are actually a small language all on their own, but IMHO Perl has the best regex implementation. You can find a short introduction in Modern Perl. I liked it so much, I bought a paper copy.

    There are a lot of third party libraries available from CPAN that go beyond what's easily achievable with the basic language. Investigate the Lingua::EN modules for some ideas and try not to re-invent the wheel without a good reason. Best of luck,

    Sometimes I can think of 6 impossible LDAP attributes before breakfast.

Re: Will it work?
by CountZero (Bishop) on Mar 15, 2016 at 19:09 UTC
    Perl is certainly the best choice for such a problem, but it is not a magic bullet.

    Perl excels in extracting data from many types of files, but whether there is actually a solution for your problem will depend less on the programming language than on the data you are given. If the data are in a more or less standard format, for instance, the recommendation is always the last sentence or paragraph of the file, then you have a fighting chance to succeed. But if the data is essentially free format then you will first have to solve the problem of natural language parsing and understanding and that is quite a different task!

    That being said, I once had to extract, from a database with several hundred thousand descriptions of claims, those records which concerned temperature damage to temperature-controlled cargo in containers. I let Perl randomly choose about 500 records and marked these by hand as "hit" or "miss". Then these records and their "hit or miss" indications were given to a second Perl script that did a Bayesian analysis (there are modules on CPAN that provide all the basic infrastructure for you) and built a corpus of "hit" and "miss" words. With this corpus and the Bayesian analysis modules the whole database was analyzed and the "hits" identified. A final script extracted a random sample from these results that was checked by hand to see how accurate the process was and to give some statistically founded levels of confidence. If I remember well, it had about 5% wrongly categorized records. Not a perfect result, but "good enough" for my purpose then, and besides I only had one day to deliver a result.
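
A rough sketch of the train-then-classify workflow described above, using a hand-rolled naive Bayes classifier in plain Perl so that it runs without extra modules. This is an illustration only, not the scripts or the CPAN modules that were actually used: the sub names (tokens, train, classify), the add-one smoothing and the four training sentences are all made up for the example.

```perl
use strict;
use warnings;

my ( %word_count, %total_words, %doc_count, %vocab );

# Lower-case the text and split it into words.
sub tokens {
    my ($text) = @_;
    return grep { length $_ } split /[^a-z]+/, lc $text;
}

# Record one hand-labelled example.
sub train {
    my ( $label, $text ) = @_;
    $doc_count{$label}++;
    for my $w ( tokens($text) ) {
        $word_count{$label}{$w}++;
        $total_words{$label}++;
        $vocab{$w} = 1;
    }
}

# Return the label with the highest log-probability (add-one smoothing).
sub classify {
    my ($text) = @_;
    my $docs = 0;
    $docs += $_ for values %doc_count;
    my $v = keys %vocab;    # vocabulary size
    my ( $best, $best_score );
    for my $label ( sort keys %doc_count ) {
        my $score = log( $doc_count{$label} / $docs );
        for my $w ( tokens($text) ) {
            my $n = $word_count{$label}{$w} || 0;
            $score += log( ( $n + 1 ) / ( ( $total_words{$label} || 0 ) + $v ) );
        }
        ( $best, $best_score ) = ( $label, $score )
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

# Hypothetical hand-marked records ("hit" = temperature damage).
train( hit  => 'reefer container temperature too high cargo thawed' );
train( hit  => 'frozen cargo damaged after cooling unit failure' );
train( miss => 'container dropped in port cargo crushed' );
train( miss => 'cargo stolen from truck during transport' );

print classify('cooling unit failure thawed the frozen fish'), "\n";
print classify('truck stolen in port'), "\n";
```

In real use the training set would be the few hundred hand-marked records, and a random sample of the classified output would still need checking by hand to estimate the error rate.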

    Update: added description of a real use case.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Perl Possibilities
by Gideau (Novice) on Mar 15, 2016 at 12:16 UTC

    So in response to the comments on this and also my badly formatted one:

    The files containing the information are quite standardized and in .txt format written in English. Every filing contains the phrase "the board of directors recommends to vote for/against...". Basically I want to match the outcome of the votes (which I already have in a database) being PASS or FAIL with a simple recommendation being YES or NO which I hope to obtain from the filings with Perl.

    Also I'm wondering what the final output of the Perl program could look like. This is because I will have to merge the already existing vote outcome database with the corresponding recommendations. Is this also something that can be done?

    Thanks for the quick replies!
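
One possible shape for the output, sketched under assumptions: the pattern below only covers the exact wording quoted above ("the board of directors recommends to vote for/against"), and the file names, the sample texts and the column names are hypothetical. It turns each filing into a YES/NO value and prints CSV, which Stata and most other statistics packages can import and then merge with the vote database on a shared key column.

```perl
use strict;
use warnings;

# Map the wording to YES/NO; returns undef when the phrase is absent.
sub recommendation {
    my ($text) = @_;
    return undef
        unless $text =~ /board\s+of\s+directors\s+recommends\s+to\s+vote\s+(for|against)\b/i;
    return lc($1) eq 'for' ? 'YES' : 'NO';
}

# Build one CSV line; quote every field and double any embedded quotes.
sub csv_line {
    my @fields = @_;
    return join( ',', map { ( my $f = $_ ) =~ s/"/""/g; qq{"$f"} } @fields ) . "\n";
}

# Hypothetical filings, held in memory to keep the sketch self-contained.
my %filings = (
    'filing_0001.txt' => 'The Board of Directors recommends to vote FOR the proposal.',
    'filing_0002.txt' => "The board of directors recommends\nto vote against this proposal.",
    'filing_0003.txt' => 'No recommendation is stated in this text.',
);

print csv_line(qw(filename recommendation));
for my $name ( sort keys %filings ) {
    my $rec = recommendation( $filings{$name} );
    print csv_line( $name, defined $rec ? $rec : 'UNKNOWN' );
}
```

Filings that do not match are written out as UNKNOWN rather than skipped, so they can be found and checked by hand.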

      Please show example .txt ... complicated cases please

      > Is this also something that can be done?

      • Yes
      • No
      • Depends
      ;)

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        How exactly do you want me to do that? Not sure if I can upload something here?
      Based on your description, something like this might be a good start. It assumes you can run a pipeline command that feeds the list of file names to the perl script's STDIN - e.g. using a bash shell with standard GNU/Linux utils, a command line like this:
          cd {top_level_directory_where_all_files_are_located}
          find * -type f | search_script.pl > hit_list.txt
      And "search_script.pl" would be something like this:
          #!/usr/bin/perl
          use strict;
          use warnings;

          # Read one file name per line from STDIN.
          while (<STDIN>) {
              chomp;
              next unless -f $_;
              my $text = do {
                  if ( open( my $fh, '<', $_ ) ) {
                      local $/;    # slurp the whole file at once
                      <$fh>;
                  }
                  else {
                      warn "Unable to read $_\n";
                  }
              };
              if ( $text =~ /\s(recommend\S*\s+to\s+vote\s+\S+)/ ) {
                  ( my $hit = $1 ) =~ s/\s+/ /g;    # normalize whitespace
                  print "$_: $hit\n";
              }
              else {
                  print "$_: NO_MATCH\n";
              }
          }
      Note that in bash you can redirect STDERR as well:
      find * -type f | search_script.pl > hit_list.txt 2> search.errlog
      The output to STDOUT will tell you which files have the sought-for text, and what the text was. It also lists the files that failed to match, so you can take a closer look at those, and tweak the regex as needed.

      The regex proposed above will match all the inflections on "recommend" (-ed, -ing, -s, -ation), and will capture the matched phrase only up to the word that follows "vote". (You can extend the capture to include more words before and/or after, if you like, by adding more \s+\S+\s+ elements inside the parens.)

      When there's a match, any kind of whitespace between words is allowed, and it's all normalized to a single space before output, to ensure one line of output per file.
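
To see what the pattern captures and how the whitespace normalization behaves, here is a small stand-alone sketch that applies the same regex to a string instead of a file. The sub name find_recommendation and the sample sentence are made up for illustration.

```perl
use strict;
use warnings;

# Same pattern as in the script above: "recommend" plus any suffix,
# then "to vote", then one more word.
my $pattern = qr/\s(recommend\S*\s+to\s+vote\s+\S+)/;

# Returns the matched phrase with whitespace collapsed, or undef.
sub find_recommendation {
    my ($text) = @_;
    return undef unless $text =~ $pattern;
    ( my $hit = $1 ) =~ s/\s+/ /g;
    return $hit;
}

# Hypothetical sample text with a line break inside the phrase.
my $sample = "The board of directors recommends\n   to vote  AGAINST this proposal.";
my $hit    = find_recommendation($sample);
print defined $hit ? "$hit\n" : "NO_MATCH\n";
```

Note that the pattern is case-sensitive as written; if some filings print the sentence in capitals, adding the /i modifier is one way to catch those too.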

        This actually really looks like something I can use! Thank you very much!

        I will have to add something in the code that would differentiate the different proposals from each other and match the recommendations and proposals, since there could be more proposals per filing, and thus more recommendations. This, however, is a very very nice start for me:D
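
One way this could be approached, sketched under a big assumption: that each proposal is introduced by a numbered heading such as "Proposal 1" and that its recommendation follows before the next heading. Real filings may well be laid out differently, and the sub name pair_recommendations and the sample text are invented for the example. The idea is to walk through the text once with a global match, remember the latest proposal number, and pair it with the next recommendation.

```perl
use strict;
use warnings;

# Scan the text once; remember the most recent proposal number and
# pair it with the first recommendation that follows it.
sub pair_recommendations {
    my ($text) = @_;
    my ( @pairs, $current );
    while ( $text =~ /proposal\s+(\d+)|recommends\s+to\s+vote\s+(for|against)\b/gi ) {
        if ( defined $1 ) {
            $current = $1;                      # new proposal heading
        }
        elsif ( defined $current ) {
            push @pairs, [ $current, uc $2 ];   # recommendation for it
            undef $current;                     # one recommendation per proposal
        }
    }
    return @pairs;
}

# Hypothetical filing text with two proposals.
my $sample = <<'END';
Proposal 1: Election of directors.
The board of directors recommends to vote FOR this proposal.
Proposal 2: Shareholder proposal on executive pay.
The board of directors recommends to vote AGAINST this proposal.
END

printf "proposal %s: %s\n", @$_ for pair_recommendations($sample);
```

A recommendation that appears without a preceding numbered heading is dropped by this sketch, so the number of pairs per filing is worth comparing against the number of proposals expected.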

Re: Perl Possibilities
by Anonymous Monk on Mar 15, 2016 at 11:42 UTC

    Yes, it's possible!

    The basic approach would be to use one or more regular expressions to extract the information from the source text.

    However, the best way to go about it depends on the format of the source files, could you tell us that? Is it possible for you to provide an example?

    If you haven't yet, you should probably start with perlintro and/or Getting Started with Perl to get an overview of how to work with Perl.

Re: Will it work?
by Marshall (Canon) on Mar 15, 2016 at 21:04 UTC
    Without an example, I have trouble answering your question. However, if the situation is one where a very detail oriented person who knows minimal English could sit there and look at 1,000 papers and summarize the results by extracting certain key phrases, even without knowing exactly what they mean, then the probability is high that a program can be written to do that.

    Programs don't work well with "sort of" or "interpret what you think about this...". "Recommend: Yes/No" is something that a program can detect. "I'm leaning towards voting Yes, but at this time, I am unsure" is something that a program has close to zero chance of figuring out.

    To have a chance at this, you need to identify some key phrases and a syntax that a very, very literal detailed person could use to extract your info. This very, very literal detailed person (the program) will do its job flawlessly, but only within very strict rules. You could wind up in a situation where the program can do 900 of 1,000 files with a clear result, but yet you wind up with 100 to do manually. This has to do with the "rules" and whether the detailed savant (the program) can tell if it got a valid result or not. I've worked with situations where the program can get to 99.5% with certainty, but for the other 0.5%, it knows that it is not certain.

    Update: 0.5% may not seem like a lot, but if there are 350,000 records, this is a big deal. Try to find some simple rules where you are absolutely certain that the correct result has been found. Then see what that percentage is. If that is 90%, then you are probably in pretty good shape as the program did 90% of the work! To get something like this completely automated, the program may need to start applying some ad-hoc rules that involve some uncertainty and that means that the program will guess "wrong" some of the time. You have to decide whether that matters or not.
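
A minimal sketch of the "strict rules first, then measure" idea, assuming the fixed wording "recommends to vote for/against" from earlier in the thread. The sub names (classify, coverage) and the sample sentences are hypothetical. A text gets a label only when exactly one direction is found; everything else is flagged UNCERTAIN for manual review.

```perl
use strict;
use warnings;

# Strict rule: accept a result only when exactly one direction appears.
sub classify {
    my ($text) = @_;
    my %seen;
    $seen{ uc $1 }++ while $text =~ /recommends\s+to\s+vote\s+(for|against)\b/gi;
    return 'UNCERTAIN' unless keys(%seen) == 1;
    return ( keys %seen )[0];
}

# Share of texts (in percent) that the strict rule could decide.
sub coverage {
    my @labels = map { classify($_) } @_;
    return 0 unless @labels;
    my $decided = grep { $_ ne 'UNCERTAIN' } @labels;
    return 100 * $decided / @labels;
}

# Hypothetical sample texts.
my @texts = (
    'The board recommends to vote FOR the proposal.',
    'The board recommends to vote against the proposal.',
    'The board is leaning towards a yes, but is unsure at this time.',
    'It recommends to vote for item 1 and recommends to vote against item 2.',
);

print "$_\n" for map { classify($_) } @texts;
printf "decided automatically: %.0f%%\n", coverage(@texts);
```

The coverage figure shows how much manual work remains; only if it is too low is it worth adding looser rules that may guess wrong.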

Re: Perl Possibilities
by Gideau (Novice) on Mar 16, 2016 at 12:39 UTC
    Hey guys!

    Thanks so much already for all the helpful comments you've given me. I feel that I'm already making progress on tackling my research!

    So as some of you asked, I will post a small subset of text that contains the information I need. To give some context: it's from an SEC filing that companies have to make when there are any proposals made by shareholders. As said before, I'm looking for the recommendation of the Board of Directors on these proposals. An example of a filing that I will be using can be found here:

    Filing Example

    The filings differ between different companies in terms of proposals etc. However, in each filing there is this recommendation that I'm looking for that is (almost) always stated in the same manner. I hope this gives some more clarity!

    Thanks again so much for your help. I really appreciate it!

      As the data already is in a fairly tabular format, and in HTML, I would use HTML::TableExtract to get at the table data. With the data in hand, it should be easy to extract the vote recommendations by looking whether FOR or AGAINST is contained in the relevant column.

        You're right about that indeed. However, the problem is that very few companies use such a table as in the example where they clearly state the proposals and their recommendations, as far as I know.

        Furthermore, I've already downloaded quite a few filings for testing purposes, and they end up as .txt files that are still formatted in HTML (so you see all the HTML code in the .txt surrounding the actual text). Would you say it's smarter to keep the .txt or convert back to .html before I run the extraction scripts?
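
For what it's worth, the file extension makes no difference to Perl; only the contents matter. A crude sketch of removing the markup before applying a regex is shown below. It is deliberately simplistic (regex-based tag stripping, a handful of entities) and the sub name strip_html and the sample snippet are invented; a proper HTML parser from CPAN would be the more robust choice for messy real-world filings.

```perl
use strict;
use warnings;

# Crude tag stripper: fine for a first look, not a real HTML parser.
sub strip_html {
    my ($html) = @_;
    $html =~ s/<(script|style)\b.*?<\/\1>//gis;    # drop script/style blocks
    $html =~ s/<[^>]*>/ /g;                         # replace each tag by a space
    my %entity = ( nbsp => ' ', amp => '&', lt => '<', gt => '>', quot => '"' );
    $html =~ s/&(nbsp|amp|lt|gt|quot);/$entity{ lc $1 }/gie;    # a few common entities
    $html =~ s/\s+/ /g;                             # collapse whitespace
    $html =~ s/^ | $//g;                            # trim both ends
    return $html;
}

# Hypothetical snippet in the style of an HTML filing saved as .txt.
my $raw = '<p><b>The Board of Directors recommends</b><br>to vote&nbsp;<i>AGAINST</i> the proposal.</p>';
print strip_html($raw), "\n";
```

Once the text is cleaned like this, line breaks and tags no longer sit between the words of the sought-for phrase, which makes the matching patterns simpler.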

Re: Perl Possibilities
by perlfan (Vicar) on Mar 16, 2016 at 16:44 UTC
    protip: Some people say "Perl" refers to the language and "perl" refers to the binary interpreter. I am in that camp. But everyone agrees that one should never write, "PERL". In fact, if you see that on a job posting or resume, run. Run far away.

Node Type: perlquestion [id://1157781]
Approved by Ratazong
Front-paged by Corion