Re^2: Perl Possibilities

Based on your description, something like this might be a good start. It assumes you can run a pipeline that feeds the list of file names to the Perl script's STDIN, e.g. in a bash shell with standard GNU/Linux utilities, using a command line like this:

cd {top_level_directory_where_all_files_are_located}

find * -type f | search_script.pl > hit_list.txt

And "search_script.pl" would be something like this:

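#!/usr/bin/perl

use strict;
use warnings;

# Read one file name per line from STDIN (e.g. the output of "find").
while ( my $file = <STDIN> ) {
    chomp $file;
    next unless -f $file;

    # Three-argument open, so odd characters in a file name are harmless.
    # An unreadable file is reported on STDERR and skipped, rather than
    # being listed as NO_MATCH.
    my $fh;
    unless ( open( $fh, '<', $file )) {
        warn "Unable to read $file: $!\n";
        next;
    }

    # Slurp the whole file into one string.
    my $text = do { local $/; <$fh> };
    close $fh;

    # Capture from "recommend..." through the word that follows "vote".
    if ( $text =~ /\s(recommend\S*\s+to\s+vote\s+\S+)/ ) {
        ( my $hit = $1 ) =~ s/\s+/ /g;    # normalize whitespace
        print "$file: $hit\n";
    }
    else {
        print "$file: NO_MATCH\n";
    }
}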

Note that in bash you can redirect STDERR as well:

find * -type f | search_script.pl > hit_list.txt 2> search.errlog

The output to STDOUT will tell you which files have the sought-for text, and what the text was. It also lists the files that failed to match, so you can take a closer look at those, and tweak the regex as needed.

The regex proposed above will match all the inflections of "recommend" (-ed, -ing, -s, -ation), and will capture the matched phrase only up to the word that follows "vote". (You can extend the capture to include more words if you like, by adding \S+\s+ elements at the start of the parens for preceding words, or \s+\S+ elements at the end for following words.)

When there's a match, any kind of whitespace between words is allowed (including line breaks), and it's all normalized to a single space before output, to ensure one line of output per file.
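To see the capture and the whitespace normalization in isolation, here is a small sketch that runs the same regex on an in-memory string instead of a file. The helper names find_hit and find_hit_wide and the sample sentence are made up for illustration; the "wide" variant simply adds one word on each side of the capture.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): return the
# normalized hit for a chunk of text, or undef if nothing matches.
sub find_hit {
    my ($text) = @_;
    return unless $text =~ /\s(recommend\S*\s+to\s+vote\s+\S+)/;
    ( my $hit = $1 ) =~ s/\s+/ /g;    # collapse newlines, tabs, space runs
    return $hit;
}

# Same idea, with the capture widened by one word on each side.
sub find_hit_wide {
    my ($text) = @_;
    return unless $text =~ /\s(\S+\s+recommend\S*\s+to\s+vote\s+\S+\s+\S+)/;
    ( my $hit = $1 ) =~ s/\s+/ /g;
    return $hit;
}

# Made-up sample with a line break, a double space and a tab in the phrase.
my $sample = "The Board unanimously\nrecommends  to\tvote FOR the proposal.";

my $hit  = find_hit($sample);
my $wide = find_hit_wide($sample);
print "narrow: ", ( defined $hit  ? $hit  : 'NO_MATCH' ), "\n";
print "wide:   ", ( defined $wide ? $wide : 'NO_MATCH' ), "\n";
```

Note that the leading \s means a phrase at the very start of the text (with nothing before it) would not match; that is the same behavior as the regex in the script.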

Re^3: Perl Possibilities by Gideau (Novice) on Mar 16, 2016 at 12:49 UTC
This actually really looks like something I can use! Thank you very much! I will have to add something to the code that distinguishes the proposals from one another and matches each recommendation to its proposal, since a filing can contain several proposals and thus several recommendations. This, however, is a very, very nice start for me :D
Re^4: Perl Possibilities by graff (Chancellor) on Mar 18, 2016 at 22:05 UTC
I suggested the code above several hours before you provided a sample of data ("Filing Example" below), so I was unaware that you were dealing with HTML data. That changes things. For example, in HTML, some "whitespace" looks like this: `... Board recommends a vote FOR Proposal No. 2. ...` and given the variety of distinct sources (which presumably use distinct HTML/CSS formats and styles), I'd expect a variety of structural differences in the tags that appear in and around the patterns of interest.

BTW, on the matter of "html" vs. "txt", it doesn't matter what a given file name looks like - what matters is what the content looks like. If the content has HTML tags, it's HTML data, and needs to be treated as such, regardless of what the file name might be.

If it's typical for texts of this sort to always include a single table near the top of the document that lists the proposals with number, name, and result, it may be that your best bet is Corion's idea about HTML::TableExtract. It's just a matter of knowing which table in the overall file is the one you want.

Aside from that, any other practical approach will involve parsing the HTML first to get its plain-text content before you do anything that involves string comparisons or regex matches.
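As a rough sketch of that last point: the snippet below uses the CPAN module HTML::Parser (assumed to be installed) to pull the visible text out of an HTML string before applying the regex from the original script. The helper name html_to_text and the sample markup are made up for illustration, and real filings will likely need more care, e.g. with tables and character encodings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();    # CPAN module, assumed to be installed

# Hypothetical helper: return the visible text of an HTML string,
# with entities decoded and all whitespace collapsed to single spaces.
sub html_to_text {
    my ($html) = @_;
    my @chunks;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the callback text with entities already decoded
        text_h => [ sub { push @chunks, shift }, 'dtext' ],
    );
    $parser->ignore_elements(qw(script style));    # not visible text
    $parser->unbroken_text(1);    # don't split a text run into pieces
    $parser->parse($html);
    $parser->eof;

    # Joining with a space keeps words in adjacent elements apart
    # (at the cost of splitting a word that has a tag inside it).
    my $text = join ' ', @chunks;
    $text =~ tr/\xA0/ /;    # a decoded &nbsp; is not always matched by \s
    $text =~ s/\s+/ /g;
    return $text;
}

# Made-up markup, not taken from the actual filings.
my $html = '<p>The Board&nbsp;<b>recommends</b> to&nbsp;vote <i>FOR</i> Proposal 2.</p>';
my $text = html_to_text($html);

if ( $text =~ /\s(recommend\S*\s+to\s+vote\s+\S+)/ ) {
    print "hit: $1\n";
}
else {
    print "NO_MATCH\n";
}
```

In the file-based script, the same call would go right after the file is slurped, so that the regex only ever sees plain text.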

