PerlMonks  

Re^2: Perl Possibilities

by graff (Chancellor)
on Mar 16, 2016 at 05:37 UTC


in reply to Re: Perl Possibilities
in thread Perl Possibilities

Based on your description, something like this might be a good start. It assumes you can run a pipeline command that feeds the list of file names to the Perl script's STDIN, e.g. using a bash shell with standard GNU/Linux utils, with a command line like this:
    cd {top_level_directory_where_all_files_are_located}
    find * -type f | search_script.pl > hit_list.txt
And "search_script.pl" would be something like this:
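    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<STDIN>) {
        chomp;
        next unless -f $_;    # skip anything that is not a plain file

        # Open with three-arg open; on failure, warn and move on to the next
        # file instead of falling through to the regex with a bogus value.
        my $fh;
        unless ( open( $fh, '<', $_ ) ) {
            warn "Unable to read $_: $!\n";
            next;
        }

        # Slurp the whole file so the phrase can span line breaks.
        my $text = do { local $/; <$fh> };
        close $fh;

        if ( $text =~ /\s(recommend\S*\s+to\s+vote\s+\S+)/ ) {
            ( my $hit = $1 ) =~ s/\s+/ /g;    # normalize whitespace to single spaces
            print "$_: $hit\n";
        }
        else {
            print "$_: NO_MATCH\n";
        }
    }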
Note that in bash you can redirect STDERR as well:
    find * -type f | search_script.pl > hit_list.txt 2> search.errlog
The output to STDOUT will tell you which files have the sought-for text, and what the text was. It also lists the files that failed to match, so you can take a closer look at those, and tweak the regex as needed.

The regex proposed above will match all the inflections on "recommend" (-ed, -ing, -s, -ation), and will capture the matched phrase only up to the word that follows "vote". (You can extend the capture to include more words before and/or after, if you like, by adding more \s+\S+\s+ elements inside the parens.)

When there's a match, any kind of whitespace between words is allowed (including line breaks), and it's all normalized to a single space before output, to ensure one line of output per file.
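To see what the pattern captures and how the whitespace normalization behaves, here is a minimal sketch that applies the same regex to an in-memory string rather than a file. The sub name find_recommendation and the sample text are made up for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the normalized phrase, or undef if no match.
# The regex is the one used in search_script.pl.
sub find_recommendation {
    my ($text) = @_;
    return unless $text =~ /\s(recommend\S*\s+to\s+vote\s+\S+)/;
    ( my $hit = $1 ) =~ s/\s+/ /g;    # collapse newlines, tabs, runs of spaces
    return $hit;
}

# Made-up sample: the phrase is split across lines with uneven spacing.
my $sample = "The Board unanimously recommends\n   to vote\tFOR the proposal.";

my $hit = find_recommendation($sample);
print defined $hit ? "$hit\n" : "NO_MATCH\n";    # prints "recommends to vote FOR"
```

Two things worth noticing when tweaking the pattern: the leading \s means a phrase at the very start of a file will not match, and the match is case-sensitive unless you add the /i modifier.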

Re^3: Perl Possibilities
by Gideau (Novice) on Mar 16, 2016 at 12:49 UTC
    This actually really looks like something I can use! Thank you very much!

    I will have to add something in the code that would differentiate the different proposals from each other and match the recommendations and proposals, since there could be more proposals per filing, and thus more recommendations. This, however, is a very very nice start for me:D

      I suggested the code above several hours before you provided a sample of data ("Filing Example" below), so I was unaware that you were dealing with HTML data. That changes things. For example, in HTML, some "whitespace" looks like this:
      ... Board recommends a vote FOR Proposal No.&nbsp;2. ...
      and given the variety of distinct sources (which presumably use distinct HTML/CSS formats and styles), I'd expect a variety of structural differences in the tags that appear in and around the patterns of interest.

      BTW, on the matter of "html" vs. "txt", it doesn't matter what a given file name looks like - what matters is what the content looks like. If the content has HTML tags, it's HTML data, and needs to be treated as such, regardless of what the file name might be.

      If it's typical for texts of this sort to always include a single table near the top of the document that lists the proposals with number, name, and result, it may be that your best bet is Corion's idea about HTML::TableExtract. It's just a matter of knowing which table in the overall file is the one you want.
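If the table route works out, a minimal sketch with HTML::TableExtract (a CPAN module, assumed to be installed) might look like the following. The column headings "Number", "Proposal Name" and "Result" and the sample HTML are invented for illustration; real filings will have their own headings, and picking the right ones is the "knowing which table" part.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;    # CPAN module, not part of core Perl

# Invented sample; a real filing would be read from a file instead.
my $html = <<'END_HTML';
<html><body>
<table><tr><td>Some layout table</td></tr></table>
<table>
  <tr><th>Number</th><th>Proposal Name</th><th>Result</th></tr>
  <tr><td>1</td><td>Election of Directors</td><td>Approved</td></tr>
  <tr><td>2</td><td>Ratification of Auditors</td><td>Approved</td></tr>
</table>
</body></html>
END_HTML

# Select only tables whose header row contains these (assumed) headings.
my $te = HTML::TableExtract->new( headers => [qw(Number Name Result)] );
$te->parse($html);

my @proposals;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        # Trim each cell and treat empty cells as empty strings.
        my @cells = map { my $c = $_ // ''; $c =~ s/^\s+|\s+$//g; $c } @$row;
        push @proposals, \@cells;
    }
}

print join( ' | ', @$_ ), "\n" for @proposals;
```

Selecting by headers rather than by table position should make the layout table in the sample drop out, which is useful if the filings come from sources that wrap content in different numbers of tables.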

      Aside from that, any other practical approach will involve parsing the HTML first to get its plain-text content before you do anything that involves string comparisons or regex matches.
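As a rough sketch of that "parse first, match second" order, the following uses HTML::TreeBuilder (from the HTML-Tree distribution on CPAN, assumed to be installed) to reduce markup to plain text before any regex is applied. The helper name html_to_text is hypothetical, the sample is built from the "Proposal No.&nbsp;2" snippet quoted earlier with some invented tags around it, and the pattern is adapted to "a vote" to fit that sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module (HTML-Tree distribution), not core Perl

# Hypothetical helper: reduce an HTML string to whitespace-normalized plain text.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;    # tags dropped, entities such as &nbsp; decoded
    $tree->delete;                # free the parse tree
    $text =~ tr/\x{A0}/ /;        # no-break space (from &nbsp;) -> plain space
    $text =~ s/\s+/ /g;           # collapse remaining whitespace runs
    return $text;
}

# Sample with invented markup around the sentence quoted in this thread.
my $html = '<p>The <b>Board</b> recommends a vote FOR Proposal No.&nbsp;2.</p>';

my $text = html_to_text($html);
print "$text\n";
print "matched: $1\n" if $text =~ /(recommend\S*\s+a\s+vote\s+\S+)/;
```

One caveat with this approach: as_text joins text nodes without inserting any separator, so words on either side of a block-level boundary (adjacent paragraphs or table cells) can run together. For real filings you may need to walk the tree and add your own separators.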
