PerlMonks  

Help with negative look ahead

by eversuhoshin (Sexton)
on Oct 18, 2012 at 19:17 UTC ( [id://999798]=perlquestion )

eversuhoshin has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have been struggling for a while with a negative lookahead regex. I am trying to extract certain excerpts from financial statements; specifically, I would like to extract "Item 1. Business." Here is the file link: http://sec.gov/Archives/edgar/data/931683/000115752309002434/a5927574.txt

I extract the Item 1 section using boundaries that run from "Item 1" to "Item 1A", "Item 2", or "Item 3". Unfortunately, the regex does not extract the whole of Item 1, because it matches an "Item 3" that is mentioned inside the excerpt: it stops at "discussed more thoroughly in Item 3". I have tried using a negative lookahead to make it match all the way to the real boundary, but I can't get my code to work. Here is my code:

if($data=~m/(?:Item|ITEM)[Ss]?\s?(?:\.|\-|\:|\-|\,)?\s?(?:1|I)\s?(?:\.|\-|\:|\,|\-\-)?\s?
    (?:Description|DESCRIPTION|Discussion|DISCUSSION)?\s?(?:[Oo]f|OF)?\s?(?:[Tt][Hh][Ee])?\s?(?:Our|OUR)?\s?
    (?:Busine\s?ss|BUSINE\s?SS|Company|COMPANY)\s?(?:\.|\-|\:|\-\-|\,)?
    (.*?)
    (?!discussed\s?more\s?in\s?|set\s?forth\s?in\s?|see\s?)
    (?:Item|ITEM)\s?(?:\.|\-|\:|\-\-|\,)?\s?(?:I|1A|1B|2|3)\s?(?:\.|\-|\:|\-\|\,)?
/xisg)

It would be great if you could help me figure out a way to make the regex match only the real end (the beginning of Item 3), not the words "Item 3" inside the text of Item 1. Thank you so much!

Replies are listed 'Best First'.
Re: Help with negative look ahead
by ELISHEVA (Prior) on Oct 19, 2012 at 07:03 UTC

    How stable is the format of this file? Have you done any statistical analysis to test your assumptions? For instance, are section headings always left aligned? Always in caps as in the sample file? There is variability in the dividers between item number and section title (sometimes a colon and sometimes a hyphen). Is this the only variability?

    You mention that sometimes section 3 is found within section 1. Do you mean that section 1 is interrupted by section 3 and then resumes? Or that section 3 immediately follows section 1? If section 1 resumes how do you know as a human reader that you have transitioned from the end of section 3 and back to the remainder of section 1?

    In general using regexes in natural language documents to identify the boundaries of semantic chunks is not very reliable. Regexes are the textual equivalent of hearing sentences in a language you don't know. As a listener you can identify that certain sound sequences occur but if you hear them in two places you have no way of knowing if both are part of a noun or one is part of a verb and another is part of a noun. And even if it turns out both are part of a noun, you don't know whether they mean the same thing because nouns can sometimes have two meanings.

    Using regexes sometimes works if you have a rigid document format and no possibility that markers of section boundaries can occur elsewhere in the document with different meanings and uses. For example, suppose the SEC will only accept documents where (a) section titles are always marked by the word "ITEM" (all caps) followed by the section title, (b) titles never cross line boundaries and are limited to a specific set of values, (c) the next line is always a series of hyphens, and (d) the number of hyphens equals the number of characters in item + title. It would be highly unlikely that such a sequence would appear naturally as part of the regular text of a section. You could then use such a structure to chunk the text.
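    The hypothetical rigid format described above (an "ITEM" heading line followed by a line of hyphens of exactly the same length) could be chunked roughly as follows. This is only a minimal sketch: the helper name chunk_by_underlined_headings and the sample text are made up for illustration, and real filings may well not follow this layout.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref mapping each heading line to the
# text that follows it. A heading is an "ITEM ..." line whose next line is
# made of hyphens and has exactly the same length.
sub chunk_by_underlined_headings {
    my ($text) = @_;
    my @lines = split /\r?\n/, $text;
    my (%section, $current);
    my $skip_underline = 0;

    for my $i (0 .. $#lines) {
        # The line right after a heading is its underline; drop it.
        if ($skip_underline) { $skip_underline = 0; next; }

        my $line = $lines[$i];
        my $next = $i < $#lines ? $lines[$i + 1] : '';

        if ($line =~ /^ITEM\s/ && $next =~ /^-+$/ && length($next) == length($line)) {
            $current           = $line;
            $section{$current} = '';
            $skip_underline    = 1;
            next;
        }

        # Text before the first heading is ignored.
        $section{$current} .= "$line\n" if defined $current;
    }
    return \%section;
}

# Made-up sample: "ITEM 3" appears in the body but has no underline,
# so it is not treated as a heading.
my $h1 = 'ITEM I: BUSINESS';
my $h2 = 'ITEM 2: PROPERTIES';
my $sample = join "\n",
    $h1, '-' x length($h1),
    'We run bingo halls, as discussed more thoroughly in Item 3.',
    'ITEM 3 is only mentioned here, with no underline below it.',
    $h2, '-' x length($h2),
    'We lease office space.';

my $sections = chunk_by_underlined_headings($sample);
print "$_\n$sections->{$_}\n" for sort keys %$sections;
```

    The underline check is what keeps a stray "ITEM 3" at the start of a body line from being mistaken for a boundary; without a structural cue of that kind, the ambiguity remains.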

    On the other hand, if "item" can be lower or upper case and there is no SEC-mandated format for titles, then you do indeed have a problem, because there are many uses of the word "item" even in your sample text. Even if it were true that titles are always left aligned, that wouldn't be enough to pick out the section headings. Since section content is also left aligned, there is a significant possibility that "item" as part of the content text will be left aligned in at least some of the SEC files. You'd have to do statistical analysis on the rate of false matches, i.e. compare your algorithm's extraction to a human reader's extraction. Then you would have to check with your client whether that rate is acceptable. If your client thinks there are too many false matches, you'll need some mechanism to disambiguate between the different contextual uses of "item", and you may need to look into setting up some sort of Bayesian filter and training corpus.

Re: Help with negative look ahead
by PrakashK (Pilgrim) on Oct 18, 2012 at 19:32 UTC

    In your particular case, regexes seem to be overkill. Since the text has "ITEM " only in headings and nowhere else, and that is where you want to split, it's easier to use split:

        use File::Slurp qw<slurp>;

        my $text  = slurp("a5927574.txt");
        my @items = split "ITEM ", $text;

        # print first item, as it is
        print shift @items;

        # prepend "ITEM " to the rest and print
        print "ITEM $_" for @items;

      It seems likely, looking at his regex, that he's insecure about the notion of "ITEM" being capitalized uniformly over his full data set. The sample he provided us is uniform, but why else would he go to all the trouble of creating alternations like (?:Item|ITEM) several times? If he's unable to depend on an all-caps "ITEM" as a delimiter, your solution won't be any more robust than his current one.


      Dave
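      If capitalization really does vary across the data set, one possible adjustment is to split case-insensitively, but only where "item" starts a line and is followed by a number and a separator. The sketch below is an illustration under those assumptions; the pattern and the sample text are invented for the example, and an "Item 3" that happens to begin a wrapped line would still be treated as a boundary.

```perl
use strict;
use warnings;

# Made-up sample with mixed capitalization and an in-text "Item 3".
my $text = "Preamble\n"
         . "Item 1. Business\n"
         . "Body text, see Item 3 for details.\n"
         . "ITEM 2. Properties\n"
         . "More text.\n";

# Split at the start of any line that begins with "item" (any case),
# followed by a number or letter code and a separator. The lookahead is
# zero-width, so each heading stays attached to its own chunk.
my @chunks = split /^(?=item\s+\w+\s*[.:-])/mi, $text;

print "--- chunk ---\n$_" for @chunks;
```

      Because the split point is a zero-width lookahead, there is no need to prepend the delimiter again afterwards.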

      Hey PrakashK, thanks for the quick reply. The issue is that I still don't know whether I matched the real end of Item 1. Also, I have a bunch of other SEC files that mention Item 3 inside Item 1. This is the hardest part, because I am matching almost pure text. The ones with HTML I was able to match more easily. Anyway, thank you for your reply!
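      One detail worth checking in the original regex is where the assertion sits. A negative lookahead written immediately before "Item" inspects the text that starts at "Item" itself, so it never sees the words that come before it and always succeeds. Rejecting an "Item 3" that follows certain words calls for a lookbehind instead. The following is a minimal sketch with an invented sample string and deliberately simplified patterns, not a drop-in replacement for the full regex.

```perl
use strict;
use warnings;

# Made-up sample: "Item 3" is mentioned in the body before the real heading.
my $data = "Item 1. Business\n"
         . "Risks are discussed more thoroughly in Item 3 below.\n"
         . "Item 3. Legal Proceedings\n";

# Lookahead placed right before "Item": at that position the upcoming text is
# "Item 3 ...", never "thoroughly in ", so the assertion is always true and
# the match stops at the in-text mention.
my ($with_lookahead) = $data =~ /Business(.*?)(?!thoroughly\sin\s)Item\s3/s;

# Fixed-width negative lookbehind: skip any "Item 3" directly preceded by
# "in" plus one whitespace character, and carry on to the next candidate.
my ($with_lookbehind) = $data =~ /Business(.*?)(?<!in\s)Item\s3/s;

print "lookahead:  [$with_lookahead]\n";
print "lookbehind: [$with_lookbehind]\n";
```

      Several lookbehinds can be chained, for example (?<!in\s)(?<!see\s), but this only covers the phrasings you list, so the reliability concerns raised elsewhere in this thread still apply.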

Re: Help with negative look ahead
by LanX (Saint) on Oct 18, 2012 at 20:24 UTC
    The chances of getting helpful answers are much higher if you help us read your post.

    Please use markup for formatting.

    Furthermore, you could make your regex much clearer by using the x flag, adding comments, and factoring parts out into well-named variables, something like this:

        $item        = '(?:Item|ITEM)[Ss]?';
        $punctuation = '(?:\.|\-|\:|\-|\,)';

        m/ $item \s? $punctuation ... /x;

    Don't be too surprised if you can answer your question on your own after refactoring the code! ;-)
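    Carried a little further, that refactoring might look like the sketch below. It is a simplified illustration, not the original regex: the variable names, the reduced set of title words, and the sample data are all assumptions, and the end boundary is additionally anchored to the start of a line so that an "Item 3" mentioned mid-sentence is not taken for a heading.

```perl
use strict;
use warnings;

# Named building blocks (illustrative names, simplified alternatives).
my $item        = qr/item s?/xi;
my $punctuation = qr/[.,:-]/;
my $title       = qr/(?: business | company )/xi;
my $next_number = qr/(?: 1A | 1B | 2 | 3 )/xi;

my $item1 = qr/
    ^ \s* $item \s* $punctuation? \s*
    (?: 1 | I ) \s* $punctuation* \s*     # heading such as "Item 1." or "ITEM I:"
    $title \s* $punctuation?
    (.*?)                                 # section body, as short as possible
    ^ \s* $item \s* $punctuation? \s*
    $next_number \b                       # next heading, only at a line start
/xmsi;

# Made-up sample data.
my $data = "ITEM 1. BUSINESS\n"
         . "We sell widgets, as discussed more thoroughly in Item 3.\n"
         . "More body.\n"
         . "ITEM 2. PROPERTIES\n"
         . "Offices.\n";

my ($body) = $data =~ $item1;
print $body;
```

    Each qr// piece keeps its own flags, so the parts can be tested one at a time before they are combined.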

    Cheers Rolf

Re: Help with negative look ahead
by Kenosis (Priest) on Oct 18, 2012 at 23:38 UTC

    Hi, eversuhoshin!

    Section items are all capitalized, i.e., "ITEM," but your regex is matching both "ITEM" and "item." The following works to capture only the ITEM I: BUSINESS text from the data you provided:

        use strict;
        use warnings;
        use File::Slurp qw/read_file/;

        my $text = read_file 'a5927574.txt';

        my ($businessItemText) =
            $text =~ /(ITEM [\dA-Z]+?[: -]+BUSINESS.+?)ITEM [\dA-Z]+?[: -]+/s;

        print $businessItemText;

    Output:

        ITEM I: BUSINESS
        ----------------
        Littlefield Corporation develops, owns and operates charitable bingo halls, and owns and operates an event rental company. In our Entertainment division, we operate 37 charitable bingo halls in Texas, Alabama, Florida and South Carolina.
        ...
        are with Littlefield Hospitality and twelve (12) are at corporate headquarters in Austin, Texas. Littlefield Entertainment consists of sixteen (16) full time employees and nineteen (19) part time employees. Littlefield Hospitality consists of thirty-two (32) full time employees and one part time employee.

    Hope this helps!

Node Type: perlquestion [id://999798]
Approved by Corion