Help with negative lookahead

eversuhoshin has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have been struggling for a while with a negative lookahead regex. I am trying to extract certain excerpts from financial statements. Specifically, I would like to extract "Item 1. Business." Here is the file link: http://sec.gov/Archives/edgar/data/931683/000115752309002434/a5927574.txt I extract the Item 1 section using boundaries that start at "Item 1" and end at "Item 1A", "Item 2", or "Item 3". Unfortunately, the regex does not extract the whole of Item 1, because it matches on an "Item 3" mentioned inside the excerpt; specifically, it stops at "discussed more thoroughly in Item 3". I have tried using a negative lookahead to make it match all the way to the real end, but I can't get my code to work. Here is my code.

if($data=~m/(?:Item|ITEM)[Ss]?\s?(?:\.|\-|\:|\-|\,)?\s?(?:1|I)\s?(?:\.|\-|\:|\,|\-\-)?
  \s?(?:Description|DESCRIPTION|Discussion|DISCUSSION)?\s?(?:[Oo]f|OF)?\s?(?:[Tt][Hh][Ee])?
  \s?(?:Our|OUR)?\s?(?:Busine\s?ss|BUSINE\s?SS|Company|COMPANY)\s?(?:\.|\-|\:|\-\-|\,)?
  (.*?)
  (?!discussed\s?more\s?in\s?|set\s?forth\s?in\s?|see\s?)
  (?:Item|ITEM)\s?(?:\.|\-|\:|\-\-|\,)?\s?(?:I|1A|1B|2|3)\s?(?:\.|\-|\:|\-\|\,)?/xisg)

It would be great if you could help me figure out a way to make the regex match only the real end (the beginning of Item 3), not the words "Item 3" inside the text of Item 1. Thank you so much!


Re: Help with negative lookahead by ELISHEVA (Prior) on Oct 19, 2012 at 07:03 UTC
How stable is the format of this file? Have you done any statistical analysis to test your assumptions? For instance, are section headings always left aligned? Always in caps, as in the sample file? There is variability in the dividers between item number and section title (sometimes a colon and sometimes a hyphen). Is this the only variability?
You mention that sometimes section 3 is found within section 1. Do you mean that section 1 is interrupted by section 3 and then resumes? Or that section 3 immediately follows section 1? If section 1 resumes, how do you know, as a human reader, that you have moved from the end of section 3 back to the remainder of section 1?
In general, using regexes on natural-language documents to identify the boundaries of semantic chunks is not very reliable. Regexes are the textual equivalent of hearing sentences in a language you don't know. As a listener you can tell that certain sound sequences occur, but if you hear them in two places you have no way of knowing whether both are part of a noun, or one is part of a verb and the other part of a noun. And even if both turn out to be part of a noun, you don't know whether they mean the same thing, because nouns can have more than one meaning.
Using regexes sometimes works if you have a rigid document format and no possibility that the markers of section boundaries can occur elsewhere in the document with different meanings and uses. For example, suppose the SEC will only accept documents where (a) section titles are always marked by the word "ITEM" (all caps) followed by the section title, (b) titles never cross line boundaries and are limited to a specific set of values, (c) the next line is always a series of hyphens, and (d) the number of hyphens equals the number of characters in item + title. It would be highly unlikely that such a sequence would appear naturally as part of the regular text of a section. You could then use such a structure to chunk the text.
On the other hand, if "item" can be lower or upper case and there is no SEC-mandated format for titles, then you do have a problem, because there are many uses of the word "item" even in your sample text. Even if it were true that titles are always left aligned, that wouldn't be enough to pick out the section headings. Since section content is also left aligned, there is a significant possibility that "item" as part of the content text will be left aligned in at least some of the SEC files. You'd have to do statistical analysis on the rate of false matches, i.e. compare your algorithm's extraction to a human reader's extraction. Then you would have to check with your client whether that rate is acceptable. If your client thinks there are too many false matches, you'll need some mechanism to disambiguate between the different contextual uses of "item", and you may need to look into setting up some sort of Bayesian filter and training corpus.
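As a rough illustration of the rigid-format case described above, here is a minimal sketch. It assumes the hypothetical rules (a) to (d) actually hold, which nothing in this thread confirms for real SEC filings; the `split_sections` helper and the sample text are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: a line counts as a heading only if it starts with
# "ITEM " AND the next line is a run of hyphens of exactly the same length.
sub split_sections {
    my ($text) = @_;
    my @lines = split /\n/, $text;
    my (%sections, $current);
    my $i = 0;
    while ($i <= $#lines) {
        my $line = $lines[$i];
        my $next = $i < $#lines ? $lines[$i + 1] : '';
        if ($line =~ /^ITEM \S/ && $next =~ /^-+$/ && length($next) == length($line)) {
            $current = $line;           # start a new section
            $sections{$current} = '';
            $i += 2;                    # skip the heading and its underline
            next;
        }
        # Text before the first heading is ignored.
        $sections{$current} .= "$line\n" if defined $current;
        $i++;
    }
    return \%sections;
}

# Made-up sample: the fourth line starts with "ITEM 3" but has no underline.
my $sample = <<'END';
ITEM 1: BUSINESS
----------------
We run bingo halls. Legal matters are discussed in
ITEM 3 below, not here.
ITEM 3: LEGAL PROCEEDINGS
-------------------------
None.
END

my $sections = split_sections($sample);
print $sections->{'ITEM 1: BUSINESS'};
```

The wrapped line that happens to begin with "ITEM 3" stays inside section 1, because it is not followed by a matching row of hyphens. If real files break any of the assumed rules, this check will silently miss headings.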
Re: Help with negative lookahead by PrakashK (Pilgrim) on Oct 18, 2012 at 19:32 UTC
In your particular case, regexes seem to be overkill. Since the text has "ITEM " only in headings and nowhere else, and that seems to be where you want to split, it's easier to use `split`:
    use File::Slurp qw<slurp>;
    my $text = slurp("a5927574.txt");
    my @items = split "ITEM ", $text;
    # print first item, as it is
    print shift @items;
    # prepend "ITEM " to the rest and print
    print "ITEM $_" for @items;
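A possible variant of the same idea, offered only as a sketch: splitting on a zero-width lookahead keeps each heading attached to its chunk, and anchoring to the start of a line skips an "ITEM 3" that appears mid-sentence. That headings always begin a line is an assumption, not something established for the whole data set, and the sample text is invented.

```perl
use strict;
use warnings;

# Made-up sample text standing in for the real filing.
my $text = <<'END';
Preamble text.
ITEM 1: BUSINESS
Risks are discussed more thoroughly in ITEM 3.
ITEM 3: LEGAL PROCEEDINGS
None.
END

# Split before every "ITEM " that begins a line. The lookahead consumes
# nothing, so "ITEM " does not have to be prepended again afterwards.
my @chunks = split /^(?=ITEM )/m, $text;

print "--- chunk ---\n$_" for @chunks;
```

Adding the `/i` flag would also accept "Item " at the start of a line, at the cost of more false matches on wrapped prose.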
Re^2: Help with negative lookahead by davido (Cardinal) on Oct 19, 2012 at 06:57 UTC
It seems likely, looking at his regex, that he isn't sure "ITEM" is capitalized uniformly across his full data set. The sample he provided us is uniform, but why else would he go to all the trouble of creating alternations like `(?:Item|ITEM)` several times? If he can't depend on an all-caps "ITEM" as a delimiter, your solution won't be any more robust than his current one.
Dave
Re^2: Help with negative lookahead by eversuhoshin (Sexton) on Oct 18, 2012 at 19:55 UTC
Hey PrakashK, thanks for the quick reply. The issue is that I still don't know whether I matched the real end of Item 1. Also, I have a bunch of other SEC files that mention Item 3 inside of Item 1. This is the hardest part, because I am matching almost pure text. The ones with HTML I was able to match more easily. Anyway, thank you for your reply!
Re: Help with negative lookahead by LanX (Saint) on Oct 18, 2012 at 20:24 UTC
The chances of getting helpful answers are much higher if you help us read your post. Please use markup for formatting. Furthermore, you could make your regex much clearer by using the x-flag, adding comments, and moving parts into well-named variables, something like this:
    $item = '(?:Item|ITEM)[Ss]?';
    $punctuation = '(?:\.|\-|\:|\-|\,)';
    m/ $item \s? $punctuation ... /x;
Don't be too surprised if you can answer your question on your own after refactoring the code! ;-)
Cheers Rolf
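To make the refactoring suggestion concrete, here is one way it might look with `qr//` objects. All variable names and the sample string are illustrative, and the pattern is deliberately simplified compared with the original. Since the original regex already carries the `/i` flag, alternations such as `(?:Item|ITEM)` are redundant and can collapse to a single spelling.

```perl
use strict;
use warnings;

# Named building blocks; each qr// keeps its own flags when embedded.
my $item  = qr/item s?/xi;                 # "Item", "ITEM", "Items", ...
my $punct = qr/(?: -- | [.\-:,] )/x;       # "--" or one of . - : ,
my $start = qr/$item \s? $punct? \s? (?:1|I) \s? $punct? \s? business/xi;
my $end   = qr/$item \s? $punct? \s? (?:1A|1B|2|3) \s? $punct?/xi;

# Made-up sample input.
my $data = "Item 1. Business\nWe make widgets.\nItem 2. Properties\nNone.\n";

# Capture everything between the start heading and the next end heading.
my ($body) = $data =~ / $start (.*?) $end /xs;
print "Captured: $body" if defined $body;
```

This only tidies the pattern. It still stops at the first "Item 3" it meets, including one mentioned inside a sentence, so the boundary problem from the question remains.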
Re: Help with negative lookahead by Kenosis (Priest) on Oct 18, 2012 at 23:38 UTC
Hi, eversuhoshin! Section headings are all capitalized, i.e., "ITEM", but your regex matches both "ITEM" and "item". The following works to capture only the ITEM I: BUSINESS text from the data you provided:
    use strict;
    use warnings;
    use File::Slurp qw/read_file/;
    my $text = read_file 'a5927574.txt';
    my ($businessItemText) = $text =~ /(ITEM [\dA-Z]+?[: -]+BUSINESS.+?)ITEM [\dA-Z]+?[: -]+/s;
    print $businessItemText;
Output:
    ITEM I: BUSINESS
    ----------------
    Littlefield Corporation develops, owns and operates charitable bingo halls, and owns and operates an event rental company. In our Entertainment division, we operate 37 charitable bingo halls in Texas, Alabama, Florida and South Carolina.
    ...
    are with Littlefield Hospitality and twelve (12) are at corporate headquarters in Austin, Texas. Littlefield Entertainment consists of sixteen (16) full time employees and nineteen (19) part time employees. Littlefield Hospitality consists of thirty-two (32) full time employees and one part time employee.
Hope this helps!
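For files where an all-caps "ITEM 3" also appears inside the Item 1 text, one option is to accept a heading only at the start of a line. The following is a sketch under that assumption, which has not been checked against the full data set; the sample text and the variable names `$heading` and `$sep` are invented for illustration.

```perl
use strict;
use warnings;

# Made-up sample with two in-sentence references to Item 3.
my $text = <<'END';
ITEM 1: BUSINESS
We operate bingo halls. Litigation is discussed more thoroughly in Item 3.
See also ITEM 3 for details.
ITEM 3: LEGAL PROCEEDINGS
None.
END

# A heading is "ITEM <number>" at the start of a line (leading blanks allowed).
my $heading = qr/^[ \t]*ITEM[ \t]+[\dA-Z]+\b/m;
my $sep     = qr/[: -]+/;   # divider between item number and title

# Capture from the BUSINESS heading up to (not including) the next heading,
# or to the end of the text if no further heading exists.
my ($business) = $text =~ /( $heading $sep BUSINESS .*? ) (?= $heading | \z )/xs;
print $business if defined $business;
```

A wrapped line that happens to begin with "ITEM 3" would still be taken for a heading, so the false-match rate needs checking against real files.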

