Dear Monks,
I have been struggling for a while with negative lookahead in regexes. I am trying to extract certain sections from financial statements. Specifically, I would like to extract "Item 1. Business."
Here is a link to the file:
http://sec.gov/Archives/edgar/data/931683/000115752309002434/a5927574.txt
I extract the Item 1 section by matching from "Item 1" up to the next heading, which can be "Item 1A", "Item 2", or "Item 3". Unfortunately, the regex does not capture the whole of Item 1, because it stops at an "Item 3" that is mentioned inside the section text itself; specifically, it stops at "discussed more thoroughly in Item 3".
I have tried using a negative lookahead to make it match all the way to the real end of the section, but I can't get my code to work.
Here is my code:
if($data=~m/
    (?:Item|ITEM)[Ss]?\s?(?:\.|\-|\:|\-|\,)?\s?(?:1|I)\s?(?:\.|\-|\:|\,|\-\-)?
    \s?(?:Description|DESCRIPTION|Discussion|DISCUSSION)?
    \s?(?:[Oo]f|OF)?\s?(?:[Tt][Hh][Ee])?\s?(?:Our|OUR)?
    \s?(?:Busine\s?ss|BUSINE\s?SS|Company|COMPANY)\s?(?:\.|\-|\:|\-\-|\,)?
    (.*?)
    (?!discussed\s?more\s?in\s?|set\s?forth\s?in\s?|see\s?)
    (?:Item|ITEM)\s?(?:\.|\-|\:|\-\-|\,)?\s?(?:I|1A|1B|2|3)\s?(?:\.|\-|\:|\-\|\,)?
/xisg)
It would be great if you could help me figure out a way to make the regex stop only at the real end of the section (the beginning of the Item 3 section), not at the words "Item 3" inside the text of Item 1.
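To make the problem easier to reproduce without downloading the filing, here is a stripped-down sketch. The sample text is made up (it is not taken from the real document), and both patterns are heavily simplified stand-ins for the regex above, so treat it as an illustration of the symptom only.

```perl
use strict;
use warnings;

# Made-up sample text standing in for the filing (NOT the real document).
my $data = <<'END';
Item 1. Business.
We make widgets. Legal matters are discussed more thoroughly in Item 3 below.
We also sell gadgets.
Item 2. Properties.
We own one factory.
END

# Simplified boundary match: capture lazily from the "Item 1" heading
# up to the first later "Item 1A/1B/2/3", wherever it occurs.
my ($section) =
    $data =~ /Item\s?1\.\s?Business\.(.*?)Item\s?(?:1A|1B|2|3)/is;

# Simplified version of my lookahead attempt, placed in the same spot
# as in the full regex (right before the closing "Item").
my ($with_lookahead) =
    $data =~ /Item\s?1\.\s?Business\.(.*?)(?!discussed\s?more\s?thoroughly\s?in\s?)Item\s?(?:1A|1B|2|3)/is;

# Both captures end at "discussed more thoroughly in ", so the
# sentence about gadgets is lost in each case.
print "Without lookahead:$section\n";
print "With lookahead:$with_lookahead\n";
```

In this sample both captures stop at the inline reference to Item 3 instead of continuing to the "Item 2" heading, which is the same behaviour I see on the real file.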
Thank you so much!