eversuhoshin has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks,

I have been struggling for a while with a negative lookahead in a regex. I am trying to extract certain excerpts from financial statements; specifically, I would like to extract "Item 1. Business." Here is the file: http://sec.gov/Archives/edgar/data/931683/000115752309002434/a5927574.txt

I extract the Item 1 section using boundaries that run from "Item 1" to "Item 1A," "Item 2," or "Item 3." Unfortunately, the regex does not extract the whole of Item 1, because it matches an "Item 3" mentioned inside the excerpt: it stops at "discussed more thoroughly in Item 3." I have tried using a negative lookahead to make it match all the way to the real end, but I can't get my code to work. Here is my code.
if($data=~m/(?:Item|ITEM)[Ss]?\s?(?:\.|\-|\:|\-|\,)?\s?(?:1|I)\s?(?:\.|\-|\:|\,|\-\-)?\s?(?:Description|DESCRIPTION|Discussion|DISCUSSION)?\s?(?:[Oo]f|OF)?\s?(?:[Tt][Hh][Ee])?\s?(?:Our|OUR)?\s?(?:Busine\s?ss|BUSINE\s?SS|Company|COMPANY)\s?(?:\.|\-|\:|\-\-|\,)?(.*?)(?!discussed\s?more\s?in\s?|set\s?forth\s?in\s?|see\s?)(?:Item|ITEM)\s?(?:\.|\-|\:|\-\-|\,)?\s?(?:I|1A|1B|2|3)\s?(?:\.|\-|\:|\-\|\,)?/xisg)
It would be great if you could help me figure out a way to make the regex match only the real end (the beginning of Item 3), not the words "Item 3" inside the text of Item 1. Thank you so much!
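
For readers following along, here is a minimal sketch of two ways the end boundary might be tightened. A `(?!...)` placed directly before `Item` looks forward from the very position where `Item` must match, so it can never reject anything; the words to exclude come before `Item`, which suggests a lookbehind instead. The sample text, the simplified heading patterns, and the variable names `$lookbehind` and `$line_anchor` are assumptions made for illustration, not taken from the filing. Separate fixed-width lookbehinds are used rather than one alternation, since older Perls reject variable-length lookbehind.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the filing text; the real file is much longer.
my $data = <<'END';
Item 1. Business.
We make widgets. Legal matters are discussed more thoroughly in Item 3 below.
We also sell gadgets.
Item 1A. Risk Factors.
Widgets may break.
END

# Option 1: fixed-width negative lookbehinds right before the closing "Item".
# Each (?<!...) inspects the characters immediately preceding "Item".
my $lookbehind = qr/
    Item \s+ 1 \. \s* Business \.?   # opening heading (simplified for the sketch)
    (.*?)                            # section body, non-greedy
    (?<! \bin \s )                   # closing "Item" must not follow "in "
    (?<! \bsee \s )                  # ... nor "see "
    Item \s+ (?:1A|1B|2|3) \b        # closing heading
/xis;

# Option 2: require both headings to begin a line (the m flag makes ^ match per line).
my $line_anchor = qr/
    ^ \s* Item \s+ 1 \. \s* Business \.?
    (.*?)
    ^ \s* Item \s+ (?:1A|1B|2|3) \b
/xism;

for my $re ($lookbehind, $line_anchor) {
    print "---\n$1\n" if $data =~ $re;
}
```

On this sample both patterns skip the in-sentence "Item 3" and stop at the "Item 1A" heading. Neither is guaranteed to be robust on real filings: a cross-reference may be worded differently from the phrases excluded here, and hard-wrapped text can push an in-sentence "Item 3" to the start of a line.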