Re: Extracting Bibliography Citations

A quick and dirty fudge would be:

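use strict;
use warnings;

my $data = '';

while (my $line = <>) {
    $line =~ s/[\r\n]//g;    # strip line endings (CR and/or LF)
    if (length($line) < 10) {
        # very short lines are treated as continuations of the current entry
        $data .= $line;
    }
    elsif ($data =~ m/(?:\s[12]\d{3}\s*\)|\spp\.\s+)/) {
        # the accumulated text already contains "year)" or "pp. ", so treat it
        # as a complete citation: print it and start the next one with this line
        print "$data\n\n";
        $data = $line;
    }
    else {
        $data .= $line;
    }
}
print "$data\n\n" if length $data;    # flush the final entry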

Looks like that catches the other cases
UnderMine

update: added last print to catch final entry

Re^2: Extracting Bibliography Citations by Limbic~Region (Chancellor) on Sep 02, 2008 at 15:27 UTC
UnderMine, Interesting. I had abandoned a similar approach because it is possible to get runaway lines (where the year gets mistranslated or the trailing paren is omitted). I may use a variation on this approach where I use "year) pp" as the first stop point; if I feel like I have gone too far, back up to "pp" only; if that fails, back up to the year; and if that fails, impose a max length on a citation. Cheers - L~R
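One way to read that fallback chain, as a rough sketch only: try the stop-point patterns from strictest to loosest inside a window of at most $max_len characters, and cut at the window edge if none of them fire. The sub name find_stop, the three patterns and the sample text are all made up for illustration; the real citation format may need different regexes.

```perl
use strict;
use warnings;

# Hypothetical stop-point rules, tried from strictest to loosest.
my @STOP_RULES = (
    [ 'year) pp' => qr/\b[12]\d{3}\s*\).*?\bpp\.\s*\d+(?:\s*-\s*\d+)?/s ],
    [ 'pp'       => qr/\bpp\.\s*\d+(?:\s*-\s*\d+)?/ ],
    [ 'year'     => qr/\b[12]\d{3}\b/ ],
);

# Returns (rule name, offset just past the stop point) for the first
# citation in $buf. Only the first $max_len characters are searched, so a
# missing paren or mangled year cannot cause an unbounded runaway.
sub find_stop {
    my ($buf, $max_len) = @_;
    my $window = substr($buf, 0, $max_len);
    for my $rule (@STOP_RULES) {
        my ($name, $re) = @$rule;
        # $+[0] is the end offset of the match that just succeeded
        return ($name, $+[0]) if $window =~ $re;
    }
    return ('max length', length $window);
}

# Sample with the trailing paren missing, so the strictest rule cannot apply
my $buf = 'Smith, J. (1999 Some title, pp. 10-20 Jones, K. (2001) Other';
my ($rule, $end) = find_stop($buf, 200);
print "stopped by '$rule': ", substr($buf, 0, $end), "\n";
```

The caller would then chop the first $end characters off the buffer and repeat. Returning the rule name as well makes it easy to log which citations needed a fallback and eyeball those by hand.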
Re^3: Extracting Bibliography Citations by UnderMine (Friar) on Sep 02, 2008 at 15:40 UTC
Sounds like a double parse is your best bet then, i.e. use something like the above, then trap the exceptional lines and reparse them to split them again. Corruption is giving you the real headache. Why is the data so corrupt? It sounds like you are using PDFs generated by OCR. Have you looked at Tesseract? I have used it when I needed to train a system to handle OCR, and improving the quality of your source data is always an option. UnderMine
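A minimal sketch of what that second pass might look like, assuming the first pass has already produced a list of entries and that an over-long entry is the sign of a runaway. The sub name resplit_long_entries, the length threshold and the "pp. N-M" terminator pattern are assumptions for illustration, not something taken from the real data.

```perl
use strict;
use warnings;

# Second pass: entries no longer than $max_len are kept as they are; longer
# ones are assumed to be several citations run together and are re-split
# after every "pp. N" or "pp. N-M" page range.
sub resplit_long_entries {
    my ($entries, $max_len) = @_;
    my @out;
    for my $entry (@$entries) {
        if (length($entry) <= $max_len) {
            push @out, $entry;
            next;
        }
        # Each iteration grabs text up to and including the next page range,
        # or whatever is left when no further page range exists.
        while ($entry =~ /\G\s*(.*?\bpp\.\s*\d+(?:\s*-\s*\d+)?|.+)/gs) {
            my $piece = $1;    # copy first: the next match resets $1
            push @out, $piece if $piece =~ /\S/;
        }
    }
    return @out;
}

my @first_pass = (
    'Short one 1999) pp. 1-2',
    'A. First (1990) T1, pp. 1-5 B. Second 1991 T2, pp. 6-9 C. tail',
);
print "$_\n\n" for resplit_long_entries(\@first_pass, 40);
```

Anything that is still too long after this pass (no page range at all, for example) is probably best written to a separate file for manual checking rather than guessed at.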
Re^4: Extracting Bibliography Citations by Limbic~Region (Chancellor) on Sep 02, 2008 at 17:01 UTC
UnderMine, Actually, I had given up on PDF::OCR because it wouldn't build on my test platform (Win32); see Re: Extracting content text from PDFs for details. If there are seriously better solutions out there, I will give them a whirl - thanks. Unfortunately, the PDFs are what I have to work with and not originals. Cheers - L~R

