PerlMonks  

Re: Extracting Bibliography Citations

by UnderMine (Friar)
on Sep 02, 2008 at 15:15 UTC


in reply to Extracting Bibliography Citations

A quick and dirty fudge would be :-
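    my $data = '';
    while (<>) {
        my $line = $_;
        $line =~ s/[\r\n]//g;              # strip line endings
        if (length($line) < 10) {
            $data .= $line;                # short fragment: treat as a continuation
        }
        elsif ($data =~ m/(?:\s[12]\d{3}\s*\)|\spp\.\s+)/) {
            print "$data\n\n";             # buffer already holds "year)" or "pp. "
            $data = $line;
        }
        else {
            $data .= $line;
        }
    }
    print "$data\n\n";                     # flush the final entry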
Looks like that catches the other cases
UnderMine

update: added last print to catch final entry

Replies are listed 'Best First'.
Re^2: Extracting Bibliography Citations
by Limbic~Region (Chancellor) on Sep 02, 2008 at 15:27 UTC
    UnderMine,
    Interesting. I had abandoned a similar approach because it is possible to get runaway lines (where the year gets mistranslated or the trailing paren is omitted). I may use a variation on this approach: use "year) pp" as the first stop point; if I feel I have gone too far, back up to "pp" only; if that fails, back up to the year; and if that fails, impose a maximum length on a citation.
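
A rough sketch of that fallback chain, for illustration only. The helper name find_cut, the page-range pattern, and the use of the length cap as the "gone too far" test are all assumptions, not anything from the real data:

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset at which the first citation in
# $buf should end, trying progressively weaker stop points. Anything past
# $max_len counts as "gone too far", so only that window is searched.
sub find_cut {
    my ($buf, $max_len) = @_;
    my $window = substr($buf, 0, $max_len);

    my $pages = qr/pp\.?\s*\d+(?:\s*-\s*\d+)?/;   # assumed "pp. N-M" shape
    my @stops = (
        qr/\s[12]\d{3}\s*\)\s*$pages/,   # 1. "year) pp. N-M"
        qr/\s$pages/,                    # 2. "pp. N-M" only
        qr/\s[12]\d{3}\s*\)/,            # 3. "year)" only
    );
    for my $re (@stops) {
        return $+[0] if $window =~ $re;  # $+[0] is the end of the match
    }
    return length $window;               # 4. nothing matched: use the cap
}

my $buf = 'Smith, J. A title. J. Foo 12 (Jan 1998 pp. 10-20 Jones, K. Another.';
my $cut = find_cut($buf, 60);
print substr($buf, 0, $cut), "\n";
```

In the sample buffer the closing paren is missing, so stop point one fails inside the window and the cut falls back to "pp" only.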

    Cheers - L~R

      Sounds like a double parse is your best bet then, i.e. use something like the above, then trap the exceptional lines and reparse them to split them again.
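
Something along these lines, as a minimal sketch. The sub name reparse, the 300-character limit, and the page-range pattern are assumptions; "exceptional" here just means more than one "year)" marker or an over-long entry:

```perl
use strict;
use warnings;

my $MARK  = qr/\s[12]\d{3}\s*\)/;               # same "year)" marker as the first pass
my $PAGES = qr/pp\.?\s*\d+(?:\s*-\s*\d+)?/;     # assumed "pp. N-M" shape
my $MAX   = 300;                                # assumed upper bound on one citation

# Second pass: pass normal entries through, split suspicious ones again.
sub reparse {
    my @in = @_;                                # copy so pos() never touches the caller's data
    my @out;
    for my $entry (@in) {
        my $marks = () = $entry =~ /$MARK/g;    # count the year markers
        if ($marks <= 1 && length($entry) <= $MAX) {
            push @out, $entry;                  # looks like a single citation
            next;
        }
        pos($entry) = 0;
        # Cut after each marker, keeping a page range that follows it.
        while ($entry =~ /\G\s*(.*?$MARK(?:\s*$PAGES)?)/gc) {
            push @out, $1;
        }
        # Whatever is left has no marker; keep it rather than lose it.
        my $rest = substr($entry, pos($entry) // 0);
        $rest =~ s/^\s+//;
        push @out, $rest if length $rest;
    }
    return @out;
}

print "$_\n\n" for reparse(
    'A. One. J 1 (Jan 1998) pp. 1-2 B. Two. K 2 (Mar 2001) pp. 3-4'
);
```

An over-long entry with no marker at all comes back unchanged, so it would still need a separate check.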

      Corruption is giving you the real headache. Why is the data so corrupt? It sounds like you are using PDFs generated by OCR. Have you looked at Tesseract? I have used it when I needed to train a system to handle OCR. Improving the quality of your source data is always an option.

      UnderMine
        UnderMine,
        Actually, I had given up on PDF::OCR because it wouldn't build on my test platform (Win32); see Re: Extracting content text from PDFs for details. If there are seriously better solutions out there, I will give them a whirl - thanks. Unfortunately, the PDFs are what I have to work with, not the originals.

        Cheers - L~R
