PerlMonks
All These Files - Am I Thinking About This Right?

by vacant (Pilgrim)
on Apr 04, 2019 at 03:24 UTC

vacant has asked for the wisdom of the Perl Monks concerning the following question:

Greetings to the esteemed monks. I'm back after a decade, writing perl code again. Woe is me.

I hope this isn't too general a question, but I'm trying to avoid rediscovering fire.

I have about 10,000 files, most of them image-only scans of printed documents in .pdf format, and I need to do two things. First, I need to build a helper program that partially or entirely concocts appropriate file names based on some of the content, to suggest to a user who will either make changes or accept the Perl-generated results. Then I have to put all these files on a password-protected web site and make the whole mess searchable. Here is my best guess at how to do these things.

1. Extract the image from each file using ImageMagick, then turn it into a separate, but linked, text file using Tesseract to perform OCR.
2. Now, I can use the text file as input to my renaming assistant which will use regular expressions to identify keywords.
3. Then, I can store the OCR text and the linked original image in a MySQL database on the web site, and use SQL commands to do string searches as users request in a HTML search box.
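A minimal sketch of step 2, the renaming assistant: the regexes and the keyword heuristic below are made-up placeholders, not tuned to any real document set, and would need adjusting to your own scans.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: suggest a filename from OCR'd text.
# The date pattern and "capitalized word" heuristic are illustrative only.
sub suggest_filename {
    my ($ocr_text) = @_;
    my @parts;

    # Pull a date like "04/04/2019" or "04-04-2019" if one is present.
    if ($ocr_text =~ m{\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b}) {
        push @parts, sprintf '%04d-%02d-%02d', $3, $1, $2;
    }

    # Take the first few capitalized words as a crude subject guess.
    my @words = $ocr_text =~ /\b([A-Z][a-z]{3,})\b/g;
    push @parts, @words[0 .. ($#words < 2 ? $#words : 2)] if @words;

    return @parts ? join('_', @parts) . '.pdf' : undef;
}
```

The caller would present the suggestion to the user for acceptance or editing, rather than renaming anything automatically.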

I can write the perl code all right, but I'm not sure if this is the best, or right, way to set up the project. Is there a better approach?

Oh, and I have looked at several online custom search vendors, but I am sad to discover that the need for security of the data, and those vendors' inability to search password-protected data, probably rules out that approach.

Replies are listed 'Best First'.
Re: All These Files - Am I Thinking About This Right?
by marto (Cardinal) on Apr 04, 2019 at 10:03 UTC

    One of the more interesting things on my $work 'to do' list is something very similar, but for a much higher number of documents. I plan to investigate using Elasticsearch, combined with a Mojolicious front end for querying/displaying results. I'm only at the stage of initial investigation into Elasticsearch, and this is to replace a legacy solution which is showing its age.

Re: All These Files - Am I Thinking About This Right?
by bliako (Monsignor) on Apr 04, 2019 at 12:52 UTC

    The weakest link is OCR. But if you are only interested in keywords (as opposed to the complete text), then even if OCR's output is incomplete, there are probabilistic methods to complete (and even validate) the OCR'd keyword. If you want to adjust these methods to your context, you need to manually convert some representative set of documents to text (or manually correct OCR's output for those documents only) and feed that to your methods. That assumes (enough) documents belong to a single context, e.g. legal or spy reports, I guess.
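    One simple way to realize this keyword-validation idea is edit distance against a known keyword list. A minimal sketch, assuming a hand-built keyword list and an arbitrary distance threshold (both hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Classic Levenshtein edit distance via dynamic programming.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i-1, 1) eq substr($t, $j-1, 1) ? 0 : 1;
            push @cur, min($cur[$j-1] + 1, $prev[$j] + 1, $prev[$j-1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Return the closest known keyword, or undef if nothing is within $max edits.
sub correct_keyword {
    my ($word, $keywords, $max) = @_;
    my ($best, $best_d);
    for my $k (@$keywords) {
        my $d = levenshtein(lc $word, lc $k);
        ($best, $best_d) = ($k, $d) if !defined $best_d || $d < $best_d;
    }
    return (defined $best_d && $best_d <= $max) ? $best : undef;
}
```

    So an OCR misread like "C0ntract" can be snapped back to "Contract" when the vocabulary for the document set is known.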

    Once you have the document text, there are various open-source search frameworks to use, as marto mentions, and it will be free-wheeling from there on.

    What I would not do is form the filename from keywords. I would rather give each file a unique numeric id, then use your already-implemented search engine to search. If your documents are already indexed on some keywords, e.g. Report 5,5/12/12,ABC.vs.XYZ, then, optionally, process those and insert them into the DB too, to enhance your search engine.
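    The id-instead-of-keyword-filename scheme could be sketched like this; the counter, the table, and the zero-padded format are all arbitrary choices for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: each document gets a sequential numeric id as its
# on-disk filename, while the descriptive title from the renaming assistant
# is kept in a lookup table (which would live in the database alongside the
# OCR text, not in memory as here).
my $next_id = 1;
my %title_for;    # id => human-readable title

sub register_document {
    my ($suggested_title) = @_;
    my $id = sprintf '%06d', $next_id++;
    $title_for{$id} = $suggested_title;
    return "$id.pdf";    # the on-disk filename
}
```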

      "...Once you have the document text, there are various open-source search frameworks to use..."

      Elasticsearch has plugins such as fscrawler to deal with all that for you.

        Cool then. It uses Tesseract too. Though I do not know how effective an automated solution will be, as opposed to manually tuning or re-training Tesseract for scanned, old documents.

Re: All These Files - Am I Thinking About This Right?
by karlgoethebier (Abbot) on Apr 04, 2019 at 15:50 UTC

    BTW. Don't know if it's still alive and well. Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

Re: All These Files - Am I Thinking About This Right?
by Jenda (Abbot) on Apr 07, 2019 at 10:17 UTC

    I don't think MySQL is the right kind of database for this. I'd stuff the texts in Solr or some other full-text search engine.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: All These Files - Am I Thinking About This Right?
by jimpudar (Pilgrim) on Apr 07, 2019 at 20:11 UTC

    Honestly, with only 10k files I'd probably do steps 1 and 2 and then use ripgrep or the silver searcher to do the string searches. If that ends up being too slow, you could use a bunch of different already mentioned tools to speed up the process.

    πάντων χρημάτων μέτρον έστιν άνθρωπος.

Node Type: perlquestion [id://1232132]
Approved by GrandFather
Front-paged by marto