PerlMonks  

PPT to TXT Pure Perl

by Takamoto (Monk)
on Jan 28, 2019 at 11:39 UTC ( [id://1229059]=perlquestion )

Takamoto has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

Before I reinvent the wheel, I want to ask whether you know of a robust module or procedure for converting PowerPoint to text in pure Perl (i.e. no OLE, etc.). Python and the like have some quite robust modules for this. For the moment I have come up with the following, which gives me a text file divided into slides with basic formatting (it just joins the text runs into sentences). However, PowerPoint allows a lot of possible formatting, I guess. Before I start studying the format's documentation and try to come up with something more general than my script (it doesn't take lists into consideration, for example, and who knows how many other things), I want to ask for your opinion.

use strict;
use warnings;
use utf8;
use Archive::Zip qw( :ERROR_CODES );
use XML::Twig;
use Data::Dumper;

# Slide text may contain non-ASCII characters, so emit UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

my $PathDocument = "myDocument.pptx";
our @textPPT;

# A .pptx file is a zip archive; each slide is stored as ppt/slides/slideN.xml.
my $zip = Archive::Zip->new();
$zip->read( $PathDocument ) == AZ_OK or die "Unable to open Office file\n";

# Single quotes keep the backslash in the pattern.
my @slides = $zip->membersMatching( 'ppt/slides/slide.+\.xml' );

for my $i ( 1 .. scalar @slides ) {
    push @textPPT, "\n\nSLIDE $i\n\n";
    my $content = $zip->contents( "ppt/slides/slide${i}.xml" );
    my $twig = XML::Twig->new(
        #keep_encoding=>1,
        twig_handlers => {
            'a:t'          => \&text_processing,
            'a:endParaRPr' => \&line_processing,
            'w:tab'        => \&tab_processing,
        },
    );
    $twig->parse( $content );
}

my $text = join( "", @textPPT );

#BASIC FORMATTING
$text =~ s/ +/ /g;

print $text;

# Append the text of each run.
sub text_processing {
    my( $twig, $ppttext ) = @_;
    push @textPPT, $ppttext->text();
}

# Start a new line at the end of a paragraph.
sub line_processing {
    my( $twig, $ppttext ) = @_;
    push @textPPT, "\n";
}

# Emit a tab character.
sub tab_processing {
    my( $twig, $ppttext ) = @_;
    push @textPPT, "\t";
}
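
The script above does not yet account for lists. Here is a minimal sketch of one way to approach that, not a definitive implementation. It assumes that each a:p element in a slide is one paragraph and that the lvl attribute of its a:pPr child, when present, gives the list indentation level. The helper name slide_paragraphs and the sample XML are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: returns one string per non-empty <a:p> paragraph,
# prefixed with one tab per list level (assumed to be <a:pPr lvl="N">).
sub slide_paragraphs {
    my ($xml) = @_;
    my @lines;
    my $twig = XML::Twig->new(
        twig_handlers => {
            'a:p' => sub {
                my ( $t, $p ) = @_;
                my $ppr   = $p->first_child('a:pPr');
                my $level = $ppr ? ( $ppr->att('lvl') // 0 ) : 0;
                # Concatenate the text of every run in this paragraph.
                my $text = join '', map { $_->text } $p->descendants('a:t');
                push @lines, ( "\t" x $level ) . $text if length $text;
            },
        },
    );
    $twig->parse($xml);
    return @lines;
}

# Illustrative input only; real slides wrap this in more markup.
my $sample = <<'XML';
<p:txBody xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
          xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
  <a:p><a:r><a:t>Agenda</a:t></a:r></a:p>
  <a:p><a:pPr lvl="1"/><a:r><a:t>First </a:t></a:r><a:r><a:t>point</a:t></a:r></a:p>
  <a:p><a:endParaRPr lang="en-US"/></a:p>
</p:txBody>
XML

print "$_\n" for slide_paragraphs($sample);
```

Collecting text per paragraph, rather than pushing each run as it is seen, keeps the runs of one bullet together and gives a natural place to apply the indentation. Bullet characters and numbering are not reconstructed here.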

Replies are listed 'Best First'.
Re: PPT to TXT Pure Perl
by kschwab (Vicar) on Jan 28, 2019 at 12:56 UTC
    I'm guessing you're talking solely about "PPTX" files, which are XML-based, as opposed to "PPT" files, which use a different, binary format. I haven't tried it, but here's a Perl script that claims to extract text from PPTX files.
Re: PPT to TXT Pure Perl
by harangzsolt33 (Chaplain) on Feb 03, 2019 at 01:11 UTC
    If you are parsing a "PPT" file, I would approach the problem by writing a Perl script that reads the file into a buffer and then scans the buffer for contiguous runs of 6 or more characters drawn only from this set: 0-9, a-z, A-Z, \0, \r, \n, space, comma, period, exclamation point, and question mark. Any character outside this set is filtered out. Likewise, if the scan finds a short word such as "the" on its own, surrounded by junk as in ":the#$", it skips that too, since we're looking for at least 6 characters in a row that fall within the expected set. This would be an easy way to filter out all the binary "trash" that PPT files are filled with. So, I'd start there. Of course, if it's a PPTX file, then you just unzip it and run some type of HTML or XML filter on the contents, and you get the text that way. Easy! ;-)
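
The scan described above can be sketched as follows, using only core Perl. The helper name printable_runs is hypothetical. Two details are assumptions of this sketch and not part of the reply: NUL bytes are dropped from each run on the guess that they are padding from 16-bit text, and a run is kept only if it still has the minimum length afterwards.

```perl
use strict;
use warnings;

# Hypothetical helper: return the runs of at least $min_len "text-like"
# bytes found in a binary buffer, in the order they occur.
sub printable_runs {
    my ( $buffer, $min_len ) = @_;
    $min_len //= 6;
    my @runs;
    # Allowed set from the reply: digits, letters, NUL, CR, LF, space , . ! ?
    while ( $buffer =~ /([0-9A-Za-z\0\r\n ,.!?]{$min_len,})/g ) {
        my $run = $1;
        $run =~ tr/\0//d;    # assumption: NULs are padding, not content
        push @runs, $run if length($run) >= $min_len;
    }
    return @runs;
}

# Illustrative buffer: junk bytes around a NUL-padded string, a short word
# that should be skipped, and a plain ASCII sentence.
my $sample = "\x01\x02H\0e\0l\0l\0o\0 \0W\0o\0r\0l\0d\0\xff\xfethe\x03"
           . "Plain ASCII text.\x04";

print "$_\n" for printable_runs($sample);
```

To run this on a real file, read it in binary mode first, for example with open( my $fh, '<:raw', $path ), and pass the contents as the buffer. Expect false positives and text in arbitrary order; this is a rough filter, not a parser for the PPT format.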
