Takamoto has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks

Before I reinvent the wheel, I want to ask whether you know of a robust module or procedure to convert PowerPoint to text in pure Perl (i.e. no OLE, etc.). Python and similar languages have some quite robust modules for this. For the moment I have come up with the following, which gives me the text, divided into slides, with basic formatting (it just puts the sentences together). However, I guess PowerPoint allows a lot of possible formatting. My script does not take lists into consideration, for example, and who knows how many other things. So before I start studying the format documentation and try to come up with something more general, I want to ask for your opinion.

use strict;
use warnings;
use utf8;
use Archive::Zip qw( :ERROR_CODES );
use XML::Twig;
use Data::Dumper;

my $PathDocument = "myDocument.pptx";
our @textPPT;

my $zip = Archive::Zip->new();
$zip->read( $PathDocument ) == AZ_OK
    or die "Unable to open Office file\n";

# Single quotes keep the backslash, so the dot before "xml" is matched literally.
my @slides = $zip->membersMatching( 'ppt/slides/slide.+\.xml' );

# Note: this assumes the slides are named slide1.xml .. slideN.xml without gaps.
for my $i ( 1 .. scalar @slides ) {
    push @textPPT, "\n\nSLIDE $i\n\n";
    my $content = $zip->contents( "ppt/slides/slide${i}.xml" );
    my $twig = XML::Twig->new(
        #keep_encoding=>1,
        twig_handlers => {
            'a:t'          => \&text_processing,
            'a:endParaRPr' => \&line_processing,
            'w:tab'        => \&tab_processing,
        },
    );
    $twig->parse( $content );
}

my $text = join( "", @textPPT );

#BASIC FORMATTING
$text =~ s/ +/ /g;
print $text;

# Collect the text of each text run.
sub text_processing {
    my ( $twig, $ppttext ) = @_;
    push @textPPT, $ppttext->text();
}

# Start a new line at the end of a paragraph.
sub line_processing {
    my ( $twig, $ppttext ) = @_;
    push @textPPT, "\n";
}

# Emit a tab character for a tab element.
sub tab_processing {
    my ( $twig, $ppttext ) = @_;
    push @textPPT, "\t";
}

Replies are listed 'Best First'.
Re: PPT to TXT Pure Perl
by kschwab (Vicar) on Jan 28, 2019 at 12:56 UTC
    Guessing you're talking solely about "PPTX" files, which are XML based, versus "PPT" files that use some other format. I haven't tried it, but here's a perl script that says it extracts text from pptx files.
Re: PPT to TXT Pure Perl
by harangzsolt33 (Deacon) on Feb 03, 2019 at 01:11 UTC
    If you are parsing a "PPT" file, I would approach the problem by writing a Perl script that reads the file into a buffer and then scans the buffer for contiguous runs of 6 or more characters that include only: 0-9, a-z, A-Z, \0, \r, \n, space, comma, period, exclamation point, and question mark. Any character outside this set is filtered out. So if it finds something like ":the#$", where the word "the" stands alone between other bytes, it skips that too, since we're looking for at least 6 characters next to each other that fall within the expected set. This would be an easy way to filter out all the binary "trash" that PPT files are filled with. So, I'd start there. Of course, if it's a PPTX file, then you just unzip it and run some type of HTML or XML filter on the text, and you get the content that way. Easy! ;-)
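
    The buffer scan described above could be sketched roughly as follows. This is only a heuristic under stated assumptions, not a real PPT parser: the sub name extract_text_runs and the default minimum length are made up for illustration, and the choice to drop \0 bytes before measuring a run's length is an assumption (it lets text stored with a zero byte after each character still count as one run).

```perl
use strict;
use warnings;

# Hypothetical helper: return the runs of "plain" characters found in a
# binary buffer. The allowed set follows the list in the post:
# 0-9 a-z A-Z \0 \r \n space , . ! ?
sub extract_text_runs {
    my ( $buffer, $min_len ) = @_;
    $min_len = 6 unless defined $min_len;

    my @runs;
    while ( $buffer =~ /([0-9A-Za-z\0\r\n ,.!?]+)/g ) {
        my $run = $1;
        $run =~ tr/\0//d;    # assumption: drop NUL bytes before measuring
        push @runs, $run if length($run) >= $min_len;
    }
    return @runs;
}

# Usage: perl script.pl file.ppt
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    binmode $fh;             # read raw bytes, no encoding layer
    my $buffer = do { local $/; <$fh> };
    close $fh;
    print "$_\n" for extract_text_runs($buffer);
}
```

    Expect false positives (font names, internal record labels) and missed text (anything with accented or non-Latin characters), so the output will likely need further cleaning.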