Takamoto has asked for the wisdom of the Perl Monks concerning the following question:
Dear monks
Before I reinvent the wheel, I want to ask whether you know of a robust module or procedure for converting PowerPoint to text in pure Perl (i.e. no OLE, etc.). Python and similar languages have some quite robust modules for this. For the moment I have come up with the following, which gives me the text divided into slides with basic formatting (it just puts the sentences together). However, I guess PowerPoint allows a lot of possible formatting. My script does not take lists into consideration, for example, and who knows how many other things. Before I start studying the format's documentation and try to come up with something more general, I would like your opinion.
    use strict;
    use warnings;
    use utf8;
    use Archive::Zip qw( :ERROR_CODES );
    use XML::Twig;
    use Data::Dumper;

    my $PathDocument = "myDocument.pptx";
    our @textPPT;    # collected text fragments, filled by the handlers below

    my $zip = Archive::Zip->new();
    $zip->read($PathDocument) == AZ_OK
        or die "Unable to open Office file\n";

    # Single quotes keep the backslash, so "\." matches a literal dot.
    my @slides = $zip->membersMatching('ppt/slides/slide.+\.xml');

    for my $i ( 1 .. scalar @slides ) {
        push @textPPT, "\n\nSLIDE $i\n\n";
        my $content = $zip->contents("ppt/slides/slide${i}.xml");
        my $twig = XML::Twig->new(
            # keep_encoding => 1,
            twig_handlers => {
                'a:t'          => \&text_processing,
                'a:endParaRPr' => \&line_processing,
                'w:tab'        => \&tab_processing,
            },
        );
        $twig->parse($content);
    }

    my $text = join( "", @textPPT );

    # BASIC FORMATTING: collapse runs of spaces
    $text =~ s/ +/ /g;
    print $text;

    # Text run: keep its content.
    sub text_processing {
        my ( $twig, $ppttext ) = @_;
        push @textPPT, $ppttext->text();
    }

    # End-of-paragraph marker: start a new line.
    sub line_processing {
        my ( $twig, $ppttext ) = @_;
        push @textPPT, "\n";
    }

    # Tab element: emit a tab character.
    sub tab_processing {
        my ( $twig, $ppttext ) = @_;
        push @textPPT, "\t";
    }
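Since the script above does not handle lists, here is a minimal sketch of one possible way to approach them. It is not a complete converter. It assumes that each paragraph is an `a:p` element, that a list paragraph carries its nesting level in the `lvl` attribute of its `a:pPr` child, and that `a:br` marks a line break inside a paragraph. The helper name `slide_paragraphs` is hypothetical, and bullet characters and numbering are not reproduced, only the indentation.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not part of the original script): returns one string
# per non-empty <a:p> paragraph of a slide's XML, indented with one tab per
# list level.
sub slide_paragraphs {
    my ($xml) = @_;
    my @lines;

    my $twig = XML::Twig->new(
        twig_handlers => {
            'a:p' => sub {
                my ( $t, $p ) = @_;

                # Assumption: the list level, if any, is the "lvl"
                # attribute of the paragraph's <a:pPr> child.
                my $ppr   = $p->first_child('a:pPr');
                my $level = ( $ppr && $ppr->att('lvl') ) || 0;

                # Walk the paragraph in document order, joining text runs
                # and turning <a:br/> into a newline.
                my $text = '';
                for my $elt ( $p->descendants ) {
                    my $tag = $elt->tag;
                    if    ( $tag eq 'a:t' )  { $text .= $elt->text }
                    elsif ( $tag eq 'a:br' ) { $text .= "\n" }
                }

                push @lines, ( "\t" x $level ) . $text if length $text;
            },
        },
    );
    $twig->parse($xml);
    return @lines;
}
```

In the loop of the script above, this could be used as `push @textPPT, map { "$_\n" } slide_paragraphs($content);` in place of the three separate handlers. Handling whole paragraphs rather than individual `a:t` runs keeps the runs of one sentence together and gives a natural place to add further per-paragraph formatting later.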
Replies are listed 'Best First'.
Re: PPT to TXT Pure Perl
by kschwab (Vicar) on Jan 28, 2019 at 12:56 UTC
Re: PPT to TXT Pure Perl
by harangzsolt33 (Deacon) on Feb 03, 2019 at 01:11 UTC
by haukex (Archbishop) on Feb 03, 2019 at 12:15 UTC