Dear monks

I need to extract text from a ppt file, just text without any formatting information. I know that pptx files are zip files and that the text is contained in xml files in ppt/slides/

Before I reinvent the wheel (I think the text data in the xml file may be more complex than I think now), maybe it is a task which as been already accomplished, so I'd like to ask if someone can point me to script snippets on the Web. I've already found this which could be a very good starting point, but it is not only perl.

Any suggestions?

UPDATE

I came up with this (partial) solution. Just Perl, also for Windows:

use Archive::Zip qw( :ERROR_CODES ); use Win32::OLE; my $mainfilepath="test.pptx"; my $zip = Archive::Zip->new(); # read the Word document, that is the ZIP file $zip->read( $mainfilepath ) == AZ_OK or die "Unable to open Office fil +e\n"; my $wfh = $zip->extractMember( 'ppt/slides/slide1.xml' ); # Routine for reading file into $content variable. # Source is Perl Monks: http://www.perlmonks.org/?node_id=1952 # { local $/=undef; open (FILE,"<:utf8","ppt/slides/slide1.xml") || (message_error01()) +; binmode FILE; $content = <FILE>; close FILE; } my $nl = "\n"; # Alternative is "\r\n". my $lindent = " "; # Indent nested lists by "\t", " " etc. my $lwidth = 80; # Line width, used for short line justification. # ToDo: Better list handling. Currently assumed 8 level nesting. my @levchar = ('*', '+', 'o', '-', '**', '++', 'oo', '--'); # # Text extraction starts. # $content =~ s/<?xml .*?\?>(\r)?\n//; $content =~ s{<w:p [^/>]+?/>|</w:p>}|$nl|og; $content =~ s|<w:br/>|$nl|og; $content =~ s|<w:tab/>|\t|og; my $hr = '-' x 78 . $nl; $content =~ s|<w:pBdr>.*?</w:pBdr>|$hr|og; $content =~ s|<w:numPr><w:ilvl w:val="([0-9]+)"/>|$lindent x $1 . "$le +vchar[$1] "|oge; # # Uncomment either of below two lines and comment above line, if deali +ng # with more than 8 level nested lists. # # $content =~ s|<w:numPr><w:ilvl w:val="([0-9]+)"/>|$lindent x $1 . '* + '|oge; # $content =~ s|<w:numPr><w:ilvl w:val="([0-9]+)"/>|'*' x ($1+1) . ' ' +|oge; $content =~ s{<w:caps/>.*?(<w:t>|<w:t [^>]+>)(.*?)</w:t>}/uc $2/oge; $content =~ s{<w:pPr><w:jc w:val="center"/></w:pPr><w:r><w:t>(.*?)</w: +t></w:r>}/cjustify($1)/oge; $content =~ s{<w:pPr><w:jc w:val="right"/></w:pPr><w:r><w:t>(.*?)</w:t +></w:r>}/rjustify($1)/oge; $content =~ s{<w:hyperlink r:id="(.*?)".*?>(.*?)</w:hyperlink>}/hyperl +ink($1,$2)/oge; #$content =~ s/<.*?>//g; $content =~ s/<.*?>/ /g; #Substitute tags with white spaces, in order +to always have white spaces between words # # Convert non-ASCII characters/character sequences to ASCII characters +. # # $content =~ s/\xE2\x82\xAC/\xC8/og; # euro symbol as saved by MSO +ffice $content =~ s/\xE2\x82\xAC/E/og; # euro symbol expressed as E $content =~ s/\xE2\x80\xA6/.../og; $content =~ s/\xE2\x80\xA2/::/og; # four dot diamond symbol $content =~ s/\xE2\x80\x9C/"/og; $content =~ s/\xE2\x80\x99/'/og; $content =~ s/\xE2\x80\x98/'/og; $content =~ s/\xE2\x80\x93/-/og; $content =~ s/\xC2\xA0//og; $content =~ s/&amp;/&/ogi; $content =~ s/&lt;/</ogi; $content =~ s/&gt;/>/ogi; $content =~ s/\s+/ /g;#delete extra white spaces my @row = split(/\n/, $content);

The script extracts the content of the first slide (I must change the initial part to reiterate for any slide) and find a way to insert a separator between two sections (fields) of a slide, as words are now without any white space between them.


In reply to Extracting text from pptx by welle

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.