PerlMonks  

pdf and ppt to text

by sarvan (Sexton)
on Aug 03, 2011 at 10:30 UTC ( [id://918221]=perlquestion )

sarvan has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

I am working with PDFs and PPTs, and I need to extract the text from them in order to find some relevancy.

I need a way to extract the text from both formats. I tried the CPAN module Text::PDF and other modules too, but couldn't get the expected result. The end result I want is the plain text out of the PDF and PPT files.

Can anyone suggest an approach, and also tell me if I am wrong on any point?

Thanks!

Replies are listed 'Best First'.
Re: pdf and ppt to text
by moritz (Cardinal) on Aug 03, 2011 at 11:18 UTC
Re: pdf and ppt to text
by zentara (Archbishop) on Aug 03, 2011 at 11:26 UTC
Re: pdf and ppt to text
by LanX (Saint) on Aug 03, 2011 at 11:49 UTC
    I would try to make the ppt produce a pdf and then process the pdfs.

    You haven't specified what your "expected results" are, so I presume you need not only the text but also positional information.

    So please see "Parsing PDFs by text position?" and the referenced older threads for various approaches.

    Cheers Rolf
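
One way to follow the "ppt to pdf first" route from a script is to shell out to a converter. The sketch below assumes LibreOffice is installed and that its soffice binary is on the PATH; LanX did not name a tool, so that choice and the helper name ppt_to_pdf_command are assumptions. It only builds and runs the conversion command, and leaves the PDF processing to whichever approach from the linked threads fits.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Build the LibreOffice command line (assumption: 'soffice' is on the PATH).
# Returns a reference to the command list and the path where the PDF
# is expected to appear.
sub ppt_to_pdf_command {
    my ($ppt, $outdir) = @_;
    my ($base) = fileparse($ppt, qr/\.[^.]*/);   # file name without suffix
    my $pdf = File::Spec->catfile($outdir, "$base.pdf");
    my @cmd = ('soffice', '--headless', '--convert-to', 'pdf',
               '--outdir', $outdir, $ppt);
    return (\@cmd, $pdf);
}

# Run the conversion only when a file name is given on the command line.
if (@ARGV) {
    my ($cmd, $pdf) = ppt_to_pdf_command($ARGV[0], '.');
    system(@$cmd) == 0 or die "conversion failed: $?";
    print "wrote $pdf\n";
}
```

Passing the command as a list to system() keeps file names with spaces away from the shell.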

Re: pdf and ppt to text
by Khen1950fx (Canon) on Aug 04, 2011 at 09:12 UTC
    To get text from a pdf, I use Text::FromAny.
    To get text from a ppt, I use catppt from catdoc.

    Prerequisites =>

    wish from Tcl
    catppt from catdoc

    Module Prerequisites =>

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CPAN;

    CPAN::Shell->install(qw(
        XML::Twig Archive::Zip File::Temp Time::Local IO::File
        Any::Moose Try::Tiny Text::Extract::Word OpenOffice::OODoc
        File::LibMagic RTF::Parser HTML::FormatText::WithLinks
        CAM::PDF Text::FromAny
    ));

    Once the prereqs are satisfied, run this:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Fetch;
    use Text::FromAny;

    # Download one sample .ppt and one sample .pdf into the current directory.
    my $ff1 = File::Fetch->new(
        uri => 'http://cpansearch.perl.org/src/KARMAN/SWISH-Filter-0.15/t/test.ppt');
    my $ff2 = File::Fetch->new(
        uri => 'http://cpansearch.perl.org/src/KARMAN/SWISH-Filter-0.15/t/test.pdf');
    my $where1 = $ff1->fetch() or die $ff1->error;
    my $where2 = $ff2->fetch() or die $ff2->error;

    # PDF: let Text::FromAny extract the text from the downloaded file.
    my $tFromAny = Text::FromAny->new(file => $where2);
    my $text = $tFromAny->text;
    print $text, "\n";

    # PPT: catppt prints the extracted text to STDOUT.
    system("/usr/local/bin/catppt -lV");
    print "\n";
    system("/usr/local/bin/catppt", $where1);

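
Since system() sends catppt's output straight to the terminal, here is a minimal sketch of capturing it in a variable instead, so the text can be searched or scored afterwards. The helper name capture_output is hypothetical; it uses only core Perl (a list-form pipe open), and the catppt path is the one from the script above and may well differ on your system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run an external command and return everything it writes to STDOUT.
# The list form of open avoids passing the file name through a shell.
sub capture_output {
    my @cmd = @_;
    open(my $fh, '-|', @cmd) or die "cannot run $cmd[0]: $!";
    local $/;                      # slurp mode: read all output at once
    my $text = <$fh>;
    close($fh) or die "$cmd[0] failed (exit status $?)";
    return defined $text ? $text : '';
}

# Example use; the catppt location is an assumption taken from the script above.
if (@ARGV) {
    my $text = capture_output('/usr/local/bin/catppt', $ARGV[0]);
    print length($text), " characters extracted\n";
}
```

The same helper should work for any other command-line extractor that writes to STDOUT.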
      Hi Khen1950fx,

      Thanks for the help. When I run the dependency installation code, the "File::LibMagic" installation seems to fail. So I tried to install it separately, but even then, when I run perl Makefile.PL, it shows an error called "cant include magic.h".

      What is the problem here?
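
A message like that normally means the C header magic.h from libmagic is not installed: File::LibMagic is a wrapper around that library and has to compile against it. On Debian or Ubuntu the header comes with the libmagic-dev package, on Fedora or RHEL with file-devel. A quick check before rerunning perl Makefile.PL is sketched below; the helper name find_header and the list of directories are assumptions, so adjust them for your system.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Return the full path of the first directory that contains the header,
# or undef if none of the directories has it.
sub find_header {
    my ($name, @dirs) = @_;
    for my $dir (@dirs) {
        my $path = File::Spec->catfile($dir, $name);
        return $path if -f $path;
    }
    return;
}

# Common include locations; this list is a guess, not exhaustive.
my @dirs  = qw(/usr/include /usr/local/include /opt/local/include);
my $found = find_header('magic.h', @dirs);
print defined $found
    ? "found $found\n"
    : "magic.h not found in: @dirs\n";
```

If the header turns up in a non-standard directory, the compiler still needs to be told about that directory when building the module.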
