PerlMonks  

pdf and ppt to text

by sarvan (Sexton)
on Aug 03, 2011 at 10:30 UTC (#918221=perlquestion)

sarvan has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

I am working with PDFs and PPTs, and I need to extract the text from them in order to find some relevancy.

I need a way to extract the text from both formats. I tried the CPAN module Text::PDF and other modules too, but couldn't get the expected result. In the end, I just want the text out of the PDF and PPT files.

Can anyone suggest an approach, and also tell me if I am wrong on any point?

Thanks...!

Replies are listed 'Best First'.
Re: pdf and ppt to text
by moritz (Cardinal) on Aug 03, 2011 at 11:18 UTC
Re: pdf and ppt to text
by zentara (Archbishop) on Aug 03, 2011 at 11:26 UTC
Re: pdf and ppt to text
by LanX (Saint) on Aug 03, 2011 at 11:49 UTC
    I would try to make the ppt produce a pdf and then process the pdfs.

    You haven't specified what your "expected results" are, so I presume you need not only the text but also positional information.

    So please see "Parsing PDFs by text position?" and the referenced older threads for various approaches.

    Cheers Rolf

Re: pdf and ppt to text
by Khen1950fx (Canon) on Aug 04, 2011 at 09:12 UTC
    To get text from a pdf, I use Text::FromAny.
    To get text from a ppt, I use catppt from catdoc.

    Prerequisites =>

    wish from Tcl
    catppt from catdoc

    Module Prerequisites =>

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CPAN;

    CPAN::Shell->install(qw(
        XML::Twig
        Archive::Zip
        File::Temp
        Time::Local
        IO::File
        Any::Moose
        Try::Tiny
        Text::Extract::Word
        OpenOffice::OODoc
        File::LibMagic
        RTF::Parser
        HTML::FormatText::WithLinks
        CAM::PDF
        Text::FromAny
    ));

    Once the prereqs are satisfied, run this:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Fetch;
    use Text::FromAny;

    # Download the sample files into the current directory.
    my $ff1 = File::Fetch->new(
        uri => 'http://cpansearch.perl.org/src/KARMAN/SWISH-Filter-0.15/t/test.ppt');
    my $ff2 = File::Fetch->new(
        uri => 'http://cpansearch.perl.org/src/KARMAN/SWISH-Filter-0.15/t/test.pdf');
    my $where1 = $ff1->fetch() or die $ff1->error;
    my $where2 = $ff2->fetch() or die $ff2->error;

    # Extract the PDF text with Text::FromAny.
    my $tFromAny = Text::FromAny->new(file => 'test.pdf');
    my $text = $tFromAny->text;
    print $text, "\n";

    # Print catppt's -lV output, then dump the PPT text.
    system("/usr/local/bin/catppt -lV");
    print "\n";
    system("/usr/local/bin/catppt test.ppt");

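The script above sends catppt's output straight to the terminal through system(). If the PPT text is needed in a variable for further processing (the relevancy step mentioned in the question), one possible approach is a pipe open. This is only a minimal sketch: the helper name text_from_command is hypothetical, and the usage part assumes catppt is on the PATH rather than at /usr/local/bin.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run an external converter (e.g. catppt) and
# return everything it writes to standard output as one string.
# The list form of open bypasses the shell, so file names containing
# spaces or quotes are passed through unchanged.
sub text_from_command {
    my ($command, @args) = @_;
    open(my $fh, '-|', $command, @args)
        or die "Cannot run $command: $!";
    local $/;                  # slurp mode: read all output at once
    my $text = <$fh>;
    close($fh)
        or die "$command failed (exit status " . ($? >> 8) . ")";
    return defined $text ? $text : '';
}

# Usage sketch (assumes catppt is installed and on the PATH):
#   perl ppt_text.pl test.ppt
if (@ARGV) {
    my $text  = text_from_command('catppt', $ARGV[0]);
    my $words = () = $text =~ /\S+/g;   # count whitespace-separated tokens
    print "Extracted $words words\n";
}
```

The same helper would work for any other command-line converter that writes plain text to standard output, so the PDF and PPT paths could share one code path if desired.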
      Hi Khen1950fx,

      Thanks for the help. When I run the dependency installation code, the "File::LibMagic" installation fails. So I tried to install it separately, but even then, when I run perl Makefile.PL, it shows the error "can't include magic.h".

      What is the problem here?
