in reply to Browser automation to copy webpage to text

Hello eversuhoshin,

I would like to copy what I see ... to a word or text file, exactly as I see it on the browser

The requirement is unclear, especially as an HTML page contains markup which can’t be translated into plain text.

Here is a plain text approach using LWP::Simple to get the web page and HTML::FormatText to extract the text from the HTML:

#! perl
use strict;
use warnings;
use HTML::FormatText;
use LWP::Simple;

my $address = 'http://www.sec.gov/Archives/edgar/data/1557421/' .
              '000100201412000509/iogcs1-9132012.htm';
my $content = get($address);
defined $content or die "Cannot read '$address': $!";

my $string = HTML::FormatText->format_string
             (
                 $content,
                 leftmargin  =>  5,
                 rightmargin => 75,
             );

print $string;

Output (opening lines only):

13:44 >perl 1415_SoPW.pl
Wide character in print at 1415_SoPW.pl line 19.
S-1 1 iogcs1-9132012.htm INFINITY OIL & GAS COMPANY FORM S-1 (9/13/2012).

Registration No. _________________________
----------------------------------------------------------------------
----------------------------------------------------------------------
SECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549_________________FO
RM S-1REGISTRATION STATEMENT UNDERTHE SECURITIES ACT OF 1933
INFINITY OIL & GAS COMPANY(Name of small business issuer in its charter)
Nevada 1081
(State or Other Jurisdiction of Organization) (Primary Standard Industrial Classification Code)
_________________
750 Broadway

I’m not sure whether that output suits your needs. You could also look at HTML::HTML5::ToText.
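
As for the "Wide character in print" warning in the output above: it most likely means the formatted text contains characters above code point 255 while STDOUT has no encoding layer. Here is a minimal sketch of one common remedy, assuming UTF-8 output is acceptable. The sample string is a made-up stand-in for the formatter's output, not text taken from the actual page:

```perl
use strict;
use warnings;

# Made-up stand-in for the text returned by HTML::FormatText
# (it contains an em dash, U+2014, which is a "wide" character)
my $string = "Registration No. \x{2014} FORM S-1\n";

# Encode everything printed to STDOUT as UTF-8; without an encoding
# layer, print() warns "Wide character in print"
binmode STDOUT, ':encoding(UTF-8)';

print $string;
```

In the script above, the binmode line would go anywhere before the final print.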

To produce a Word-readable file, change HTML::FormatText to HTML::FormatRTF:

use strict;
use warnings;
use HTML::FormatRTF;
use LWP::Simple;

my $outfile = 'test.rtf';
my $address = 'http://www.sec.gov/Archives/edgar/data/1557421/' .
              '000100201412000509/iogcs1-9132012.htm';
my $content = get($address);
defined $content or die "Cannot read '$address': $!";

open(my $rtf, '>', $outfile)
    or die "Cannot open file '$outfile' for writing: $!";

print $rtf HTML::FormatRTF->format_string($content);

close $rtf or die "Cannot close file '$outfile': $!";
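
One caveat on both scripts: LWP::Simple's get() signals failure only by returning undef, so $! may not hold a useful message when the download fails. If you need to know why a fetch failed, something along these lines with LWP::UserAgent should report the HTTP status instead. This is only a sketch: fetch_page is a hypothetical helper name, and the 30-second timeout is an arbitrary choice.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: fetch a URL, or die with the status line on failure
sub fetch_page {
    my ($url) = @_;

    my $ua       = LWP::UserAgent->new(timeout => 30);   # timeout is arbitrary
    my $response = $ua->get($url);

    $response->is_success
        or die "Cannot read '$url': " . $response->status_line . "\n";

    # decoded_content applies the character set declared in the response, if any
    return $response->decoded_content;
}

# Usage: perl fetch_page.pl URL
if (my $url = shift @ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print fetch_page($url);
}
```

The returned string can be passed to format_string() exactly as $content is above.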

Hope that helps,

Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re^2: Browser automation to copy webpage to text
by eversuhoshin (Sexton) on Oct 21, 2015 at 04:59 UTC

    thank you so much! this is very helpful

    would there be a way for me to save the entire webpage as a pdf instead of an rtf?

    I realize even with rtf, some formats are broken

    Ideally, I would like the webpage to be saved in pdf and then copied to microsoft word

    Again, thank you so much

      For a Perl solution, you can try PDF::FromHTML — if you can get it to install. :-(

      For automated, non-Perl solutions, you can look at something like HTMLDOC (free, but you have to build it from source), or Doxillion Document Converter (not free).

      But you’ll probably get the best results by manually saving (or “printing”) the page to PDF format in your browser. For example, in Google Chrome select Print..., then under Destination click the Change button and select Save as PDF. In Firefox, install the “Save as PDF” add-on which places a Save as PDF by pdfcrown.com button on the address bar.

      You may be able to automate this browser-based approach from Perl via a module such as WWW::Mechanize::Firefox; but that’s way outside my experience.
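
      A different route to the same browser-based result, which I have not tried against this page: recent versions of Chrome and Chromium can print a page to PDF from the command line via the --headless and --print-to-pdf options. Below is a rough sketch of driving that from Perl. It assumes such a browser is installed; the binary name 'google-chrome' is a guess that varies by platform, and chrome_pdf_command is a hypothetical helper name.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: build the argument list for a headless Chrome run.
      # The default binary name is an assumption; pass your own as the third argument.
      sub chrome_pdf_command {
          my ($url, $pdf_file, $chrome) = @_;
          $chrome //= 'google-chrome';
          return ($chrome, '--headless', "--print-to-pdf=$pdf_file", $url);
      }

      # Usage: perl save_pdf.pl URL OUTPUT.pdf
      if (@ARGV == 2) {
          my @command = chrome_pdf_command(@ARGV);

          # List-form system() bypasses the shell, so the URL needs no quoting
          system(@command) == 0
              or die "Command '@command' failed (status $?)\n";
      }
      ```

      Building the argument list in its own subroutine keeps the browser name and options in one place, so they are easy to adjust for your system.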

      Hope that helps,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,