in reply to Browser automation to copy webpage to text
Hello eversuhoshin,
"I would like to copy what I see ... to a word or text file, exactly as I see it on the browser"
The requirement is unclear, especially as an HTML page contains markup which can’t be translated into plain text.
Here is a plain text approach using LWP::Simple to get the web page and HTML::FormatText to extract the text from the HTML:
#! perl
use strict;
use warnings;
use HTML::FormatText;
use LWP::Simple;

my $address = 'http://www.sec.gov/Archives/edgar/data/1557421/' .
              '000100201412000509/iogcs1-9132012.htm';
my $content = get($address);
defined $content or die "Cannot read '$address': $!";

my $string  = HTML::FormatText->format_string
              (
                  $content,
                  leftmargin  =>  5,
                  rightmargin => 75,
              );

print $string;
Output (opening lines only):
13:44 >perl 1415_SoPW.pl
Wide character in print at 1415_SoPW.pl line 19.
     S-1 1 iogcs1-9132012.htm INFINITY OIL & GAS COMPANY FORM S-1 (9/13/2012).

     Registration No. _________________________

     ----------------------------------------------------------------------
     ----------------------------------------------------------------------
     SECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549_________________FO
     RM S-1REGISTRATION STATEMENT UNDERTHE SECURITIES ACT OF 1933

     INFINITY OIL & GAS COMPANY(Name of small business issuer in its charter)

     Nevada 1081 (State or Other Jurisdiction of Organization) (Primary
     Standard Industrial Classification Code) _________________ 750 Broadway
I’m not sure whether that output suits your needs. You could also look at HTML::HTML5::ToText.
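The "Wide character in print" warning in the output above appears because the formatted text contains characters above 0xFF while STDOUT has no encoding layer. A minimal sketch of one way to silence it, assuming UTF-8 output is acceptable; the inline $content string is a hypothetical stand-in for the downloaded page, so the example runs without network access:

```perl
use strict;
use warnings;
use HTML::FormatText;

# Hypothetical stand-in for the downloaded page. \x{2014} is a dash
# character above 0xFF, the kind that triggers "Wide character in print".
my $content = "<html><body><p>INFINITY OIL &amp; GAS COMPANY \x{2014} FORM S-1</p></body></html>";

my $string  = HTML::FormatText->format_string
              (
                  $content,
                  leftmargin  =>  5,
                  rightmargin => 75,
              );

# Encode characters as UTF-8 on the way out, so print no longer warns.
binmode STDOUT, ':encoding(UTF-8)';

print $string;
```

The same binmode line can be placed before print $string in the script above. When writing to a file instead, the equivalent is to open it with the '>:encoding(UTF-8)' mode.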
To produce a Word-readable file, change HTML::FormatText to HTML::FormatRTF:
use strict;
use warnings;
use HTML::FormatRTF;
use LWP::Simple;

my $outfile = 'test.rtf';
my $address = 'http://www.sec.gov/Archives/edgar/data/1557421/' .
              '000100201412000509/iogcs1-9132012.htm';
my $content = get($address);
defined $content or die "Cannot read '$address': $!";

open(my $rtf, '>', $outfile)
    or die "Cannot open file '$outfile' for writing: $!";

print $rtf HTML::FormatRTF->format_string($content);

close $rtf or die "Cannot close file '$outfile': $!";
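One caveat on the error handling in the scripts above: as far as I know, LWP::Simple's get just returns undef on failure and does not set $!, so the die message may not say why the download failed. A sketch of an alternative using LWP::UserAgent, which exposes the HTTP status line; fetch_page is a hypothetical helper name, and the 30-second timeout is an arbitrary choice:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# fetch_page is a hypothetical helper, not part of LWP itself.
# It returns the decoded page content, or dies with the status line.
sub fetch_page
{
    my ($address) = @_;
    my $ua        = LWP::UserAgent->new(timeout => 30);
    my $response  = $ua->get($address);

    # status_line reports e.g. "404 Not Found", which $! cannot.
    $response->is_success
        or die "Cannot read '$address': " . $response->status_line . "\n";

    return $response->decoded_content;
}

# Usage: pass the address on the command line.
if (@ARGV)
{
    my $content = fetch_page($ARGV[0]);
    print length($content), " characters read\n";
}
```

The returned string can be passed to HTML::FormatText->format_string or HTML::FormatRTF->format_string exactly as $content is above.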
Hope that helps,
Athanasius <°(((>< contra mundum    Iustus alius egestas vitae, eros Piratica,