about retrieving and parsing html without writing on disk

limner has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all brother monks,

I've successfully written a Perl script that retrieves an HTML page, parses it and,
at the end, prepares a logfile from the HTML page.

At the moment the program does the following:

1) unlink the file from disk, if it exists
2) retrieve the correct HTML page into memory
3) write the HTML page to disk under a standard filename (file.html)
4) read the file from disk (file.html) and parse it
5) write the logfile to disk

What I would like to do is avoid writing "file.html" to disk and work only
in RAM: retrieve the page, NOT write it to disk, and parse it in memory.

These are the program lines that do this:

use WWW::Mechanize;
use HTML::TableExtract;
use HTML::Entities;
use Text::Unidecode;

my $user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13';
my $mech = WWW::Mechanize->new(agent => $user_agent);

my $nomefile = "file.html";  ### name of the temporary file
unlink $nomefile;            ### remove the file

my $url = "http://www.sitename.com/pagespecial.html";
$mech->get($url);
$mech->save_content($nomefile);  ### instruction I would like to change

my $headers = ['col1', 'col2', 'col3', 'col4', 'col5'];

my $table_extract = HTML::TableExtract->new(headers => $headers);

$table_extract->parse_file($nomefile);  ### instruction I would like to change
my ($table) = $table_extract->tables;

Everything works as I want, but this way, every time I parse a page
I remove and rewrite file.html just to parse it.

How can I do everything in memory without writing the file?
Thanks, Limner


Replies are listed 'Best First'.
Re: about retrieving and parsing html without writing on disk by LanX (Saint) on Apr 09, 2018 at 22:15 UTC
Hmm, I'm too busy to install the modules, but it's at least possible to `open` a variable for reading and writing:

`open my $fh , "<", \$cache`

So if you can operate with filehandles instead of files, this should work.

Update: HTML::Parser allows `->parse_file($fh)` and even `->parse($string)`.

Update: Maybe have a look at `$string = $mech->content(...)` from WWW::Mechanize.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
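
For illustration, here is a minimal sketch of the in-memory filehandle idea using core Perl only. The variable `$cache` and its contents are stand-ins (assumptions for the example); in the original script the string would come from the fetched page instead.

```perl
use strict;
use warnings;

# Stand-in for the page body; in the real script this would be the
# fetched HTML rather than a literal string.
my $cache = "<table>\n<tr><td>a</td><td>b</td></tr>\n</table>\n";

# Opening a reference to a scalar gives a read handle backed by memory,
# so nothing is written to disk.
open my $fh, '<', \$cache or die "Cannot open in-memory handle: $!";

# Read it like any ordinary file.
my @lines = <$fh>;
close $fh;

printf "read %d lines from memory\n", scalar @lines;
```

A handle like `$fh` could then be passed to anything that accepts a filehandle, such as `->parse_file($fh)`. One caveat, as far as I know: if the string holds decoded text with characters above 0xFF, the in-memory `open` may fail, in which case handing the string straight to `->parse($string)` sidesteps the problem.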
Re^2: about retrieving and parsing html without writing on disk by rizzo (Curate) on Apr 10, 2018 at 00:30 UTC
Maybe have a look at `$string = $mech->content(...)` from WWW::Mechanize and maybe at HTTP::Response as well, because `$mech->get( $uri )` returns an object of that type.
Re^3: about retrieving and parsing html without writing on disk by Your Mother (Archbishop) on Apr 10, 2018 at 06:10 UTC
Good note for checking `$response->code` and such. Along those lines, for the OP: if you use WWW::Mechanize, remember that it fails hard (dies) on any non-success response, 400s and 500s, unless you set `autocheck => 0`. You also have access to the response object from the mech object with `$mech->response`, so you don't necessarily need a new variable for it.
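
A rough sketch of what that check might look like with `autocheck => 0`. The helper name `fetch_html` is hypothetical (it is not part of WWW::Mechanize), and taking the URL from the command line is just an assumption for the example.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# fetch_html is a hypothetical helper, not a WWW::Mechanize method.
# It returns the page body as a string, or undef if the request failed.
sub fetch_html {
    my ($mech, $url) = @_;

    $mech->get($url);

    # With autocheck => 0 a failed request does not die, so check it here.
    unless ($mech->success) {
        my $res = $mech->response;    # HTTP::Response of the last request
        warn sprintf "GET %s failed: %s\n", $url, $res->status_line;
        return;
    }

    return $mech->content;            # the page stays in memory as a string
}

# Only fetch when a URL is given on the command line.
if (@ARGV) {
    my $mech = WWW::Mechanize->new(autocheck => 0);
    my $html = fetch_html($mech, $ARGV[0]);
    print defined $html ? "got " . length($html) . " characters\n" : "no page\n";
}
```

The string returned here is the kind of value that can go to `->parse($string)`, so `save_content` and the temporary file are not needed.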
Re: about retrieving and parsing html without writing on disk by marto (Cardinal) on Apr 11, 2018 at 09:24 UTC
As before, if you post an example of the table then when I get time I'll put together a solution using Mojo::DOM/Mojo::UserAgent.
Re: about retrieving and parsing html without writing on disk by learnedbyerror (Monk) on Apr 15, 2018 at 19:03 UTC
The short answer is yes, you can. I don't use the exact parsing utilities that you are using, but I routinely use WWW::Mechanize and parse the content. Something like the following should work for you. NOTE: I did not test this exact code.

use HTML::TableExtract;
use WWW::Mechanize;

my $user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13';
my $mech = WWW::Mechanize->new( autocheck => 0, agent => $user_agent );

my $url = "http://www.sitename.com/pagespecial.html";  # URL from the original post
$mech->get($url);                                      # fetch the page into memory

if ( $mech->success ) {
    my $html_string = $mech->content;                  # page body as a string
    my $headers = ['col1', 'col2', 'col3', 'col4', 'col5'];
    my $te = HTML::TableExtract->new( headers => $headers );
    my @tables = $te->parse($html_string)->tables;     # parse from the string, no file
}
...

lbe