HTML::TableExtract - ugly - is there better way?

rtwolfe has asked for the wisdom of the Perl Monks concerning the following question:

Am a beginner so be 'gentle'. Have built a program to pull data from a NASDAQ page for certain stocks. HTML::TableExtract looked like a slick way to go. Nasdaq will tweak this page occasionally and thought by manipulating headers and the count, I could easily keep the script working. And yes, this current script does accomplish my goal. But thought this was a 'teachable moment' to learn more about Perl.

I put this line in a windows batch file to dump the page data to a file for processing.

perl -e "use LWP::Simple; getprint('http://www.nasdaq.com/extended-tra
+ding/premarket-mostactive.aspx')" >> nasdaq-stocks.txt
[download]

My biggest issue is that when I pull the HTML file data through HTML::TableExtract, there is a lot of clean up work to get the tab delimited format wanted. But to get there, I fell back to writing to files and parsing / substituting rows/lines until I got in the final format.

Here's what the final (good) output looks like.

AMRI    $13.53    $14.75    9.02%    6,984
AUPH    $6.59    $7.07    7.28%    632,035
ATEC    $2.19    $2.30    5.02%    3,880
SBLK    $12.13    $12.71    4.78%    10,123
OCLR    $8.95    $9.29    3.80%    147,875
FRSH    $5.79    $6    3.63%    6,100
KTOS    $7.88    $8.16    3.55%    6,901
INCY    $135.5    $139.75    3.14%    6,734
TVIX    $35.4    $36.45    2.97%    234,847
OSUR    $12.3    $12.65    2.85%    4,500
[download]

Here's my code - I tried to enter comments to explain what I was thinking. Am hoping there is a cleaner way to use HTML::TableExtract to get real close to the final tab delimtied file.

Assume pulling apart the characters between the open price and change percent, is pretty tricky but can't the rest of the fields get dropped directly to a tab delimited file without the extraneous junk?

 
use strict;
use warnings;
use HTML::TableExtract;

#Get HTML file and set up headers for HTML::TableExtract
my $doc = 'nasdaq-stocks.txt';
my $headers = ['Symbol', 'Last Sale*', 'Change Net / %', 'Share Volume
+'];
#table 4 is advances.  Need to do again for 5 decliners
my $table_extract = HTML::TableExtract->new(count => 4, headers => $he
+aders);  

#parse the nasdaq-stocks.txt file and print to outup-temp.txt file
#?? found this code.  
#Is the code below taking HTML loaded in string $table and 
#breaking into rows to print to a file???

$table_extract->parse_file($doc);
my ($table) = $table_extract->tables;
open (UPFILE, '>outup-temp.txt');
    for my $row ($table->rows) {
        print UPFILE @$row, "\n";
}
close(UPFILE);

#tried to add the Substitutes below to the loop above 
#but failed miserably
#.. am taking outup-temp.txt 
#and load the array @lines for removing junk in the loop below

my $filename = 'outup-temp.txt' ;
open my $fh , '<' , $filename or die "Cannot read '$filename': $!\n" ;
my @lines = <$fh> ;
close $fh ;

# process the array @lines and remove some of the junk
for ( @lines ) {
  s/^\s+// ; # No need for global substitution
  s/[\x0A\x0D]{3,}/\t/g; # 3 CR LF become a tab

#double tab-change to one tab - never got this to work??
# s/[\x09]{2,}/\t/g; 

  s/\$//g; # Substitute all dollar signs with nothing
  s/\x20/\t/g; # space becomes a tab
# Change chars between open and change pct to tab
  s/\xC2\xA0\xE2\x96\xB2\xC2\xA0/\t/; 
}

#write cleaned lines to outup-temp.txt
open $fh , '>' , $filename or die "Cannot write '$filename': $!\n" ;
print $fh @lines ;
close $fh ;

# now that we have some tab delimiters, use split to break out the 
# fields and calculate the closing price, then write to file
my $stock;
my $filler1;
my $openpr;
my $change;
my $pct;
my $vol;
my $filler2;
my $closepr;

open (FILE, 'outup-temp.txt');
open STDOUT, '>', "outup.txt";
while (<FILE>) {
     chomp;
    ($stock,$filler1,$openpr,$change,$pct,$vol,$filler2) = split("\t")
+;
        #calculate closing price from prior day for advancers
    $closepr = $openpr-$change; 
        #add back $ signs - print tab delimited fields to file
    print "$stock\t\$$closepr\t\$$openpr\t$pct\t$vol\n"; 
}
close(FILE);
close (STDOUT);
[download]

Thanks in advance. Have googled many things from this website that helped to get my kludg-ie code working

Comment on HTML::TableExtract - ugly - is there better way? Select or Download Code

Replies are listed 'Best First'.
Re: HTML::TableExtract - ugly - is there better way? by NetWallah (Canon) on Apr 09, 2017 at 07:27 UTC
Try this: use strict; use warnings; use HTML::TableExtract; #Get HTML file and set up headers for HTML::TableExtract my $doc = 'nasdaq-stocks.txt'; my $html = do{ local $/=undef; open my $f,"<", $doc or die $!;<$f>}; my $headers = ['Symbol', 'Last Sale', 'Change Net / %', 'Share Volume +']; #table 4 is advances. Need to do again for 5 decliners my $table_extract = HTML::TableExtract->new(count => 4, headers => $he +aders); $table_extract->parse($html); print join (" \t",@$headers),"\n"; for my $r ($table_extract->rows()){ my @cols = map {/([\w\.]+)\W+([\w\.\%])/} @$r; print join ("\t",@cols), "\n"; } [download] It would take a little work to put the "$" back in front of the "Last Sale" amount, but this should get you started. ...it is unhealthy to remain near things that are in the process of blowing up. man page for WARP, by Larry Wall*	[reply] [d/l]
Re^2: HTML::TableExtract - ugly - is there better way? by rtwolfe (Initiate) on Apr 10, 2017 at 03:31 UTC
Thanks NetWallah. Had not heard about Map before. Still a little confusing. Not sure what @$r 'is'. Assume is parsed version of @cols. Need to decode your regex bit by bit. But, thanks so much for fast response.	[reply]
Re^3: HTML::TableExtract - ugly - is there better way? by NetWallah (Canon) on Apr 10, 2017 at 03:39 UTC
This site loves to explain code .. so keep those questions coming. @$r is the same as @{ $r } . $r is an array-reference ... adding the @{ } around it converts that into an array that can be iterated. 'map' will "transform" each element of the array, returning a modified array. The "transformation" is in the form of a regular-expression - in this case, it extracts "word" type characters (\w), decimals (.) and so on. See perlre. The result of this map is stored in the array @cols, which is later printed. Hope this helps. If still unclear .. experiment, and come back with more specific questions. ...it is unhealthy to remain near things that are in the process of blowing up. man page for WARP, by Larry Wall	[reply]
Re: HTML::TableExtract - ugly - is there better way? by poj (Abbot) on Apr 09, 2017 at 14:52 UTC
If you know the characters you want, just eliminate everything else. #!perl use strict; use warnings; use HTML::TableExtract; use LWP::UserAgent (); my $url = 'http://www.nasdaq.com/extended-trading/premarket-mostactive +.aspx'; my $headers = ['Symbol', 'Last Sale*', 'Change Net / %', 'Share Volume +']; my $ua = LWP::UserAgent->new; $ua->timeout(10); $ua->env_proxy; my $response = $ua->get($url); if ( !$response->is_success) { die $response->status_line; } my $htm = $response->decoded_content; # table4 my $table_extract = HTML::TableExtract->new( count => 4, headers => $headers); my $tbl = $table_extract->parse($htm); my $data = cleanup($tbl); report('Advances',$data); # table5 $table_extract = HTML::TableExtract->new( count => 5, headers => $headers); $tbl = $table_extract->parse($htm); $data = cleanup($tbl); report('Decliners',$data); sub cleanup { my $table = shift; my @data = (); for my $row ($table->rows) { my @clean = map{ s/[^A-Z0-9%,+-\.]/ /g; # allowable s/^ +\| +$//g; # trim spaces $_ } @$row; push @data,\@clean; } return \@data; } sub report { my ($title,$data) = @_; print "$title\n"; for (@$data){ my ($stock,$openpr,$tmp,$vol) = @$_; my ($change,$pct) = split / +/,$tmp; my $closepr = $openpr - $change; print join "\t",($stock,'$'.$closepr,'$'.$openpr,$pct,$vol); print "\n"; } } [download] poj	[reply] [d/l]
Re^2: HTML::TableExtract - ugly - is there better way? by rtwolfe (Initiate) on Apr 10, 2017 at 03:25 UTC
POJ - WOW! Impressed by how quick one gets a response. Thank you for adding code to pull the webpage in the script. Also, need to learn to be 'smarter' about parsing. Never occurred to me to just take out the special characters and replace with space. Using the 'pipe' to drop leading and trailing spaces is cool too. Haven't run into Map before but read a little about it today. Need to understand what @$row is/does. Using subroutines was helpful too. This beginner is very appreciative.	[reply]