I'm a beginner, so be 'gentle'. I have built a program to pull data from a NASDAQ page for certain stocks. HTML::TableExtract looked like a slick way to go. Nasdaq tweaks this page occasionally, and I thought that by adjusting the headers and the count I could easily keep the script working. And yes, the current script does accomplish my goal, but I thought this was a 'teachable moment' to learn more about Perl.

I put this line in a Windows batch file to dump the page data to a file for processing.

perl -e "use LWP::Simple; getprint('http://www.nasdaq.com/extended-trading/premarket-mostactive.aspx')" >> nasdaq-stocks.txt

My biggest issue is that when I pull the HTML file data through HTML::TableExtract, there is a lot of cleanup work to get the tab-delimited format I want. To get there, I fell back to writing to files and parsing/substituting lines until I got the final format.

Here's what the final (good) output looks like.

AMRI    $13.53    $14.75    9.02%    6,984
AUPH    $6.59    $7.07    7.28%    632,035
ATEC    $2.19    $2.30    5.02%    3,880
SBLK    $12.13    $12.71    4.78%    10,123
OCLR    $8.95    $9.29    3.80%    147,875
FRSH    $5.79    $6    3.63%    6,100
KTOS    $7.88    $8.16    3.55%    6,901
INCY    $135.5    $139.75    3.14%    6,734
TVIX    $35.4    $36.45    2.97%    234,847
OSUR    $12.3    $12.65    2.85%    4,500

Here's my code. I tried to add comments to explain what I was thinking. I am hoping there is a cleaner way to use HTML::TableExtract to get really close to the final tab-delimited file.

I assume pulling apart the characters between the open price and the change percent is pretty tricky, but can't the rest of the fields be dropped directly into a tab-delimited file without the extraneous junk?

 
use strict;
use warnings;
use HTML::TableExtract;

#Get HTML file and set up headers for HTML::TableExtract
my $doc = 'nasdaq-stocks.txt';
my $headers = ['Symbol', 'Last Sale*', 'Change Net / %', 'Share Volume'];
#table 4 is advancers.  Need to do again for 5, decliners
my $table_extract = HTML::TableExtract->new(count => 4, headers => $headers);

#parse the nasdaq-stocks.txt file and print to outup-temp.txt file
#?? found this code.  
#Is the code below taking HTML loaded in string $table and 
#breaking into rows to print to a file???

$table_extract->parse_file($doc);
my ($table) = $table_extract->tables;
open my $up, '>', 'outup-temp.txt' or die "Cannot write 'outup-temp.txt': $!\n";
for my $row ($table->rows) {
    print $up @$row, "\n";
}
close $up;

#tried to add the Substitutes below to the loop above 
#but failed miserably
#.. am taking outup-temp.txt 
#and load the array @lines for removing junk in the loop below

my $filename = 'outup-temp.txt' ;
open my $fh , '<' , $filename or die "Cannot read '$filename': $!\n" ;
my @lines = <$fh> ;
close $fh ;

# process the array @lines and remove some of the junk
for ( @lines ) {
  s/^\s+// ; # No need for global substitution
  s/[\x0A\x0D]{3,}/\t/g; # 3 CR LF become a tab

#double tab-change to one tab - never got this to work??
# s/[\x09]{2,}/\t/g; 

  s/\$//g; # Substitute all dollar signs with nothing
  s/\x20/\t/g; # space becomes a tab
# Change chars between open and change pct to tab
  s/\xC2\xA0\xE2\x96\xB2\xC2\xA0/\t/; 
}

#write cleaned lines to outup-temp.txt
open $fh , '>' , $filename or die "Cannot write '$filename': $!\n" ;
print $fh @lines ;
close $fh ;

# now that we have some tab delimiters, use split to break out the 
# fields and calculate the closing price, then write to file
my $stock;
my $filler1;
my $openpr;
my $change;
my $pct;
my $vol;
my $filler2;
my $closepr;

open (FILE, '<', 'outup-temp.txt') or die "Cannot read 'outup-temp.txt': $!\n";
open STDOUT, '>', "outup.txt" or die "Cannot write 'outup.txt': $!\n";
while (<FILE>) {
    chomp;
    ($stock,$filler1,$openpr,$change,$pct,$vol,$filler2) = split("\t");
    #calculate closing price from prior day for advancers
    $closepr = $openpr-$change;
    #add back $ signs - print tab delimited fields to file
    print "$stock\t\$$closepr\t\$$openpr\t$pct\t$vol\n";
}
close(FILE);
close (STDOUT);
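
To make the question concrete, here is roughly the shape I imagine a version without the temp files could take: each row is cleaned while it is still an array of cells. This is only a minimal sketch with assumptions. It assumes each row arrives as four cells in header order (symbol, last sale, net change plus percent, volume), and that the change cell holds the net change, some separator characters, then the percent. `clean_row` is a hypothetical helper name, not something HTML::TableExtract provides. The sample row is reconstructed from the first line of the good output above, so its exact cell contents are a guess.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableExtract): turn one row,
# given as an array ref of cell strings, into a tab-delimited line.
sub clean_row {
    my ($row) = @_;
    return unless @$row >= 4;

    # Trim leading/trailing whitespace (including newlines) from each cell.
    my ($stock, $last, $change_cell, $vol) = map {
        my $cell = defined $_ ? $_ : '';
        $cell =~ s/^\s+|\s+$//g;
        $cell;
    } @$row;

    # Drop the dollar sign and any thousands separators so the price is numeric.
    $last =~ s/[\$,]//g;

    # Assumed cell layout: net change, separator junk (nbsp, arrow), percent.
    my ($change, $pct) = $change_cell =~ /([0-9.]+)[^0-9.]+([0-9.]+%)/
        or return;

    # Prior-day close for advancers, as in the script above.
    my $close = $last - $change;

    return join "\t", $stock, "\$$close", "\$$last", $pct, $vol;
}

# Demo with a sample row shaped like the assumed input.
my @sample = ('AMRI', ' $14.75 ', "1.22\xC2\xA0\xE2\x96\xB2\xC2\xA09.02%", "6,984\n");
my $line = clean_row(\@sample);
print "$line\n" if defined $line;
```

In the script above this would presumably be called once per `$row` from `$table->rows`, printing the result only when a line comes back. The subtraction only fits the advancers table, so decliners would need the sign flipped.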

Thanks in advance. I have googled many things from this website that helped me get my kludgy code working.

In reply to HTML::TableExtract - ugly - is there better way? by rtwolfe
