rtwolfe has asked for the wisdom of the Perl Monks concerning the following question:
I am a beginner, so be 'gentle'. I have built a program to pull data from a NASDAQ page for certain stocks. HTML::TableExtract looked like a slick way to go. Nasdaq will tweak this page occasionally, and I thought that by manipulating the headers and the count I could easily keep the script working. And yes, the current script does accomplish my goal, but I thought this was a 'teachable moment' to learn more about Perl.
I put this line in a Windows batch file to dump the page data to a file for processing:

    perl -e "use LWP::Simple; getprint('http://www.nasdaq.com/extended-trading/premarket-mostactive.aspx')" >> nasdaq-stocks.txt

My biggest issue is that when I pull the HTML file data through HTML::TableExtract, there is a lot of cleanup work to get the tab-delimited format I want. To get there, I fell back to writing to files and parsing/substituting rows and lines until I got the final format.
Here's what the final (good) output looks like.
    AMRI	$13.53	$14.75	9.02%	6,984
    AUPH	$6.59	$7.07	7.28%	632,035
    ATEC	$2.19	$2.30	5.02%	3,880
    SBLK	$12.13	$12.71	4.78%	10,123
    OCLR	$8.95	$9.29	3.80%	147,875
    FRSH	$5.79	$6	3.63%	6,100
    KTOS	$7.88	$8.16	3.55%	6,901
    INCY	$135.5	$139.75	3.14%	6,734
    TVIX	$35.4	$36.45	2.97%	234,847
    OSUR	$12.3	$12.65	2.85%	4,500
Here's my code - I tried to enter comments to explain what I was thinking. I am hoping there is a cleaner way to use HTML::TableExtract to get really close to the final tab-delimited file.
I assume pulling apart the characters between the open price and the change percent is pretty tricky, but can't the rest of the fields get dropped directly into a tab-delimited file without the extraneous junk?
    use strict;
    use warnings;
    use HTML::TableExtract;

    #Get HTML file and set up headers for HTML::TableExtract
    my $doc = 'nasdaq-stocks.txt';
    my $headers = ['Symbol', 'Last Sale*', 'Change Net / %', 'Share Volume'];

    #table 4 is advances. Need to do again for 5 decliners
    my $table_extract = HTML::TableExtract->new(count => 4, headers => $headers);

    #parse the nasdaq-stocks.txt file and print to outup-temp.txt file
    #?? found this code.
    #Is the code below taking HTML loaded in string $table and
    #breaking into rows to print to a file???
    $table_extract->parse_file($doc);
    my ($table) = $table_extract->tables;
    open (UPFILE, '>outup-temp.txt');
    for my $row ($table->rows) {
        print UPFILE @$row, "\n";
    }
    close(UPFILE);

    #tried to add the Substitutes below to the loop above
    #but failed miserably
    #.. am taking outup-temp.txt
    #and load the array @lines for removing junk in the loop below
    my $filename = 'outup-temp.txt' ;
    open my $fh , '<' , $filename or die "Cannot read '$filename': $!\n" ;
    my @lines = <$fh> ;
    close $fh ;

    # process the array @lines and remove some of the junk
    for ( @lines ) {
        s/^\s+// ; # No need for global substitution
        s/[\x0A\x0D]{3,}/\t/g; # 3 CR LF become a tab
        #double tab-change to one tab - never got this to work??
        # s/[\x09]{2,}/\t/g;
        s/\$//g; # Substitute all dollar signs with nothing
        s/\x20/\t/g; # space becomes a tab
        # Change chars between open and change pct to tab
        s/\xC2\xA0\xE2\x96\xB2\xC2\xA0/\t/;
    }

    #write cleaned lines to outup-temp.txt
    open $fh , '>' , $filename or die "Cannot write '$filename': $!\n" ;
    print $fh @lines ;
    close $fh ;

    # now that we have some tab delimiters, use split to break out the
    # fields and calculate the closing price, then write to file
    my $stock;
    my $filler1;
    my $openpr;
    my $change;
    my $pct;
    my $vol;
    my $filler2;
    my $closepr;
    open (FILE, 'outup-temp.txt');
    open STDOUT, '>', "outup.txt";
    while (<FILE>) {
        chomp;
        ($stock,$filler1,$openpr,$change,$pct,$vol,$filler2) = split("\t");
        #calculate closing price from prior day for advancers
        $closepr = $openpr-$change;
        #add back $ signs - print tab delimited fields to file
        print "$stock\t\$$closepr\t\$$openpr\t$pct\t$vol\n";
    }
    close(FILE);
    close (STDOUT);
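
One possible way to skip the temporary files is to clean each row while it is still an array of cells, then join the cells with tabs. The sketch below is only an illustration under assumptions: `row_to_tsv` is a hypothetical helper (not part of HTML::TableExtract), the four cells are assumed to arrive in header order, the exact cell contents are guessed from the substitutions in the script above, and the sample change of 1.22 is derived from the first line of the expected output (14.75 minus 13.53). It handles advancers only, as the original script does.

```perl
use strict;
use warnings;

# Hypothetical helper: convert one table row into a tab-delimited line.
# Assumes $row is an array ref of four cells in header order:
# Symbol, Last Sale, Change Net / %, Share Volume, and that the row is
# an advancer (prior close = last sale - change).
sub row_to_tsv {
    my ($row) = @_;
    my ($symbol, $last, $change_cell, $volume) =
        map { defined $_ ? $_ : '' } @{$row}[0 .. 3];

    # Strip leading and trailing whitespace, including stray newlines.
    for ($symbol, $last, $volume) {
        s/^\s+//;
        s/\s+$//;
    }

    # Keep only digits and the decimal point ("$ 14.75" becomes "14.75").
    $last =~ s/[^0-9.]//g;

    # The change cell holds: net change, separator junk, percent.
    # Capture the two numbers and ignore whatever sits between them,
    # so the exact bytes of the separator do not matter.
    my ($change, $pct) = $change_cell =~ /([0-9.]+)[^0-9.]+([0-9.]+%)/
        or return;    # no match: skip this row

    my $close = sprintf '%.2f', $last - $change;
    return join("\t", $symbol, "\$$close", "\$$last", $pct, $volume) . "\n";
}

# Sample row with an assumed cell layout; the separator bytes are the
# ones the original script substitutes (UTF-8 for NBSP, U+25B2, NBSP).
my @sample = (
    "\r\n AMRI \r\n",
    '$ 14.75',
    "1.22\xC2\xA0\xE2\x96\xB2\xC2\xA09.02%",
    "6,984\r\n",
);
print row_to_tsv(\@sample);
```

Inside the existing `for my $row ($table->rows)` loop, the result of `row_to_tsv($row)` could be printed straight to the output file when it is defined. Note that `sprintf '%.2f'` always prints two decimals, so a prior close of 6 would appear as `$6.00` rather than `$6` as in the sample output above.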
Thanks in advance. I have googled many things from this website that helped get my kludgy code working.
Replies are listed 'Best First'.

Re: HTML::TableExtract - ugly - is there better way?
by NetWallah (Canon) on Apr 09, 2017 at 07:27 UTC
by rtwolfe (Initiate) on Apr 10, 2017 at 03:31 UTC
by NetWallah (Canon) on Apr 10, 2017 at 03:39 UTC

Re: HTML::TableExtract - ugly - is there better way?
by poj (Abbot) on Apr 09, 2017 at 14:52 UTC
by rtwolfe (Initiate) on Apr 10, 2017 at 03:25 UTC