TomBombadil has asked for the wisdom of the Perl Monks concerning the following question:

I started with Perl a few days ago and have now reached a point where I don't see how to get any further. I'd like to have a text file created for each id (currently max. 30000) on www.securityfocus.com/bid. The information I need is in a table, which is why I use depth and count. So far I have coded the following:
    #!C:\perl\bin\perl.exe -w
    # Purpose: Script for extracting data from tables and writing it to a text file
    # Version: 0.3
    print "Content-type: text/html\n\n";
    use CGI::Carp qw(fatalsToBrowser);
    use strict;
    use HTML::TableExtract;
    my $table;          # table of interest
    my $html_file = "http://www.securityfocus.com/bid"; # url of web site
    my $te;             # table extract
    my $ts;             # table search
    my $row;            # row of table of interest
    my @securityfocus;  # array
    for (1..30000) {
        my $table = $html_file."/".$_;
        $te = HTML::TableExtract->new( depth => 1, count => 0 );
        $te->parse_file($table);
    }
    foreach $ts ($te->tables) {
        print "Table found at ", join(',', $ts->coords), ":\n";
        foreach $row ($ts->rows) {
            print "   ", join(',', @$row), "\n";
        }
    }
    @securityfocus = ("Bugtraq ID: \n","Class: \n","CVE: \n","Remote: \n","Local: \n",
                      "Published: \n","Updated: \n","Credit: \n","Vulnerable: \n","Not Vulnerable: \n");
    open(OUTPUTFILE,">bid.txt") or die "Can't open bid.txt $!";
    print OUTPUTFILE @securityfocus;
    close(OUTPUTFILE) or die "Can't close bid.txt $!";
    open(OUTPUTFILE,"bid.txt") or die "Can't open bid.txt $!";
    while (<OUTPUTFILE>) {
        chomp;
        print " $_ \n";
    }
    close(OUTPUTFILE) or die "Can't close bid.txt $!";
I appreciate your help - Tom

Replies are listed 'Best First'.
Re: Extract table info and create txt file
by Util (Priest) on Jun 07, 2007 at 15:54 UTC

    You are attempting to web-scrape 30,000 pages from a single commercial site. If you have not obtained permission to do so, this could be considered abusive behavior, especially because 25% of your requests would be to non-existent pages (there are only about 22,500 pages, with gaps in the numbering), and you seem to have no plan for caching the pages (30,000 page requests each time you test your program).

    Before you pursue this further, please see the Download Page for the National Vulnerability Database.

    NVD/CVE XML Data Files (all up-to-date as of today!):
         3.8MB  nvdcve-2007.xml
        10.9MB  nvdcve-2006.xml
         6.8MB  nvdcve-2005.xml
         4.3MB  nvdcve-2004.xml
         1.9MB  nvdcve-2003.xml
         7.7MB  nvdcve-2002.xml      vulnerabilities prior to and including 2002
         0.2MB  nvdcve-recent.xml    all recently published vulnerabilities
         0.2MB  nvdcve-modified.xml  all recently published and recently updated vulnerabilities
    If these files contain the data you need, then this is a *much* better way to proceed.

    Whether you use the HTML pages or the recommended XML files, you should download them as a separate step from your Perl code. You can do the downloading via a second Perl program using LWP, or via a specialized download tool like `wget` or (my favorite on Linux and Win32) cURL. Once you have your source data downloaded, only then should you tackle the parsing. Let us know if you need help with that parsing.
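
    To illustrate the "parse separately" half of that advice, here is a minimal sketch that develops the HTML::TableExtract logic against a small inline HTML fragment rather than the live site. The fragment and its field values are made up for illustration; once the parsing works, you would point parse() (or parse_file()) at your locally cached copies instead.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TableExtract;

    # A stand-in for one cached page: a tiny table in the
    # "label / value" shape the bid pages use. Invented data.
    my $html = <<'HTML';
    <table>
      <tr><td>Bugtraq ID</td><td>12345</td></tr>
      <tr><td>Class</td><td>Input Validation Error</td></tr>
    </table>
    HTML

    my $te = HTML::TableExtract->new( depth => 0, count => 0 );
    $te->parse($html);

    # Collect each row as "label: value" before printing,
    # so the same data could later be written to a text file.
    my @lines;
    for my $ts ( $te->tables ) {
        for my $row ( $ts->rows ) {
            push @lines, join ': ', map { defined $_ ? $_ : '' } @$row;
        }
    }
    print "$_\n" for @lines;

    Working against a fixed local fragment like this means each test run is instant and hits the network zero times; swapping in the cached files is then a one-line change.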