TomBombadil has asked for the wisdom of the Perl Monks concerning the following question:

I started with Perl a few days ago and have now reached a point where I don't see how to get any further. I'd like to have a text file created for each id (currently max. 30000) on www.securityfocus.com/bid. The information I need is in a table, which is why I use depth and count. So far I have coded the following:
    #!C:\perl\bin\perl.exe -w
    # Purpose: Script for extracting data from tables and writing it to a text file
    # Version: 0.3
    print "Content-type: text/html\n\n";
    use CGI::Carp qw(fatalsToBrowser);
    use strict;
    use HTML::TableExtract;
    my $table;          # table of interest
    my $html_file = "http://www.securityfocus.com/bid"; # url of web site
    my $te;             # table extract
    my $ts;             # table search
    my $row;            # row of table of interest
    my @securityfocus;  # array
    for (1..30000) {
        my $table = $html_file."/".$_;
        $te = HTML::TableExtract->new( depth => 1, count => 0 );
        $te->parse_file($table);
    }
    foreach $ts ($te->tables) {
        print "Table found at ", join(',', $ts->coords), ":\n";
        foreach $row ($ts->rows) {
            print "   ", join(',', @$row), "\n";
        }
    }
    @securityfocus = ("Bugtraq ID: \n","Class: \n","CVE: \n","Remote: \n","Local: \n",
                      "Published: \n","Updated: \n","Credit: \n","Vulnerable: \n","Not Vulnerable: \n");
    open(OUTPUTFILE,">bid.txt") or die "Can't open bid.txt $!";
    print OUTPUTFILE @securityfocus;
    close(OUTPUTFILE) or die "Can't close bid.txt $!";
    open(OUTPUTFILE,"bid.txt") or die "Can't open bid.txt $!";
    while (<OUTPUTFILE>) {
        chomp;
        print " $_ \n";
    }
    close(OUTPUTFILE) or die "Can't close bid.txt $!";
I appreciate your help - Tom

Replies are listed 'Best First'.
Re: Extract table info and create txt file
by Util (Priest) on Jun 07, 2007 at 15:54 UTC

    You are attempting to web-scrape 30,000 pages from a single commercial site. If you have not obtained permission to do so, this could be considered abusive behavior, especially because 25% of your requests would be to non-existent pages (there are only about 22,500 pages, with gaps in the numbering), and you seem to have no plan for caching the pages (30,000 page requests each time you test your program).

    Before you pursue this further, please see the Download Page for the National Vulnerability Database.

    NVD/CVE XML Data Files (all up-to-date as of today!):
         3.8MB  nvdcve-2007.xml
        10.9MB  nvdcve-2006.xml
         6.8MB  nvdcve-2005.xml
         4.3MB  nvdcve-2004.xml
         1.9MB  nvdcve-2003.xml
         7.7MB  nvdcve-2002.xml      vulnerabilities prior to and including 2002
         0.2MB  nvdcve-recent.xml    all recently published vulnerabilities
         0.2MB  nvdcve-modified.xml  all recently published and recently updated vulnerabilities
    If these files contain the data you need, then this is a *much* better way to proceed.

    Whether you use the HTML pages or the recommended XML files, you should download them as a separate step from your Perl code. You can do the downloading via a second Perl program using LWP, or via a specialized download tool like `wget` or (my favorite on Linux and Win32) cURL. Once you have your source data downloaded, only then should you tackle the parsing. Let us know if you need help with that parsing.
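
    To illustrate the "parse separately" half of that advice, here is a minimal sketch that develops the HTML::TableExtract logic against a small inline HTML fragment rather than the live site. The fragment and its field values are made up for illustration; once the parsing works, you would point parse() (or parse_file()) at your locally cached copies instead.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TableExtract;

    # A stand-in for one cached page: a tiny table in the
    # "label / value" shape the bid pages use. Invented data.
    my $html = <<'HTML';
    <table>
      <tr><td>Bugtraq ID</td><td>12345</td></tr>
      <tr><td>Class</td><td>Input Validation Error</td></tr>
    </table>
    HTML

    my $te = HTML::TableExtract->new( depth => 0, count => 0 );
    $te->parse($html);

    # Collect each row as "label: value" before printing,
    # so the same data could later be written to a text file.
    my @lines;
    for my $ts ( $te->tables ) {
        for my $row ( $ts->rows ) {
            push @lines, join ': ', map { defined $_ ? $_ : '' } @$row;
        }
    }
    print "$_\n" for @lines;

    Working against a fixed local fragment like this means each test run is instant and hits the network zero times; swapping in the cached files is then a one-line change.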