
Re^3: Making an array from a downloaded web page

by moklevat (Priest)
on Jan 16, 2007 at 16:34 UTC ( #594932 )

in reply to Re^2: Making an array from a downloaded web page
in thread Making an array from a downloaded web page

I am not an expert on SEC filings or the EDGAR database, but after a few minutes of poking around the README file and the docs directory, it looks like everything would still be simpler with FTP unless you are a screen-scraping wizard.

As I have just learned, each company that files with the SEC has a CIK number. In your example, the CIK number for IBM is 51143, and all of IBM's filings live in that directory. From the explanation in the README, it seems that prior to EDGAR 7.0 (starting in the year 2000) all of a company's filings were stored in one directory, but because amendments could overwrite existing documents, everything is now stored in subdirectories based on the accession number of each document. This may be why the information appears to be organized per-day. Fortunately, it looks like the SEC provides an index of filings for each quarter, by company name or by type of filing, so you don't have to slog through every subdirectory to find the information you need. The only benefit I see from the HTTP interface in your example is that your search focused on filings related to change of ownership. However, I gather that these correspond to a known subset of forms (4, 8-K), and that information is available in the index, so you could subset it yourself.

So, if it were me, I would probably move forward in two stages using Net::FTP: 1) grab the quarterly indices and subset the records by company and filing type to build a list of the files I want, and then 2) grab those files with Net::FTP again.
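The filtering step in stage 1 might look something like the sketch below. Fair warning: I'm guessing at the record layout of company.idx here (company name, form type, CIK, date filed, then the file path as the last whitespace-delimited field), and the sample records are made up, so check an actual index file before relying on this.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of stage 1's filtering step: given lines from a downloaded
# index, keep only records for the form types we care about and
# collect the file paths to fetch in stage 2.  The record layout
# assumed here is a guess -- verify against a real company.idx.
sub filter_index {
    my ($wanted, @lines) = @_;
    my @paths;
    for my $line (@lines) {
        # Assumed layout: company name (may contain spaces), form
        # type, CIK, date filed, then the file path as the last field.
        next unless $line =~ m{(\S+)\s+(\d+)\s+(\d{4}-\d{2}-\d{2})\s+(\S+)\s*$};
        my ($form, $path) = ($1, $4);
        push @paths, $path if $wanted->{$form};
    }
    return @paths;
}

# Made-up sample records, just to show the idea:
my @sample = (
    "ACME CORP        4     12345  2006-03-01  edgar/data/12345/0001.txt",
    "ACME CORP        10-K  12345  2006-03-02  edgar/data/12345/0002.txt",
);
my %wanted = map { $_ => 1 } qw(4 8-K);
print "$_\n" for filter_index( \%wanted, @sample );
# prints edgar/data/12345/0001.txt
```

The list this returns is exactly what stage 2 needs: relative paths you can hand to get() one at a time.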


Replies are listed 'Best First'.
Re^4: Making an array from a downloaded web page
by malomar66 (Acolyte) on Jan 18, 2007 at 05:27 UTC
    This will probably get me dinged in the reputation department but I have to ask.

    From what I can tell of the "get" function in Net::FTP it sends the results directly out to a file. It seems I can append the various files I will need to each other provided that I explicitly name the file and provide some kind of offset (0?). How might I be able to adapt this so that I can keep the information in a string and parse it, prior to making a master file of all the links?

      It's a reasonable question, but you still haven't posted any code, so you may indeed get dinged. In the Monastery, code begets code. If your original post had included some code, you probably would have gotten a lot more input from the monks and might have a working solution by now.

      On to your question.

      The documentation for the get() method in Net::FTP says that in get(REMOTE_FILE [, LOCAL_FILE [, WHERE ]]), LOCAL_FILE may be a filename or a filehandle. If you open() a filehandle for writing, you can write as many index files to it as you want and they will be concatenated in the order they were written. WHERE is optional, but you could use it to skip unnecessary header bytes at the start of an index file. You can also open() an "in memory" filehandle backed by a scalar, which is probably what you ultimately want. Here is a quick script that grabs the index files for all four quarters of two years and writes the concatenated indices to a file; I have also included a commented-out option to use a scalar as the filehandle instead.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Net::FTP;

      my $host         = "";
      my $username     = 'anonymous';
      my $password     = '';
      my $indexdir     = '/edgar/full-index';
      my @years        = qw/2005 2006/;
      my @quarters     = qw/QTR1 QTR2 QTR3 QTR4/;
      my $indexbyfirm  = 'company.idx';
      my $indexoutfile = "./complete_index";

      ## This opens an "in memory" filehandle backed by a scalar
      #open my $indexsave, '>', \my $pseudo_file
      #    or die "Couldn't open memory handle: $!";

      open my $indexsave, '>', $indexoutfile
          or die "Couldn't open filehandle: $!";

      my $ftp = Net::FTP->new( $host, Timeout => 30, Debug => 1 )
          or die "Couldn't connect: $@\n";

      $ftp->login( $username, $password )
          or die "Couldn't authenticate.\n";

      for my $year (@years) {
          for my $quarter (@quarters) {
              $ftp->cwd("$indexdir/$year/$quarter")
                  or die "Couldn't change directories: $!\n";
              $ftp->get( $indexbyfirm, $indexsave )
                  or die "Couldn't fetch $indexbyfirm: $!\n";
          }
      }

      ## You can work with the "in memory" file like any scalar
      # print $pseudo_file;

      $ftp->quit();
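And to answer the "keep it in a string" part of the question directly: the commented-out "in memory" filehandle in the script is the trick. Here is a minimal standalone sketch of just that idea, no FTP involved:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Minimal sketch of an "in memory" filehandle: anything printed to
# $fh lands in the scalar $pseudo_file instead of on disk, so the
# concatenated indices can be parsed as an ordinary string.
my $pseudo_file = '';
open my $fh, '>', \$pseudo_file
    or die "Couldn't open memory handle: $!";

print {$fh} "first index line\n";
print {$fh} "second index line\n";
close $fh;

# $pseudo_file now holds both lines; parse it like any other string.
my @lines = split /\n/, $pseudo_file;
print scalar(@lines), "\n";   # prints 2
```

Successive writes (such as one get() per quarter) simply append to the scalar, so you can parse the whole thing before deciding what, if anything, to write to a master file.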

Node Type: note [id://594932]