Limo has asked for the wisdom of the Perl Monks concerning the following question:

I know that I have sought your advice on this before, but I'm still kinda stuck with this problem. I have a program which takes a structured-format file and reports back data contained in specified fields. I am writing a program which will "join" the output of 2 such files. I have gathered from all of you that my best bet would be to use a relational database. So, armed with DBI, DBD::CSV, etc., and my new "Programming the Perl DBI" book, I am off to tackle this once again. First, let me demonstrate what I am trying to accomplish, e.g.:
exfields.pl file1.gz name,ipAdd,DestDev
which produces the following output:
"atlanta" "192.168.1.1" "vienna" "miami" "192.168.1.2" "dallas" "nyc4" "192.168.1.3" "boston"
As you can see, "name,ipAdd,DestDev" are names of structured fields within the gzipped file. Let's say I ran this:
exfields.pl file2.gz name,ifSpeed
which produces the following output:
"atlanta" "622000000" "miami" "155000000" "nyc4" "155000000"
My new program would "merge" the output of both of those files, producing:
"atlanta" "192.168.1.1" "vienna" "622000000" "miami" "192.168.1.2" "dallas" "155000000" "nyc4" "192.168.1.3" "boston" "155000000"
So, here's what I've got so far:
#!/usr/local/bin/perl -w
## merge.pl
use strict;

if ($ARGV[3] ne "") {
    my $file1 = $ARGV[0];
    my $list1 = join (",", $ARGV[1]);
    my $file2 = $ARGV[2];
    my $list2 = join (",", $ARGV[3]);
    my %hash_1 = process($file1, $list1);
    my %hash_2 = process($file2, $list2);
} else {
    print "merge.pl - Usage: file field(s) file field(s)\n";
}

sub process {
    my ($file, $list) = @_;
    my $script = "perl /home/limo/Perl/exfields.pl -e";
    my %hash;
    my $arg;
    for $arg (split /,/ => $list) {
        open(FILE, "$script $arg $file |") or die "System error: $!\n";
        while (<FILE>) {
            next if /^(#|none|unkno)/i;
            chomp;
            push @{ $hash{$arg} }, split;
        }
        close FILE;
    }
    return %hash;
}
I suppose that I have three questions:
1. It was suggested that I return a reference to a hash, thus conserving some memory. Would that just be, return /%hash?
2. In any case, what kind of print statement can I insert to test the contents of each hash, as returned?
3. I can't seem to find any information regarding reading data contained in a hash into DBI? Oh yeah, the reason that I am using a hash is that I would like the keys of the hash to become the column headers in the database table that I create.
Any direction/help, as always, is much appreciated!

Replies are listed 'Best First'.
Re: (dchetlin) Data in Hash - DBI
by dchetlin (Friar) on Oct 01, 2000 at 04:39 UTC
    This is a great question, definite ++. Clearly stated problem, concise code, just all around beautiful. I wish people posted questions like this on clpm.

    First, the answers to your questions:

    • It was suggested that I return a reference to a hash, thus conserving some memory. Would that just be, return /%hash?

    return \%hash. The backslash is the reference operator. And yes, you definitely should return a reference to a hash here.

    When you return a hash, you're not really returning a hash. You say return %hash and perl says "Ok, he wants to return a hash. Let's see, was this function called in list or scalar context? If it was list context, I'll just take all of the keys and values of the hash, push each of them onto the stack, and return. That way the function will return a list, and if the caller assigned something to the return of the function like %hash = function() then that hash will be populated by that list that was returned, just like %hash = ('a','b','c','d'). If the function was called in scalar context, I'll do something really bizarre and return a scalar that looks something like '3/8' which shows the number of used buckets and the total number of buckets in the hash. Because, obviously, people use that information all the time."

    As you can probably imagine, pushing every key and value of a hash onto the stack is quite inefficient. You've already built the hash once, you'd like to return that same hash, not make perl build you a whole new one. So you return a reference to the hash, and call the function like my $hash_ref1 = process($file1,$list1).
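    To make that concrete, here's a minimal sketch (the hash contents are made up for illustration):

        use strict;

        sub build {
            my %hash = ( atlanta => '192.168.1.1' );  # imagine the real work here
            return \%hash;                            # one scalar returned, not 2*N list items
        }

        my $hash_ref = build();
        print $hash_ref->{atlanta}, "\n";             # dereference a single key with ->
        my %copy = %$hash_ref;                        # or dereference the whole hash when needed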

    • In any case, what kind of print statement can I insert to test the contents of each hash, as returned?

    use Data::Dumper. It's the easiest way to take a look at the contents of any data structure. Simply `use' it at the top of your program, and then do print Dumper $hash_ref1; and Dumper will give you a nicely formatted look at the hash.
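    For example (the output shown in the comment is just the general shape you'd see, not real data):

        use Data::Dumper;

        my $hash_ref1 = process($file1, $list1);
        print Dumper $hash_ref1;
        # prints something shaped like:
        #   $VAR1 = {
        #             'name' => [ 'atlanta', 'miami', 'nyc4' ],
        #             ...
        #           };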

    • I can't seem to find any information regarding reading data contained in a hash into DBI? Oh yeah, the reason that I am using a hash is that I would like the keys of the hash to become the column headers in the database table that I create.

    The general answer to "How do I store a hash in a database" is "Use Storable". Storable does binary serialization of Perl structures. You'll serialize your hash, and then add that string (the Storable representation of your hash) to the database.
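    A minimal sketch of that approach; the table stash and its data column are made up, and $dbh is assumed to be an already-connected DBI handle (freeze and thaw come straight from Storable):

        use Storable qw(freeze thaw);

        my $frozen = freeze($hash_ref);    # binary string holding the whole structure
        my $sth = $dbh->prepare('INSERT INTO stash (data) VALUES (?)');
        $sth->execute($frozen);

        # later, after fetching that string back out of the database:
        my $restored = thaw($frozen);      # a hash reference again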

In terms of general advice, I have a few things to point out:

  • Your method of testing for the correct number of arguments is not the best way. Instead of making sure that $ARGV[3] isn't equal to the empty string, why not just test to make sure @ARGV has exactly four elements (or at least four, depending on how flexible you want to be), i.e. if (@ARGV == 4) { ... } else { print "Usage: ..." } (see the sketch after this list).
  • Your joins on the command-line arguments aren't doing anything useful. join operates on a list of things, putting them together in one string. You're only giving join the separator and one argument, so it's basically a no-op. It looks like that's OK, though, because you're expecting the arguments to be comma-separated already.
  • When you print a usage statement, it's often sensible to use $0 to reflect the way the script was actually called; that way, if you change the name of the script later, things won't get out of sync.
  • It's hard for me to tell what your eventual goal here is, so it's a little hard for me to advise on what to do next, but it seems to me that a couple of things should probably be changed. Firstly, you'll probably want to choose a more robust way of getting the functionality of exfields.pl than calling it using the magic open. Either set it up so it can be required, or copy the useful functions from it, or turn it into a module, etc. Secondly, I'm not convinced that the way you're structuring your hashes is going to make for easy or even possible merging. I'd restructure somewhere, but without having a better idea of what exfields does and what you want the stuff to look like in the DB, I can't suggest more than that.
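  A quick sketch of the first and third points together (process as in your code, but returning a reference as discussed above):

      if (@ARGV == 4) {
          my ($file1, $list1, $file2, $list2) = @ARGV;
          my $hash_ref1 = process($file1, $list1);
          my $hash_ref2 = process($file2, $list2);
      } else {
          print "Usage: $0 file fields file fields\n";
          exit 1;
      }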

-dlc

      Thank you for your kind words as well as your advice. I wouldn't even be able to write BAD code if it were not for you folks! Rather than post 800 lines of code that make up exfields.pl, I will post a synopsis below. Possibly, that will give you and others insight into my problem.
      "Extracts records (selectively if requested) from a NPA [my group's typical file format]or MAGMA [db schema; irrelevant here] table dumps, producing either the entire record or (optionally) specified fields in the given order. Parameters can be in any order. Output is to STDOUT, and can be piped into this program (for further selections), or into updated versions of exfields.pl or showtable.pl. The -s option is used to select records according to whether a regexp match exists within a -f specified field or within the entire record. The -e option is used, optionally, to specify the fieldlist and output field order. One major use of this program is to create a reduced table, whose output contains only selected rows and columns. Another is to transform table format or order."
      Now, here's an example of the gunzipped file headers:
      #DFD ' '
      #H SrcRtr ifIndex ifPRule ifDescr ifName ifType ifSpeed ifPhysAddress ifWscModPort ifWscPortName ifAlias ifWscPortVlan ifVlanState ifVlanName TPNativeVlan TPDynStatus TPEncapOpType ifCP ifStackLL ifStackUL cdpFlg cdpCacheCnt cdpCacheInfo ipN2MediaCnt ipN2MediaInfo atmMaxVpcs atmCfgVpcs atmMaxVccs atmCfgVccs atmInfo frActCnt frInACnt frInfo IpAddress IpMask IpSubNet ospfFlags ospfAreaInfo NumIpAddresses NumIpAddrUp CfgIpAddrUp CfgIpAddrStats NumLoopbacksUp CfgRtrPollUp CfgRtrLPollUp CfgStats CfgPoll CfgErrFlg CfgWarnFlg CfgIfMapType CfgIfSysID CfgPoP MonName CfgIfDescr CfgIfSpeed CfgIfIndex SPopRegion SPopType SPopUse SPopAllow SRtrRole DstExtent DPop DPopRegion DPopType DPopUse DPopAllow DPopCloud DstDev DstDevRole MediaDstInfo CfgIfType CfgIfServRole CfgIfPurpose CfgIfVpiA CfgIfDlciA CfgIfVpiZ CfgIfDlciZ CfgIfRemL2Dev CfgIfRemL2Port CfgIfVLAN FlowType RcIfFlags RcASN RcOspfs BgpLocalAS BgpExtnlAS RcBgpCfd RcBgpCfdPeers BgpRemASInfo OspfIfInfo RcPrtChan RcPrtChanList RcIfSpeed RcVpiVci RcAtmAalEP RcAtmPkbps RcAtmAkbps RcAtmBurst RcAtmPvcMapInfo RcFrDlci RcEncap IpOspfCost IpOspfPriority CldGroup CldName CldStatus CldRatio CldThresh CfgCid CirActive CirCIR CirSpeed CirOrdered CirOnLine CirOffLine ModBy ModOn RcText
      #F %16s %3d %5s %16s %s %3d %11.0f %17s %5s %11s %s %s %s %s %s %s %s %s %s %s %s %2d %s %2d %s %2s %2s %2s %2s %s %2s %2s %s %15s %15s %15s %3s %s %2d %1d %1d %1d %1d %1d %1d %1s %1s %5s %5s %3s %6s %-10s %-36s %16s %11.0f %3s %4s %2s %2s %7s %2s %-3s %-10s %4s %2s %2s %7s %20s %20s %2s %s %4s %4s %4s %4s %4s %4s %4s %15s %5s %5s %4s %4s %5s %5s %5s %5s %2s %s %s %s %s %s %11s %7s %12s %6s %6s %6s %s %5s %5s %5s %5s %20s %20s %10s %4s %4s %20s %5s %11s %11s %8s %8s %8s %10s %18s %s
      The output of exfields looks EXACTLY as shown in my original post, depending on which fields I want to look at, of course. Basically, what I envision for the database is for it to contain my selected headers from file1 and file2 as column names; each column containing the corresponding data. Column 1 in the db would contain the primary key, which in my case, is a router name. Essentially, that is what the 2 files will have in common; different sets of data related to each router. What I need to do is grab a subset of router data from file1, a subset of router data from file2, and merge both subsets into file3, which will be <STDOUT>. Actually, here's a sample output of exfields:
      ./exfields.pl -e MonName,ifSpeed,DstDev pptop.20000921.gz
      nyc1-br2/nyc4-br2:1.t3 45045000 nyc4-br2
      nyc1-br2/nyc4-br2:2.t3 45045000 nyc4-br2
      nyc1-br2/nyc4-br2:3.t3 45045000 nyc4-br2
      nyc1-cr1/nyc4-br1.t3 44210000 nyc4-br1
      nyc1-br1/nyc2-cr1.t3 45045000 nyc2-cr1
      nyc4-br2/cleveland1-br1.2.t3 45045000 cleveland1-br1
      nyc4-br2/cleveland1-br2.t3 44210000 cleveland1-br2
      nyc4-br2/nyc1-br2:1.t3 45045000 nyc1-br2
      nyc4-br2/nyc1-br2:2.t3 45045000 nyc1-br2
      nyc4-br2/nyc1-br2:3.t3 45045000 nyc1-br2
      nyc4-nbr2/nycmny1-br1.oc12 622000000 nycmny1-br1
      Again, it is the field "MonName" that both file1 and file2 will have in common. I hope that some of this makes my problem a bit clearer to those willing to help out.
(ar0n) Re: Data in Hash - DBI
by ar0n (Priest) on Oct 01, 2000 at 03:41 UTC
    1. return \%hash (note that it's a backslash, not a forward slash)

    2. I'm not sure what you mean by "testing the contents", but Data::Dumper does a nice job of printing out the contents for you.

    3. I admit to not having entirely read perldoc DBI (ar0n ducks, thusly avoiding any DBI manuals thrown at him...) so I'm not aware of any such function. But this should work:
    my $place_holders = "?," x keys %hash;
    chop $place_holders;
    my $sth = $dbi->prepare("INSERT INTO my_lovely_table ("
                            . join(",", sort keys %hash)
                            . ") VALUES ($place_holders)");
    $sth->execute( @hash{ sort keys %hash } );

    Note though, that the keys in your hashes in your code refer to anonymous arrays (push @{ $hash{$arg} }, split;), so the above won't work with those hashes.
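    One hypothetical way around that would be to flatten each array ref into a single string before executing, e.g.:

    my %flat = map { $_ => join(" ", @{ $hash{$_} }) } keys %hash;
    $sth->execute( @flat{ sort keys %hash } );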

    [ar0n]

      Thanks for your response. Sorry. What I mean by "testing contents" is, being a new programmer, I often slide print statements into my code as I build it, to make sure that variables/data structures are getting populated the way I want them to. So, I'm trying to do the same before I tackle passing the output of the program to DBI (if possible, as you noted). Also, I've got to read some more on DBD::CSV; I'm not sure that it supports join(). UPDATE: Before the --'s start to really flow, what I meant was "join tables", which is a database operation. I don't believe that DBD::CSV has that functionality.
Re: Data in Hash - DBI
by mattr (Curate) on Oct 01, 2000 at 09:51 UTC
    Whoa. Hope your original table didn't look like this..

    Magma language

      Different MAGMA. Ours is a database schema, not at all related to my problem.