Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm a newbie to the Perl language and have only basic programming knowledge.

I want to know if this possible solution to the following problem might work:

First, I have a list of account numbers in a file called accts.txt that contains 10 different account numbers (1 per line)

Secondly, I have my data files in another directory, from which I want to retrieve information for the set of account numbers in accts.txt.

What I am trying to do here is to create a smaller subset of my data for only the account numbers I want to pull data for from each of the files.

At first, I wrote a short script to grep each line of every file in the directory for an account that exists in accts.txt.

But the challenge here is that the account numbers may be listed in multiple places in a single line. I only want to search in the account number field. However, the account number field may be the second field in one file and the fifth field in another. (The data files are .txt files with | as the delimiter.)

For example, I have a set of 10 files. For 7 out of 10 files, the account number is the first 9 characters in the line (also known as the first field). I am using the substr function to check if the account number from accts.txt matches the first 9 characters in each line of each of these 7 files. But I am not sure how to handle this for the other 3 files, where the account number is located in characters 21-29, 11-19, and 12-20 respectively.

I would use a foreach loop to read through the files, but can I use if conditions within foreach to exclude the other 3 files?

Any suggestions would be greatly appreciated! Thanks!

Re: retrieving information from a set of files
by swampyankee (Parson) on Oct 20, 2006 at 22:22 UTC

    Perl is quite close to ideal for this sort of work.

    My first tendency would be to use split; this means I get to avoid playing games with substr (if I had to do that, I wouldn't see much point in using Perl 8-)).

    substr will work. The only issue, then, is how to determine which start/length pair to use for each file. This is easy (pseudo-code, not real code follows!):

    my %layout = (file1 => 0, file2 => 0, ... file8 => 20, file9 => 10, file10 => 11);
    # note that Perl starts counting at zero, so 1 had to be subtracted from the starting positions.

    # now, for each file, you could yank the acct number out like this:
    $acct = substr($line, $layout{$file_name}, 9);
    # where $file_name corresponds to file1, file2, etc., in %layout as appropriate

    Now, if your file names are not fixed, but use a predictable naming convention, the problem is slightly different, but still easy. Magic cookies would be similarly simple. If the files have column headers, you could parse that to find which field has account numbers. This is slightly trickier.
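
If the files did have a header row, locating the account column could look roughly like the sketch below. The thread does not say whether headers exist, so this is purely illustrative: the column name ACCT_NO, the other header names, and the helper column_index are all invented for the example.

```perl
use strict;
use warnings;

# Return the zero-based index of $column in a pipe-delimited header line,
# or undef if the header has no such column.
sub column_index {
    my ($header, $column) = @_;
    chomp $header;
    my @names = split /\|/, $header, -1;
    for my $i (0 .. $#names) {
        return $i if $names[$i] eq $column;
    }
    return;
}

# 'ACCT_NO' and the other column names are hypothetical.
my $idx = column_index("CODE|FLAGS|ACCT_NO\n", 'ACCT_NO');
print "account number is field $idx\n" if defined $idx;
```

The returned index can then be used with split on every following line of that file.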

    So, your code would have this sort of logic:

    • Read list of account numbers. It's small, so store it in an array (open(my $acct, "<",$account_to_check);@acct_num = <$acct>;)
    • Open one info file (there are ten...). Use the %layout hash to find where the account number starts.
    • Using substr and the %layout table, yank out the account number. If it's on your list (grep is perfect for this), perform your extract, and either print or store the information. If it's not, use next to skip to the next record.
    • Repeat as needed
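
The steps above can be sketched as follows. This is a minimal, untested-against-real-data sketch: the file names in %layout and the helper extract_matching are assumptions for illustration, and a hash lookup is swapped in for the array-plus-grep check described in the list. It works on in-memory lines so the logic is easy to try; real code would read the lines from each file.

```perl
use strict;
use warnings;

# Hypothetical layout table: file name => zero-based offset of the
# 9-character account number. Real file names will differ.
my %layout = (
    'file1.txt' => 0,
    'file8.txt' => 20,
    'file9.txt' => 10,
);

# Return the lines whose account field (per %layout) is a wanted account.
# $wanted is a hash ref keyed by account number; $lines is an array ref.
sub extract_matching {
    my ($file_name, $wanted, $lines) = @_;
    my $offset = $layout{$file_name};
    die "No layout entry for $file_name\n" unless defined $offset;

    my @hits;
    for my $line (@$lines) {
        next if length($line) < $offset + 9;   # too short to hold an account
        my $acct = substr($line, $offset, 9);
        push @hits, $line if exists $wanted->{$acct};
    }
    return @hits;
}

# In-memory demonstration with made-up lines.
my %wanted = map { $_ => 1 } qw(123456789 122456789);
my @lines  = ('123456789|234abc|00011', '999999999|234abc|00011');
print "$_\n" for extract_matching('file1.txt', \%wanted, \@lines);
```

Dying on a missing %layout entry is one possible choice; skipping unknown files with a warning would work just as well.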

    Again, my preference is generally to use split for simple delimited files; if you've got fields which include the pipe symbol (A|B|C|"this has a | pipe symbol but is one field"), you'll be better served by using a module, such as Text::CSV (I'm not endorsing that one; the only delimited files I've dealt with, I've controlled, so I knew, with certainty, that I wouldn't have fields where the delimiter should be ignored).
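
For comparison, the split approach might look like the sketch below. It assumes no field contains a literal pipe, and the helper name field_at is made up for the example. The sample line is in the style of the data posted in this thread, with the account number as the third field.

```perl
use strict;
use warnings;

# Return the zero-based field $index from a pipe-delimited $line.
# Assumes no field contains a literal "|".
sub field_at {
    my ($line, $index) = @_;
    my @fields = split /\|/, $line, -1;   # -1 keeps trailing empty fields
    return $fields[$index];
}

my %wanted = map { $_ => 1 } qw(123456789 122456789);

# Account number is the third field (index 2) in this sample line.
my $sample = '234abc|00011|123456789';
my $acct   = field_at($sample, 2);
print "$sample\n" if defined $acct && exists $wanted{$acct};
```

Lines read from a file should be chomped before being split, otherwise the last field carries the newline and will not match the account list.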

    emc

    At that time [1909] the chief engineer was almost always the chief test pilot as well. That had the fortunate result of eliminating poor engineering early in aviation.

    —Igor Sikorsky, reported in AOPA Pilot magazine February 2003.
      Will this work?
      $accts = accts_file.txt;
      $dirtoget = "/my/data/dir/with/files";
      opendir(FILEDIR, $dirtoget) || die("Cannot open directory");
      @thefiles = grep -T, <$sample_dirtoget/*>;
      closedir(FILEDIR);
      my %layout = (file1 => 0, file2 => 0, file3 => 0, file4 => 0, file5 => 0, file6 => 0, file6 => 0, file8 => 20, file9 ->10, file10 => 11);
      foreach $file (@thefiles) {
          $sample_filename = substr($file, 57);
          open(FILE, $file) || die ("Cannot open $file:$!");
          while ($line = <FILE>){
              chomp $line;
              $acct = substr($line, $layout{$file}, 9);
              `grep $acct $accts_file.txt >> NEW_$sample_filename`;
          }
          close(FILE);
      }
        Have you tried it? Did it work? If not, how did it fail?

        There are some obvious problems, most of which would be revealed to you if you add "use strict;" and "use warnings;" at the top of the script: no quotes around "accts_file.txt", "$accts" is used only once,  grep -T,<$sample_dirtoget/*> should have curlies around -T and no comma, and you didn't assign a value to $sample_dirtoget. You do opendir, but then you use a glob instead of readdir (so opendir was unnecessary). There's probably more stuff like that, but you get the idea...

        Doing a bunch of back-ticked grep commands inside your while loop isn't such a good solution, especially since you are not doing any error checking on those commands. In fact, I think you've lost your train of thought there. This probably is not really doing what you set out to do.

        And I'm not quite sure I understand what you're trying to do with the %layout hash. Are those the actual file names you are using as hash keys? What if there's a file whose name doesn't match one of those keys? (Why get file names from a glob or readdir if you have the names in a hash?)

        If the account number field is always nine digits (bounded by non-digit characters), is it the case that other fields in any given file would never contain exactly nine digits (bounded by non-digit characters)? If so, you could try something like this:

        use strict;
        use warnings;

        # get the list of account numbers we're looking for:
        my %target_acc;
        my $target_file = "accts_file.txt";
        open( TARGS, "<", $target_file ) or die "$target_file: $!";
        while (<TARGS>) {
            chomp;
            $target_acc{$_} = undef;  # keep target account #s as hash keys
        }
        close TARGS;

        # get the list of files we want to search over:
        my $datadir = "/my/data/dir/with/files";
        opendir( DIR, $datadir ) or die "$datadir: $!";
        my @files = grep {-T "$datadir/$_"} readdir DIR;
        closedir DIR;

        # open each file to be searched, output found lines to
        # a corresponding NEW file in the current directory
        for my $file ( @files ) {
            open( OUT, ">", "NEW_$file" ) or die "NEW_$file: $!";
            if ( open( IN, "<", "$datadir/$file" )) {
                while (<IN>) {
                    print OUT $_
                        if (( /^(\d{9})\|/ or /\|(\d{9})[|\r\n]/ )
                            and exists( $target_acc{$1} ));
                }
                close IN;
            }
            close OUT;
        }

        That hasn't been tested, but it does compile properly with strict and warnings enabled. Instead of using substr or split, I'm just trying a regex on each line of each file being searched, to match a 9-digit string either at the beginning of the line and followed by "|", or anywhere in the line, preceded by "|" and followed by "|" or newline. Then I see whether the 9-digit string that matched happens to be one of the numbers of interest, simply by checking the %target_acc hash.
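
To see how those two patterns behave, here is a small sketch that applies them to lines in the style of the sample data posted in this thread. The helper name find_acct is made up for the example. One thing it shows: a last line with no trailing newline and the account number in the final field is not matched by the second pattern.

```perl
use strict;
use warnings;

# Return the 9-digit field found by the two patterns, or undef if none.
sub find_acct {
    my ($line) = @_;
    return $1 if $line =~ /^(\d{9})\|/;            # first field
    return $1 if $line =~ /\|(\d{9})[|\r\n]/;      # any later field
    return;
}

for my $line ("123456789|234abc|00011\n",
              "234abc|00011|123456789\n",
              "234abc|00011|12345\n") {
    my $acct = find_acct($line);
    print defined $acct ? "found $acct\n" : "no 9-digit field\n";
}
```
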
Re: retrieving information from a set of files
by duckyd (Hermit) on Oct 20, 2006 at 21:23 UTC
    How do you know where the account number is in a given file? If you have a header row, then the solution is easy, so I assume you don't. If that's the case, you will have to determine some way to tell where the account number is in a given file. For example, it sounds like your fields are fixed length. Are you guaranteed that the account number field will be the only field that is 9 characters long?

    If you have to manually inspect each file to find the account number field, then I am not sure how you would expect to solve the problem programmatically.

      The account number is a fixed 9 digits that is always the first 9 digits in 7 of the files. For the remaining files also it is always in a fixed location on each line of the file.
        I forgot to note 1 additional comment: the number of fields may vary according to the file. Although my sample data illustrates only 3 columns, this is not true for the files I have.
      I've created some sample data to illustrate what I'm speaking of. I hope this helps.
      One of the 3 exception files looks like this:
      234abc|00011|123456789
      224abc|01011|122456789
      244abc|10011|123356789
      254abc|11011|123446789

      Other 7 files are in this format:
      123456789|234abc|00011
      123456789|234abc|00011
      123456789|234abc|00011
      122456789|224abc|01011
      123356789|244abc|10011
      123356789|244abc|10011
      123446789|254abc|11011

      accts.txt
      123456789
      122456789
      123356789
      123446789