Perl is quite close to ideal for this sort of work.
My first tendency would be to use split; this means I get to avoid playing games with substr (if I had to do that, I wouldn't see much point in using Perl 8-)).
substr will work. The only issue, then, is how to determine which start/length pair to use for each file. This is easy (pseudo-code, not real code follows!):
my %layout = (file1 => 0, file2 => 0, ... file8 => 20,
              file9 => 10, file10 => 11);
# Note that Perl starts counting at zero, so 1 had to be subtracted from the starting positions.
# Now, for each file, you could yank the acct number out like this:
$acct = substr($line, $layout{$file_name}, 9);
# where $file_name corresponds to file1, file2, etc., in %layout as appropriate.
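As a runnable version of the pseudo-code above, here is a minimal sketch that uses lines shaped like the sample data posted elsewhere in this thread. The keys acct_first and acct_last and the sub name acct_from_line are made up for illustration; the offset 13 is simply where the account number starts in that one sample line, not a value taken from the real files.

```perl
use strict;
use warnings;

# Hypothetical layout table: 0-based offset of the 9-character account
# number for each kind of file. The keys are made-up labels, not real names.
my %layout = (
    acct_first => 0,     # e.g. 123456789|234abc|00011
    acct_last  => 13,    # e.g. 234abc|00011|123456789
);

# Return the 9 characters starting at the offset registered for $kind.
sub acct_from_line {
    my ($line, $kind) = @_;
    die "no layout entry for '$kind'\n" unless exists $layout{$kind};
    return substr($line, $layout{$kind}, 9);
}

print acct_from_line('123456789|234abc|00011', 'acct_first'), "\n";
print acct_from_line('234abc|00011|123456789', 'acct_last'),  "\n";
```

Dying on an unknown key is a design choice for the sketch: a missing %layout entry would otherwise silently read from offset 0.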
Now, if your file names are not fixed, but use a predictable naming convention, the problem is slightly different, but still easy. Magic cookies would be similarly simple. If the files have column headers, you could parse that to find which field has account numbers. This is slightly trickier.
So, your code would have this sort of logic:
- Read the list of account numbers. It's small, so store it in an array (open(my $acct, "<", $account_to_check) or die "$account_to_check: $!"; chomp(@acct_num = <$acct>);)
- Open one info file (there are ten...). Use the %layout hash to find where the account number starts.
- Using substr and the %layout table, yank out the account number. If it's on your list (grep is perfect for this), perform your extract, and either print or store the information. If it's not, use next to skip to the next record.
- Repeat as needed.
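The steps above can be sketched roughly as follows. This works on in-memory arrays so that it is self-contained; in real use the records would come from the info files and the account list from the accounts file. The sub name matching_records and the record containing 999999999 are illustrative assumptions.

```perl
use strict;
use warnings;

# Keep only the records whose account number (9 characters at $offset)
# appears in @$accts. grep in boolean context is true if anything matched.
sub matching_records {
    my ($records, $offset, $accts) = @_;
    my @found;
    for my $line (@$records) {
        next if length($line) < $offset + 9;         # too short to hold an account number
        my $acct = substr($line, $offset, 9);
        next unless grep { $_ eq $acct } @$accts;    # not on the list: skip this record
        push @found, $line;
    }
    return @found;
}

my @accts   = qw(123456789 122456789);
my @records = (
    '123456789|234abc|00011',
    '999999999|234abc|00011',    # made-up record that is not on the list
    '122456789|224abc|01011',
);
print "$_\n" for matching_records(\@records, 0, \@accts);
```

The comparison uses eq, so the account list must be chomped first; an entry that still carries its trailing newline would never match.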
Again, my preference is generally to use split for simple delimited files; if you've got fields which include the pipe symbol (A|B|C|"this has a | pipe symbol but is one field"), you'll be better served by using a module, such as Text::CSV (I'm not endorsing that one; the only delimited files I've dealt with, I've controlled, so I knew, with certainty, that I wouldn't have fields where the delimiter should be ignored).
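To illustrate the split approach and the quoted-delimiter pitfall just described, here is a small sketch. The sub name field_at is hypothetical, and the first input line is borrowed from the sample data posted in this thread.

```perl
use strict;
use warnings;

# Return the field at 0-based position $index of a pipe-delimited line.
# The pipe is escaped in the pattern because a bare | means alternation.
sub field_at {
    my ($line, $index) = @_;
    my @fields = split /\|/, $line;
    return $fields[$index];
}

print field_at('234abc|00011|123456789', 2), "\n";

# A quoted field that contains the delimiter is cut in two by a plain split,
# which is the case where a CSV-aware module is the better tool.
my @naive = split /\|/, 'A|B|C|"this has a | pipe symbol but is one field"';
print scalar(@naive), " fields\n";
```

The second call reports one field more than the four the data logically contains, because split knows nothing about quoting.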
emc
At that time [1909] the chief engineer was almost always the chief test pilot as well. That had the fortunate result of eliminating poor engineering early in aviation.
—Igor Sikorsky, reported in AOPA Pilot magazine February 2003.
$accts = accts_file.txt;
$dirtoget = "/my/data/dir/with/files";
opendir(FILEDIR, $dirtoget) || die("Cannot open directory");
@thefiles = grep -T, <$sample_dirtoget/*>;
closedir(FILEDIR);
my %layout = (file1 => 0, file2 => 0, file3 => 0, file4 => 0, file5 => 0, file6 => 0, file6 => 0, file8 => 20,
file9 ->10, file10 => 11);
foreach $file (@thefiles) {
    $sample_filename = substr($file, 57);
    open(FILE, $file) || die ("Cannot open $file:$!");
    while ($line = <FILE>){
        chomp $line;
        $acct = substr($line, $layout{$file}, 9);
        `grep $acct $accts_file.txt >> NEW_$sample_filename`;
    }
    close(FILE);
}
Have you tried it? Did it work? If not, how did it fail?
There are some obvious problems, most of which would be revealed to you if you add "use strict;" and "use warnings;" at the top of the script: no quotes around "accts_file.txt", "$accts" is used only once, grep -T,<$sample_dirtoget/*> should have curlies around -T and no comma, and you didn't assign a value to $sample_dirtoget. You do opendir, but then you use a glob instead of readdir (so opendir was unnecessary). There's probably more stuff like that, but you get the idea...
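For the grep/glob point in particular, a corrected form might look like the sketch below. The sub name text_files_in is hypothetical, and the -f check is an addition of mine to make sure directories are skipped.

```perl
use strict;
use warnings;

# Return the plain files directly under $dir that pass Perl's -T
# (looks like a text file) test. glob already returns each path with the
# directory prefix, so no opendir/readdir is needed.
sub text_files_in {
    my ($dir) = @_;
    return grep { -f $_ && -T $_ } glob("$dir/*");
}

# Directory comes from the command line, defaulting to the current one.
my $dir = @ARGV ? $ARGV[0] : '.';
print "$_\n" for text_files_in($dir);
```

Note that glob splits its pattern on whitespace, so this sketch assumes a directory path without spaces.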
Doing a bunch of back-ticked grep commands inside your while loop isn't such a good solution, especially since you are not doing any error checking on those commands. In fact, I think you've lost your train of thought there. This probably is not really doing what you set out to do.
And I'm not quite sure I understand what you're trying to do with the %layout hash. Are those the actual file names you are using as hash keys? What if there's a file whose name doesn't match one of those keys? (Why get file names from a glob or readdir if you have the names in a hash?)
If the account number field is always nine digits (bounded by non-digit characters), is it the case that other fields in any given file would never contain exactly nine digits (bounded by non-digit characters)? If so, you could try something like this:
use strict;
use warnings;
# get the list of account numbers we're looking for:
my %target_acc;
my $target_file = "accts_file.txt";
open( TARGS, "<", $target_file ) or die "$target_file: $!";
while (<TARGS>) {
    chomp;
    $target_acc{$_} = undef;  # keep target account #s as hash keys
}
close TARGS;
# get the list of files we want to search over:
my $datadir = "/my/data/dir/with/files";
opendir( DIR, $datadir ) or die "$datadir: $!";
my @files = grep {-T "$datadir/$_"} readdir DIR;
closedir DIR;
# open each file to be searched, output found lines to
# a corresponding NEW file in the current directory
for my $file ( @files ) {
    open( OUT, ">", "NEW_$file" ) or die "NEW_$file: $!";
    if ( open( IN, "<", "$datadir/$file" )) {
        while (<IN>) {
            # write matching lines to the NEW file rather than STDOUT
            print OUT $_
                if (( /^(\d{9})\|/ or /\|(\d{9})[|\r\n]/ )
                    and exists( $target_acc{$1} ));
        }
        close IN;
    }
    close OUT;
}
That hasn't been tested, but it does compile properly with strict and warnings enabled. Instead of using substr or split, I'm just trying a regex on each line of each file being searched, to match a 9-digit string either at the beginning of the line and followed by "|", or anywhere in the line, preceded by "|" and followed by "|" or a line ending. Then I see whether the 9-digit string that matched happens to be one of the numbers of interest, simply by checking the %target_acc hash.
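To see the two patterns in isolation, here is a small sketch that applies them to lines shaped like the sample data posted in this thread. The sub name nine_digit_field is hypothetical.

```perl
use strict;
use warnings;

# Return the 9-digit field found by the two patterns described above,
# or undef (in scalar context) when neither pattern matches.
sub nine_digit_field {
    my ($line) = @_;
    return $1 if $line =~ /^(\d{9})\|/ or $line =~ /\|(\d{9})[|\r\n]/;
    return;
}

for my $line ("123456789|234abc|00011\n",
              "234abc|00011|123456789\n",
              "234abc|00011\n") {
    my $acct = nine_digit_field($line);
    print defined $acct ? "found $acct\n" : "no account number\n";
}
```

One caveat: the second pattern needs a character after the digits, so an account number in the last field of a final line that has no trailing newline would be missed.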
How do you know where the account number is in a given file? If you have a header row, then the solution is easy, so I assume you don't. If that's the case, you will have to determine some way to tell where the account number is in a given file. For example, it sounds like your fields are fixed length. Are you guaranteed that the account number field will be the only field that is 9 characters long?
If you have to manually inspect each file to find the account number field, then I am not sure how you would expect to solve the problem programmatically.
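For the header-row case mentioned above, a sketch of the lookup could be as simple as the following. The header line, the column name acct, and the sub name column_index are all invented for illustration; nothing in this thread says what a real header would contain.

```perl
use strict;
use warnings;

# Return the 0-based position of the column named $name in a
# pipe-delimited header line, or -1 if the header has no such column.
sub column_index {
    my ($header, $name) = @_;
    chomp $header;
    my @names = split /\|/, $header;
    for my $i (0 .. $#names) {
        return $i if $names[$i] eq $name;
    }
    return -1;
}

# Hypothetical header and column name, for illustration only.
my $idx    = column_index("branch|code|acct\n", 'acct');
my @fields = split /\|/, '234abc|00011|123456789';
print "$fields[$idx]\n" if $idx >= 0;
```

Once the index is known from the first line of a file, every later line can be split and the same position read.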
The account number is a fixed 9 digits and is always the first 9 characters of each line in 7 of the files. In the remaining files it is also always at a fixed location on each line.
I forgot to note one additional point: the number of fields may vary from file to file. Although my sample data shows only 3 columns, that is not true of the files I have.
I've created some sample data to illustrate what I'm speaking of. I hope this helps.
One of the 3 exception files looks like this:
234abc|00011|123456789
224abc|01011|122456789
244abc|10011|123356789
254abc|11011|123446789
The other 7 files are in this format:
123456789|234abc|00011
123456789|234abc|00011
123456789|234abc|00011
122456789|224abc|01011
123356789|244abc|10011
123356789|244abc|10011
123446789|254abc|11011
accts.txt:
123456789
122456789
123356789
123446789