Re: retrieving information from a set of files

Perl is quite close to ideal for this sort of work.

My first tendency would be to use split; this means I get to avoid playing games with substr (if I had to do that, I wouldn't see much point in using Perl 8-)).

substr will work. The only issue, then, is how to determine which start/length pair to use for each file. This is easy (pseudo-code, not real code follows!):

my %layout = (file1 => 0, file2 => 0, ... file8 => 20,
              file9 => 10, file10 => 11);

# note that Perl starts counting at zero, so 1 had to be subtracted from
# the starting positions.

# now, for each file, you could yank the acct number out like this:

$acct = substr($line, $layout{$file_name}, 9);

# where $file_name corresponds to file1, file2, etc., in %layout as appropriate

Now, if your file names are not fixed, but use a predictable naming convention, the problem is slightly different, but still easy. Magic cookies would be similarly simple. If the files have column headers, you could parse that to find which field has account numbers. This is slightly trickier.
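For the column-header case, a minimal sketch might look like the following. It assumes the header line is pipe-delimited like the data; the column name ACCT_NUM, the helper find_column, and the sample lines are all made up for illustration.

```perl
use strict;
use warnings;

# Return the zero-based index of the column whose header matches $name
# (case-insensitively), or undef if there is no such column.
# Assumption: the header line is pipe-delimited, like the data lines.
sub find_column {
    my ($header_line, $name) = @_;
    chomp $header_line;
    my @headers = split /\|/, $header_line;
    for my $i (0 .. $#headers) {
        return $i if lc($headers[$i]) eq lc($name);
    }
    return;    # not found
}

# Hypothetical header and data lines, for illustration only.
my $header = "NAME|ACCT_NUM|BALANCE\n";
my $record = "Smith|123456789|42.00\n";

my $col = find_column($header, 'ACCT_NUM');
if (defined $col) {
    chomp $record;
    my $acct = (split /\|/, $record)[$col];
    print "account number: $acct\n";
}
```

With fixed-width files the same idea applies, except that you would record the header's starting position (for example with index) rather than its field number.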

So, your code would have this sort of logic:

Read list of account numbers. It's small, so store it in an array (open(my $acct, "<",$account_to_check);@acct_num = <$acct>;)
Open one info file (there are ten...). Use the %layout hash to find where the account number starts.
Using substr and the %layout table, yank out the account number. If it's on your list (grep is perfect for this), perform your extract, and either print or store the information. If it's not, use next to skip to the next record.
Repeat as needed
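The steps above could be sketched roughly as follows. This is untested against real data and makes a few assumptions: the account field is 9 characters wide (as in the substr call earlier), the sample data is invented, and a hash lookup is used in place of grep over an array, purely because it keeps the membership check short. The helper names load_accounts and matching_records are mine.

```perl
use strict;
use warnings;

# Offsets as in the layout table (already zero-based); two entries shown.
my %layout = (file1 => 0, file8 => 20);

# Build a lookup hash from the list of wanted account numbers.
sub load_accounts {
    my ($fh) = @_;
    my %wanted;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;    # skip blank lines
        $wanted{$line} = 1;
    }
    return \%wanted;
}

# Return the records whose 9-character account field is in %$wanted.
sub matching_records {
    my ($fh, $offset, $wanted) = @_;
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        next if length($line) < $offset + 9;    # too short to hold the field
        my $acct = substr($line, $offset, 9);
        next unless exists $wanted->{$acct};
        push @hits, $line;
    }
    return @hits;
}

# In-memory stand-ins for the real files (made-up data).
my $acct_list = "123456789\n987654321\n";
my $file8     = ("x" x 20) . "123456789 first record\n"
              . ("y" x 20) . "555555555 second record\n";

open my $acct_fh, '<', \$acct_list or die "account list: $!";
my $wanted = load_accounts($acct_fh);
close $acct_fh;

open my $data_fh, '<', \$file8 or die "file8: $!";
print "$_\n" for matching_records($data_fh, $layout{file8}, $wanted);
close $data_fh;
```

For real use, replace the in-memory scalars with the actual file names and loop over the keys of %layout.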

Again, my preference is generally to use split for simple delimited files; if you've got fields which include the pipe symbol (A|B|C|"this has a | pipe symbol but is one field"), you'll be better served by using a module, such as Text::CSV (I'm not endorsing that one; the only delimited files I've dealt with, I've controlled, so I knew, with certainty, that I wouldn't have fields where the delimiter should be ignored).
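To make the caveat concrete, here is a small sketch of what plain split does with each kind of record. The helper name split_record is made up; the second input is the quoted example from the paragraph above.

```perl
use strict;
use warnings;

# Split a simple pipe-delimited record into fields. The -1 limit keeps
# trailing empty fields, which split would otherwise drop.
sub split_record {
    my ($line) = @_;
    chomp $line;
    return split /\|/, $line, -1;
}

my @simple = split_record("A|B|C|D\n");
print scalar(@simple), " fields\n";    # 4 fields

# The pipe inside the quotes is treated as a delimiter too,
# so the last field is cut in two.
my @quoted = split_record(qq{A|B|C|"this has a | pipe symbol but is one field"\n});
print scalar(@quoted), " pieces, where 4 fields were intended\n";    # 5 pieces
```

If your data can look like the second record, that is the point at which a quote-aware parser earns its keep.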

emc

At that time [1909] the chief engineer was almost always the chief test pilot as well. That had the fortunate result of eliminating poor engineering early in aviation.

—Igor Sikorsky, reported in AOPA Pilot magazine February 2003.

Re^2: retrieving information from a set of files by Anonymous Monk on Oct 21, 2006 at 03:24 UTC
Will this work?

$accts = accts_file.txt;
$dirtoget = "/my/data/dir/with/files";
opendir(FILEDIR, $dirtoget) || die("Cannot open directory");
@thefiles = grep -T, <$sample_dirtoget/*>;
closedir(FILEDIR);

my %layout = (file1 => 0, file2 => 0, file3 => 0, file4 => 0, file5 => 0,
              file6 => 0, file6 => 0, file8 => 20, file9 ->10, file10 => 11);

foreach $file (@thefiles) {
    $sample_filename = substr($file, 57);
    open(FILE, $file) || die ("Cannot open $file:$!");
    while ($line = <FILE>){
        chomp $line;
        $acct = substr($line, $layout{$file}, 9);
        `grep $acct $accts_file.txt >> NEW_$sample_filename`;
    }
    close(FILE);
}
Re^3: retrieving information from a set of files by graff (Chancellor) on Oct 21, 2006 at 05:32 UTC
Have you tried it? Did it work? If not, how did it fail?

There are some obvious problems, most of which would be revealed to you if you add "use strict;" and "use warnings;" at the top of the script: no quotes around "accts_file.txt", "$accts" is used only once, `grep -T,<$sample_dirtoget/*>` should have curlies around -T and no comma, and you didn't assign a value to $sample_dirtoget. You do opendir, but then you use a glob instead of readdir (so opendir was unnecessary). There's probably more stuff like that, but you get the idea...

Doing a bunch of back-ticked grep commands inside your while loop isn't such a good solution, especially since you are not doing any error checking on those commands. In fact, I think you've lost your train of thought there. This probably is not really doing what you set out to do.

And I'm not quite sure I understand what you're trying to do with the %layout hash. Are those the actual file names you are using as hash keys? What if there's a file whose name doesn't match one of those keys? (Why get file names from a glob or readdir if you have the names in a hash?)

If the account number field is always nine digits (bounded by non-digit characters), is it the case that other fields in any given file would never contain exactly nine digits (bounded by non-digit characters)?
If so, you could try something like this:

use strict;
use warnings;

# get the list of account numbers we're looking for:
my %target_acc;
my $target_file = "accts_file.txt";
open( TARGS, "<", $target_file ) or die "$target_file: $!";
while (<TARGS>) {
    chomp;
    $target_acc{$_} = undef;  # keep target account #'s as hash keys
}
close TARGS;

# get the list of files we want to search over:
my $datadir = "/my/data/dir/with/files";
opendir( DIR, $datadir ) or die "$datadir: $!";
my @files = grep {-T "$datadir/$_"} readdir DIR;
closedir DIR;

# open each file to be searched, output found lines to
# a corresponding NEW file in the current directory
for my $file ( @files ) {
    open( OUT, ">", "NEW_$file" ) or die "NEW_$file: $!";
    if ( open( IN, "<", "$datadir/$file" )) {
        while (<IN>) {
            # print to OUT explicitly; a bare print would go to STDOUT
            print OUT $_ if (( /^(\d{9})\|/ or /\|(\d{9})[|\r\n]/ )
                             and exists( $target_acc{$1} ));
        }
        close IN;
    }
    close OUT;
}

That hasn't been tested, but it does compile properly with strict and warnings enabled. Instead of using substring or split, I'm just trying a regex on each line of each file being searched, to match a 9-digit string either at the beginning of the line and followed by "|", or anywhere in the line, preceded by "|" and followed by "|" or newline. Then I see whether the 9-digit string that matched happens to be one of the numbers of interest, simply by checking the %target_acc hash.