retrieving information from a set of files

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: retrieving information from a set of files by swampyankee (Parson) on Oct 20, 2006 at 22:22 UTC
Perl is quite close to ideal for this sort of work. My first tendency would be to use split; this means I get to avoid playing games with substr (if I had to do that, I don't see much point in using Perl 8-)). Still, substr will work. The only issue, then, is how to determine which start/length pair to use for each file. This is easy (pseudo-code, not real code follows!):

    my %layout = (file1 => 0, file2 => 0, ... file8 => 20, file9 => 10, file10 => 11);
    # note that Perl starts counting at zero, so 1 had to be subtracted
    # from the starting positions.
    # now, for each file, you could yank the acct number out like this:
    $acct = substr($line, $layout{$file_name}, 9);
    # where $file_name corresponds to file1, file2, etc., in %layout as appropriate

Now, if your file names are not fixed, but use a predictable naming convention, the problem is slightly different, but still easy. Magic cookies would be similarly simple. If the files have column headers, you could parse those to find which field has the account numbers. This is slightly trickier.

So, your code would have this sort of logic:

1. Read the list of account numbers. It's small, so store it in an array (open(my $acct, "<", $account_to_check); @acct_num = <$acct>;).
2. Open one info file (there are ten...).
3. Use the %layout hash to find where the account number starts.
4. Using substr and the %layout table, yank out the account number.
5. If it's on your list (grep is perfect for this), perform your extract, and either print or store the information. If it's not, use next to skip to the next record.
6. Repeat as needed.

Again, my preference is generally to use split for simple delimited files; if you've got fields which include the pipe symbol (A|B|C|"this has a | pipe symbol but is one field"), you'll be better served by using a module, such as Text::CSV (I'm not endorsing that one; the only delimited files I've dealt with, I've controlled, so I knew, with certainty, that I wouldn't have fields where the delimiter should be ignored).
emc

At that time [1909] the chief engineer was almost always the chief test pilot as well. That had the fortunate result of eliminating poor engineering early in aviation.
-- Igor Sikorsky, reported in AOPA Pilot magazine February 2003.
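The steps listed in this reply can be sketched roughly as follows. This is only an illustration, not the poster's code: the file names and offsets in %layout, the in-memory @acct_num list, and the helper names extract_acct and matching_lines are all hypothetical, and the sample records stand in for lines read from a real file.

```perl
use strict;
use warnings;

# Hypothetical layout: zero-based offset of the 9-character account
# number for each file name. Names and offsets are illustrative only.
my %layout = ( file1 => 0, file8 => 20, file9 => 10 );

# Accounts of interest. In a real script these would be read from the
# account list file and chomped.
my @acct_num = qw(123456789 122456789);

# Return the account number from one record, or nothing if the file is
# not in %layout or the line is too short to contain the field.
sub extract_acct {
    my ( $file_name, $line ) = @_;
    my $start = $layout{$file_name};
    return unless defined $start;
    return if length($line) < $start + 9;
    return substr( $line, $start, 9 );
}

# Keep only the records whose account number is on the list.
sub matching_lines {
    my ( $file_name, @lines ) = @_;
    my @found;
    for my $line (@lines) {
        my $acct = extract_acct( $file_name, $line );
        next unless defined $acct;    # skip records we cannot parse
        push @found, $line if grep { $_ eq $acct } @acct_num;
    }
    return @found;
}

# Stand-in records for a file where the account number comes first.
my @sample = ( '123456789|234abc|00011', '999999999|224abc|01011' );
print "$_\n" for matching_lines( 'file1', @sample );
```

Scanning @acct_num with grep for every record is fine for a short list; if the list grows, storing the accounts as hash keys and testing with exists avoids the repeated scan.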
Re^2: retrieving information from a set of files by Anonymous Monk on Oct 21, 2006 at 03:24 UTC
Will this work?

    $accts = accts_file.txt;
    $dirtoget = "/my/data/dir/with/files";
    opendir(FILEDIR, $dirtoget) || die("Cannot open directory");
    @thefiles = grep -T, <$sample_dirtoget/*>;
    closedir(FILEDIR);
    my %layout = (file1 => 0, file2 => 0, file3 => 0, file4 => 0, file5 => 0,
                  file6 => 0, file6 => 0, file8 => 20, file9 ->10, file10 => 11);
    foreach $file (@thefiles) {
        $sample_filename = substr($file, 57);
        open(FILE, $file) || die ("Cannot open $file:$!");
        while ($line = <FILE>){
            chomp $line;
            $acct = substr($line, $layout{$file}, 9);
            `grep $acct $accts_file.txt >> NEW_$sample_filename`;
        }
        close(FILE);
    }
Re^3: retrieving information from a set of files by graff (Chancellor) on Oct 21, 2006 at 05:32 UTC
Have you tried it? Did it work? If not, how did it fail?

There are some obvious problems, most of which would be revealed to you if you add "use strict;" and "use warnings;" at the top of the script: no quotes around "accts_file.txt", "$accts" is used only once, `grep -T,<$sample_dirtoget/*>` should have curlies around -T and no comma, and you didn't assign a value to $sample_dirtoget. You do opendir, but then you use a glob instead of readdir (so opendir was unnecessary). There's probably more stuff like that, but you get the idea...

Doing a bunch of back-ticked grep commands inside your while loop isn't such a good solution, especially since you are not doing any error checking on those commands. In fact, I think you've lost your train of thought there. This probably is not really doing what you set out to do.

And I'm not quite sure I understand what you're trying to do with the %layout hash. Are those the actual file names you are using as hash keys? What if there's a file whose name doesn't match one of those keys? (Why get file names from a glob or readdir if you have the names in a hash?)

If the account number field is always nine digits (bounded by non-digit characters), is it the case that other fields in any given file would never contain exactly nine digits (bounded by non-digit characters)?
If so, you could try something like this:

    use strict;
    use warnings;

    # get the list of account numbers we're looking for:
    my %target_acc;
    my $target_file = "accts_file.txt";
    open( TARGS, "<", $target_file ) or die "$target_file: $!";
    while (<TARGS>) {
        chomp;
        $target_acc{$_} = undef;  # keep target account #s as hash keys
    }
    close TARGS;

    # get the list of files we want to search over:
    my $datadir = "/my/data/dir/with/files";
    opendir( DIR, $datadir ) or die "$datadir: $!";
    my @files = grep {-T "$datadir/$_"} readdir DIR;
    closedir DIR;

    # open each file to be searched, output found lines to
    # a corresponding NEW file in the current directory
    for my $file ( @files ) {
        open( OUT, ">", "NEW_$file" ) or die "NEW_$file: $!";
        if ( open( IN, "<", "$datadir/$file" )) {
            while (<IN>) {
                # print to OUT so matches land in the NEW_ file, not on STDOUT
                print OUT $_ if (( /^(\d{9})\|/ or /\|(\d{9})[|\r\n]/ )
                                 and exists( $target_acc{$1} ));
            }
            close IN;
        }
        close OUT;
    }

That hasn't been tested, but it does compile properly with strict and warnings enabled. Instead of using substr or split, I'm just trying a regex on each line of each file being searched, to match a 9-digit string either at the beginning of the line and followed by "|", or anywhere in the line, preceded by "|" and followed by "|" or a line ending. Then I see whether the 9-digit string that matched happens to be one of the numbers of interest, simply by checking the %target_acc hash.
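To see how that pair of regexes behaves, here is a small sketch that applies them to a few in-memory lines instead of files. The helper name wanted_line and the two accounts placed in %target_acc are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Hypothetical accounts of interest, stored as hash keys for fast lookup.
my %target_acc = map { $_ => undef } qw(123456789 123446789);

# Return 1 if the line holds a 9-digit field that is a target account:
# either at the start of the line and followed by "|", or preceded by
# "|" and followed by "|" or a line-ending character.
sub wanted_line {
    my ($line) = @_;
    if ( $line =~ /^(\d{9})\|/ or $line =~ /\|(\d{9})[|\r\n]/ ) {
        return exists $target_acc{$1} ? 1 : 0;
    }
    return 0;
}

for my $line (
    "123456789|234abc|00011\n",    # account in the first field
    "234abc|00011|123446789\n",    # account in the last field
    "234abc|00011|999999999\n",    # nine digits, but not a target
    "1234567890|234abc|00011\n",   # ten digits, so no 9-digit field
  )
{
    my $verdict = wanted_line($line) ? "keep" : "skip";
    print "$verdict: $line";
}
```

One limitation worth knowing: because the second pattern requires a "|" or a line-ending character after the digits, an account in the final field of a last line that has no trailing newline would not match.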
Re: retrieving information from a set of files by duckyd (Hermit) on Oct 20, 2006 at 21:23 UTC
How do you know where the account number is in a given file? If you have a header row, then the solution is easy, so I assume you don't. If that's the case, you will have to determine some way to tell where the account number is in a given file. For example, it sounds like your fields are fixed length. Are you guaranteed that the account number field will be the only field that is 9 characters long? If you have to manually inspect each file to find the account number field, then I am not sure how you would expect to solve the problem programmatically.
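If the account number really is the only field that is always nine digits, a script could find its position by sampling a few lines. The sketch below assumes pipe-delimited records like the sample data posted in this thread; the helper name nine_digit_columns is hypothetical.

```perl
use strict;
use warnings;

# Return the zero-based indexes of the pipe-delimited fields that hold
# exactly nine digits on every sampled line.
sub nine_digit_columns {
    my @lines = @_;
    my %candidate;
    my $first = 1;
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;                  # drop the line ending
        my @fields = split /\|/, $line, -1;    # -1 keeps trailing empty fields
        my %here = map { $_ => 1 }
          grep { $fields[$_] =~ /^\d{9}$/ } 0 .. $#fields;
        if ($first) {
            %candidate = %here;                # start with the first line's hits
            $first     = 0;
        }
        else {
            # drop any column that is not nine digits on this line
            delete $candidate{$_} for grep { !$here{$_} } keys %candidate;
        }
    }
    return sort { $a <=> $b } keys %candidate;
}

my @cols = nine_digit_columns(
    "234abc|00011|123456789\n",
    "224abc|01011|122456789\n",
);
print "candidate columns: @cols\n";
```

If the result has exactly one entry, that index can be used as the account column; an empty or multi-entry result means the guess is unreliable and the position has to come from somewhere else.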
Re^2: retrieving information from a set of files by Anonymous Monk on Oct 20, 2006 at 21:43 UTC
The account number is always 9 digits. In 7 of the files it is the first 9 characters of each line. In the remaining files it is also at a fixed location on each line.
Re^3: retrieving information from a set of files by Anonymous Monk on Oct 20, 2006 at 21:55 UTC
I forgot to note 1 additional comment: the number of fields may vary according to the file. Although my sample data illustrates only 3 columns, this is not true for the files I have.
Re^2: retrieving information from a set of files by Anonymous Monk on Oct 20, 2006 at 21:52 UTC
I've created some sample data to illustrate what I'm speaking of. I hope this helps.

One of the 3 exception files looks like this:

    234abc|00011|123456789
    224abc|01011|122456789
    244abc|10011|123356789
    254abc|11011|123446789

The other 7 files are in this format:

    123456789|234abc|00011
    123456789|234abc|00011
    123456789|234abc|00011
    122456789|224abc|01011
    123356789|244abc|10011
    123356789|244abc|10011
    123446789|254abc|11011

accts.txt:

    123456789
    122456789
    123356789
    123446789
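With sample data in this shape, a split-based filter is one possible approach. The sketch below is only an illustration: the helper name filter_by_field is hypothetical, and the field indexes (0 for the regular format, 2 for the exception file shown) are read off the sample rather than from the real files.

```perl
use strict;
use warnings;

# Account numbers from the sample accts.txt, stored as hash keys.
my %wanted = map { $_ => 1 } qw(123456789 122456789 123356789 123446789);

# Return the lines whose field at $index (zero-based) is a wanted account.
sub filter_by_field {
    my ( $index, @lines ) = @_;
    return grep {
        ( my $rec = $_ ) =~ s/\r?\n\z//;      # work on a copy without the line ending
        my @fields = split /\|/, $rec, -1;
        defined $fields[$index] && $wanted{ $fields[$index] };
    } @lines;
}

# One record in the exception format, one in the regular format, and one
# regular record whose account is not on the list.
print filter_by_field( 2, "234abc|00011|123456789\n" );
print filter_by_field( 0, "122456789|224abc|01011\n",
                          "999999999|224abc|01011\n" );
```

Because split does not care how wide the other fields are, this keeps working when the number of columns differs between files, as long as the index of the account field is known for each one.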