Help constructing proper regex to extract values from filenames and concurrently opening those same files to access records

JaeDre619 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks,

It seems like I've been a novice for some time. I use Perl in spurts and forget a lot of the things I just learned. I really want to get better at this, so I'm always asking for best practices, since most of what I use Perl for is building small utilities to help me as a database administrator.

On that note, I need your help with my Perl script, especially my regular expression. I have an input file (FileList.txt) that is fairly static and has only a few lines. Each entry in this input file is a filename with its full directory path. The first two files listed each contain another file name, but their own filenames don't have the school name embedded like the others.

Ex. (FileList.txt)

 
/home/test/abc/.date_run_dir
/home/test/def/.date_run_dir
/home/test/abc/.date_file_sent.email@wolverine.cole.edu
/home/test/abc/.date_file_sent.dp3.drew.net
/home/test/def/.date_file_sent.email@wolverine.cole.edu
/home/test/def/.date_file_sent.dp3.drew.net

For each file listed, I want to extract the type (abc or def) and place it in a variable. I also want to extract the school name (cole or drew) into a variable. If there is no school name, as with the first two files, the value should be null.

Also, for each filename listed, I want the contents of that file in a variable. Each file holds a single value. Nothing complex.

Script thus far:

use strict;
#use warnings;

my $file = '/home/test/FileList.txt';
open my $FILE, '<', $file or die "unable to open '$file' for reading: $!";
while (my $line = <$FILE>) {
    chomp($line);
    #if ($line =~ m#home/test/(\w{3}).*[.](\w+)[.].*#) {
    if ($line =~ m#home/test/(\w{3}).*[.](\w+)[.]?.*#) {
        #print "$line\n";  .last_file_sent*
        open my $file2, '<', $line or die "unable to open '$line' for reading: $!";
        while(my $line2 = <$file2>) {
        print "Type:$1:School:$2:File:$line2";
        #print "$line2";
        }
        close $file2;
    }
} #end while
close $FILE;

Output:

(Note: the regex is capturing edu or net, which is not what I want. It is also capturing date_run_dir; when there is no school name in the file name, the value should default to null.)

Type:abc:School:date_run_dir:File:/product/classroom/subject/data/sysfeed_abc_2010120810.ext3
Type:def:School:date_run_dir:File:/product/classroom/subject/data/sysfeed_def_2010120806.ext3
Type:abc:School:edu:File:domain_abc.dat.2010120810.ext3
Type:abc:School:net:File:domain_abc.dat.2010120810.ext3
Type:def:School:edu:File:domain_def.dat.2010120805.ext3
Type:def:School:net:File:domain_def.dat.2010120804.ext3

Thanks for your help.

Re: Help constructing proper regex to extract values from filenames and concurrently opening those same files to access records by Anonymous Monk on Dec 11, 2010 at 08:28 UTC
Your program logic is actually pretty good. Without more information on what the school name can be, I'd say the only way to determine whether there is a school specified -- if other non-school-address data is allowed after "/data_..." -- is to check your second regex capture against known-good or known-bad values. Or, if it's always a US-based school web address, then checking that it ends in .edu (or I guess .net, .com, or .org, and maybe .us) might be enough.

As for readability and best practices, I'd write that snippet as follows, tested with the provided data:

#!/usr/bin/perl
use strict;
use warnings;
use autodie; # errors from open made fatal

sub DEBUG { 1 }

my $filelist = 'tmp.txt';
open my $filelist_handle, '<', $filelist;
while (<$filelist_handle>) {
    chomp;
    my ($type, $school) = m!
        ^                   # anchor to beginning
        /home/test/         # common to all lines
        (\w{3})             # capture 'type'
        /\.date_[^.]+       # common to all lines
        (?:                 # non-capturing group
            .+?(\w+)\.\w+$  # capture domain name?
            |               # or don't capture
        )                   # end group
    !x                      # /x flag means ignore white space in pattern
        or next;            # skip line if it doesn't match

    # do extra check that $school is acceptable
    $school //= 'null';     # regex gives undef if not found

    if (DEBUG) {
        print "match: $_\n";
        print "\ttype: $type\n";
        print "\tschool: $school\n";
    }
    else {
        open my $line_handle, '<', $_;
        while (<$line_handle>) {
            print "Type:$type:School:$school:File:$_\n";
        }
    }
}

Example debug output:

match: /home/test/abc/.date_run_dir
    type: abc
    school: null
match: /home/test/def/.date_run_dir
    type: def
    school: null
match: /home/test/abc/.date_file_sent.email@wolverine.cole.edu
    type: abc
    school: cole
match: /home/test/abc/.date_file_sent.dp3.drew.net
    type: abc
    school: drew
match: /home/test/def/.date_file_sent.email@wolverine.cole.edu
    type: def
    school: cole
match: /home/test/def/.date_file_sent.dp3.drew.net
    type: def
    school: drew
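The "extra check that $school is acceptable" step is only a comment in the snippet above. One possible way to fill it in, as a minimal sketch rather than the poster's own method: accept the captured name only when the file name ends in a top-level domain from a known-good list. The sub name `school_from_path` and the `%school_tld` hash are hypothetical, and the domain list simply repeats the ones suggested in the reply. The `/home/test/xyz/...` path is made up to show the rejected case.

```perl
use strict;
use warnings;

# Assumed whitelist: top-level domains treated as school addresses.
my %school_tld = map { $_ => 1 } qw(edu net com org us);

# Returns the school name when the file name ends in "<school>.<tld>"
# and <tld> is whitelisted; otherwise returns the string 'null'.
sub school_from_path {
    my ($path) = @_;
    if ($path =~ m{ \.date_[^.]+ \. .*? (\w+) \. (\w+) \z }x) {
        my ($school, $tld) = ($1, $2);
        return $school if $school_tld{ lc($tld) };
    }
    return 'null';
}

for my $path (
    '/home/test/abc/.date_run_dir',
    '/home/test/abc/.date_file_sent.email@wolverine.cole.edu',
    '/home/test/def/.date_file_sent.dp3.drew.net',
    '/home/test/xyz/.date_file_sent.some.thing.else',   # hypothetical non-school name
) {
    print "$path => ", school_from_path($path), "\n";
}
```

A known-bad list would work the same way with the test inverted; which one fits better depends on how predictable the addresses are.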
Re^2: Help constructing proper regex to extract values from filenames and concurrently opening those same files to access records by JaeDre619 (Acolyte) on Dec 11, 2010 at 16:04 UTC
Thank you. This is great. Thanks a lot for breaking down that cryptic regex as well. I did try and test this, but had some issues with:

# do extra check that $school is acceptable
$school //= 'null'; # regex gives undef if not found

Error msg:

Search pattern not terminated

I commented that out and it ran, although it had these errors:

match: /home/test/abc/.date_run_dir
    type: abc
Use of uninitialized value in concatenation (.) or string at ./test7.pl line 31, <$_[...]> line 1.
    school:
match: /home/test/def/.date_run_dir
    type: def
Use of uninitialized value in concatenation (.) or string at ./test7.pl line 31, <$_[...]> line 2.
    school:

Also, would you please show me how to extract values from the files I match? Can I do this in the same pass in which I perform the regex?

Example values from the files (.date_run_dir, etc.):

$ cat .date_run_dir .date_file_sent.*
/project/school/data/feed_abc_2010120816.ext3
mail_abc.dat.2010120816.ext3
mail_abc.dat.2010120816.ext3
Re^3: Help constructing proper regex to extract values from filenames and concurrently opening those same files to access records by Anonymous Monk on Dec 11, 2010 at 22:10 UTC
Oh, sorry about that error. `//=` is only in Perl 5.10.0 and later, and I should have noted that. The statement is equivalent to `$school = 'null' unless defined $school;` For the values inside the listed files, you could use a similar regex (or build it and the original from another which contains the common parts of both) inside that inner while loop, yes?
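As a rough sketch of that suggestion, the sub below applies a second regex to a value read from one of the listed files and uses the `unless defined` form so it also runs on Perl older than 5.10. The sub name `parse_value` is hypothetical, and the pattern assumes the values always look like the two samples posted above: a three-letter lowercase type after an underscore, then `_` or `.dat.`, then a ten-digit stamp.

```perl
use strict;
use warnings;

# Pulls the type and the ten-digit stamp out of a value such as
# "mail_abc.dat.2010120816.ext3"; either part falls back to 'null'.
sub parse_value {
    my ($value) = @_;
    my ($type, $stamp) = $value =~ m/ _ ([a-z]{3}) (?: _ | \.dat\. ) (\d{10}) \. /x;
    $type  = 'null' unless defined $type;    # pre-5.10 spelling of //=
    $stamp = 'null' unless defined $stamp;
    return ($type, $stamp);
}

for my $value (
    '/project/school/data/feed_abc_2010120816.ext3',
    'mail_abc.dat.2010120816.ext3',
) {
    my ($type, $stamp) = parse_value($value);
    print "Value:$value:Type:$type:Stamp:$stamp\n";
}
```

In the script above, the call would go inside the inner while loop, after a `chomp` of the line read from the file, so the file name and its contents are handled in the same pass.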
Re^4: Help constructing proper regex to extract values from filenames and concurrently opening those same files to access records by JaeDre619 (Acolyte) on Dec 12, 2010 at 00:07 UTC
Re: Help constructing proper regex to extract values from filenames and concurrently opening those same files to access records by Marshall (Canon) on Dec 12, 2010 at 09:24 UTC
If your data is as regular as it appears to be, a couple of splits instead of regexes will get the job done.

#!/usr/bin/perl -w
use strict;

while (<DATA>)
{
    chomp;
    my ($basedir, @dot_names) = split(/\./,$_);
    my $type = (split('/',$basedir))[-1];
    my $school = "null";
    $school=$dot_names[-2] if (@dot_names >1);
    #open the file here and use the variables as prefix for each line
    print "type: $type school: $school\n";
}

=prints
type: abc school: null
type: def school: null
type: abc school: cole
type: abc school: drew
type: def school: cole
type: def school: drew
=cut

__DATA__
/home/test/abc/.date_run_dir
/home/test/def/.date_run_dir
/home/test/abc/.date_file_sent.email@wolverine.cole.edu
/home/test/abc/.date_file_sent.dp3.drew.net
/home/test/def/.date_file_sent.email@wolverine.cole.edu
/home/test/def/.date_file_sent.dp3.drew.net
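The "open the file here" comment can be fleshed out roughly as follows. This is only a sketch: the sub names `type_and_school` and `prefixed_records` are hypothetical, and the demo reads from an in-memory filehandle with assumed contents so that it runs without the real files. In the actual script the handle would come from `open my $fh, '<', $path`.

```perl
use strict;
use warnings;

# Same split logic as in the reply above, wrapped in a sub.
sub type_and_school {
    my ($path) = @_;
    my ($basedir, @dot_names) = split(/\./, $path);
    my $type   = (split('/', $basedir))[-1];
    my $school = @dot_names > 1 ? $dot_names[-2] : 'null';
    return ($type, $school);
}

# Reads every line from $fh and prefixes it with the type and school
# derived from $path.
sub prefixed_records {
    my ($path, $fh) = @_;
    my ($type, $school) = type_and_school($path);
    my @records;
    while (my $record = <$fh>) {
        chomp $record;
        push @records, "Type:$type:School:$school:File:$record";
    }
    return @records;
}

# Demo: an in-memory handle stands in for the real file (assumed contents).
my $path     = '/home/test/abc/.date_file_sent.dp3.drew.net';
my $contents = "mail_abc.dat.2010120816.ext3\n";
open my $fh, '<', \$contents or die "cannot open in-memory handle: $!";
print "$_\n" for prefixed_records($path, $fh);
close $fh;
```

Note that the split approach relies on the directory part of the path containing no dots; a dotted directory name would shift the fields.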
Re^2: Help constructing proper regex to extract values from filenames and concurrently opening those same files to access records by JaeDre619 (Acolyte) on Dec 12, 2010 at 16:31 UTC
@Marshall. Thanks! I can see how split can come in very handy as well. You made it look easy.