Re: Help constructing proper regex to extract values from filenames and concurrently opening those same files to access records

Your program logic is actually pretty good. Without more information on what the school name can be, I'd say the only way to determine whether a school is specified (if other non-school-address data is allowed after "/.date_...") is to check your second regex capture against known-good or known-bad values.
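A minimal sketch of the known-good check, in case it helps. The names %known_school and is_known_school are made up for illustration, and the two school names are just the ones that appear in the sample data, so treat the list as an assumption:

```perl
use strict;
use warnings;

# Hypothetical whitelist: these two names come from the sample data only.
my %known_school = map { $_ => 1 } qw(cole drew);

# Returns 1 if the captured name is defined and on the whitelist, else 0.
sub is_known_school {
    my ($school) = @_;
    return 0 unless defined $school;
    return exists $known_school{$school} ? 1 : 0;
}

for my $candidate ('cole', 'drew', 'dir', undef) {
    my $label = defined $candidate ? $candidate : '(undef)';
    printf "%-8s %s\n", $label,
        is_known_school($candidate) ? 'school' : 'not a school';
}
```

A known-bad list would work the same way with the test inverted. Which of the two is easier to maintain depends on how often new schools show up.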

Or, if it's always a US-based school web address, then checking that it ends in .edu (or I guess .net, .com, or .org, and maybe .us) might be enough.
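Roughly, the suffix check could look like this. It is only a sketch: the list of top-level domains is the guess from the paragraph above, and looks_like_school_address is a made-up helper name:

```perl
use strict;
use warnings;

# Assumed list of acceptable top-level domains (a guess, not a spec).
my $tld_re = qr/\.(?:edu|net|com|org|us)\z/;

# Returns 1 if the file name ends in one of the accepted TLDs, else 0.
sub looks_like_school_address {
    my ($name) = @_;
    return $name =~ $tld_re ? 1 : 0;
}

for my $name (
    '/home/test/abc/.date_run_dir',
    '/home/test/abc/.date_file_sent.email@wolverine.cole.edu',
    '/home/test/def/.date_file_sent.dp3.drew.net',
) {
    print "$name: ", (looks_like_school_address($name) ? 'yes' : 'no'), "\n";
}
```

The \z anchor ties the match to the very end of the string, so a name that merely contains ".edu" somewhere in the middle does not pass.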

As for readability and best practices, I'd write that snippet as follows, tested with the provided data:

#!/usr/bin/perl
use strict;
use warnings;
use autodie; # errors from open made fatal

sub DEBUG { 1 }

my $filelist = 'tmp.txt';
open my $filelist_handle, '<', $filelist;
while (<$filelist_handle>) {
    chomp;
    my ($type, $school) = m!
      ^                 # anchor to beginning
      /home/test/       # common to all lines
      (\w{3})           # capture 'type'
      /\.date_[^.]+     # common to all lines
      (?:               # non-capturing group
        .+?(\w+)\.\w+$  # capture domain name?
        |               # or don't capture
      )                 # end group
    !x # /x flag means ignore white space in pattern
    or next; # skip line if it doesn't match
    
    # do extra check that $school is acceptable
    $school //= 'null'; # regex gives undef if not found
    
    if (DEBUG) {
        print "match: $_\n";
        print "\ttype: $type\n";
        print "\tschool: $school\n";
    } else {
        open my $line_handle, '<', $_;
        while (<$line_handle>) {
            print "Type:$type:School:$school:File:$_\n";
        }
    }
}

Example debug output:

match: /home/test/abc/.date_run_dir
        type: abc
        school: null
match: /home/test/def/.date_run_dir
        type: def
        school: null
match: /home/test/abc/.date_file_sent.email@wolverine.cole.edu
        type: abc
        school: cole
match: /home/test/abc/.date_file_sent.dp3.drew.net
        type: abc
        school: drew
match: /home/test/def/.date_file_sent.email@wolverine.cole.edu
        type: def
        school: cole
match: /home/test/def/.date_file_sent.dp3.drew.net
        type: def
        school: drew


Re^2: Help constructing proper regex to extract values from filenames and concurrently opening those same files to access records by JaeDre619 (Acolyte) on Dec 11, 2010 at 16:04 UTC
Thank you. This is great. Thanks a lot for breaking down that cryptic regex as well. I did try and test this, but had some issues with:

    # do extra check that $school is acceptable
    $school //= 'null'; # regex gives undef if not found

Error msg: Search pattern not terminated

I commented that out and it ran, although it had these errors:

    match: /home/test/abc/.date_run_dir
        type: abc
    Use of uninitialized value in concatenation (.) or string at ./test7.pl line 31, <$_[...]> line 1.
        school:
    match: /home/test/def/.date_run_dir
        type: def
    Use of uninitialized value in concatenation (.) or string at ./test7.pl line 31, <$_[...]> line 2.
        school:

Also, would you please show me how to extract values from the files I match? Can I do this in the same pass that I perform the regex? Example values from the files (.date_run_dir, etc.):

    $ cat .date_run_dir .date_file_sent.*
    /project/school/data/feed_abc_2010120816.ext3
    mail_abc.dat.2010120816.ext3
    mail_abc.dat.2010120816.ext3
Re^3: Help constructing proper regex to extract values from filenames and concurrently opening those same files to access records by Anonymous Monk on Dec 11, 2010 at 22:10 UTC
Oh, sorry about that error. `//=` is only in Perl 5.10.0 and later, and I should have noted that. The statement is equivalent to `$school = 'null' unless defined $school;`

For the values inside the listed files, you could use a similar regex (or build it and the original from another which contains the common parts of both) inside that inner while loop, yes?
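For anyone stuck on a pre-5.10 perl, here is the fallback spelled out as a small sketch. The helper name school_or_null is made up for the example:

```perl
use strict;
use warnings;

# Works on Perl 5.8 as well as 5.10+: no defined-or operator needed.
sub school_or_null {
    my ($school) = @_;
    $school = 'null' unless defined $school;
    return $school;
}

# undef is replaced; defined values (even '' and 0) are kept as they are.
for my $value (undef, 'cole', '', 0) {
    my $result = school_or_null($value);
    print "[$result]\n";
}
```

Note that `$school ||= 'null';` is not the same thing: it would also replace the empty string and 0, because it tests for truth rather than definedness.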
Re^4: Help constructing proper regex to extract values from filenames and concurrently opening those same files to access records by JaeDre619 (Acolyte) on Dec 12, 2010 at 00:07 UTC
I wouldn't need another regex inside the while loop. At this point, the regex you helped me with lists all the files I need to read in, while at the same time extracting the particular keys I wanted. I just need to open the filenames and get the values. I hope I am making sense. Would I do this within the while loop of the if (DEBUG) section?

UPDATE: Never mind. Thanks for your help! This does what I need it to do. I haven't really used DEBUG before. Once I changed it to 1 it does what I need to do. This is a cool script. I can always use that debug technique. Thanks for showing me the ways.
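For later readers, one possible shape for the "open each file and get the values" step, pulled out into a sub. This is a sketch under assumptions: read_records is a made-up name, the output layout copies the print statement in the script above, and a temporary file stands in for .date_run_dir so the example runs on its own:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read every record from $file and return one formatted line per record.
sub read_records {
    my ($type, $school, $file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my @out;
    while (my $record = <$fh>) {    # named variable, so the caller's $_ is untouched
        chomp $record;              # avoid a doubled newline when printing
        next unless length $record; # skip blank lines
        push @out, "Type:$type:School:$school:File:$record";
    }
    close $fh;
    return @out;
}

# Demo: a temporary file with one sample record from the thread.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "/project/school/data/feed_abc_2010120816.ext3\n";
close $tmp_fh;

print "$_\n" for read_records('abc', 'null', $tmp_name);
```

Reading into a named variable instead of $_ keeps the outer loop's file name intact while the inner loop runs, which matters if you want to print the file name after reading its records.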