Or, if it's always a US-based school web address, then checking that it ends in .edu (or I guess .net, .com, or .org, and maybe .us) might be enough.
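A minimal sketch of that suffix check, assuming the string being tested is whatever follows `.date_file_sent.` in the filename. The sub name `looks_like_school_host` and the exact TLD list are illustrative assumptions, not something from the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the host ends in one of the accepted
# TLDs, 0 otherwise. The TLD list is an assumption; adjust it to your data.
sub looks_like_school_host {
    my ($host) = @_;
    return $host =~ m/ \. (?: edu | net | com | org | us ) \z /xi ? 1 : 0;
}

# Sample hosts taken from the filenames in the question, plus a non-host.
for my $host (qw( email@wolverine.cole.edu dp3.drew.net run_dir )) {
    printf "%-28s %s\n", $host,
        looks_like_school_host($host) ? 'ok' : 'rejected';
}
```

This only checks the suffix, so it will still accept any name that happens to end in one of those TLDs.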
As for readability and best practices, I'd write that snippet as follows, tested with the provided data:
    #! /usr/bin/perl
    use strict;
    use warnings;
    use autodie;    # errors from open made fatal

    sub DEBUG { 1 }

    my $filelist = 'tmp.txt';
    open my $filelist_handle, '<', $filelist;

    while (<$filelist_handle>) {
        chomp;
        my ($type, $school) = m!
            ^                   # anchor to beginning
            /home/test/         # common to all lines
            (\w{3})             # capture 'type'
            /\.date_[^.]+       # common to all lines
            (?:                 # non-capturing group
                .+?(\w+)\.\w+$  # capture domain name?
            |                   # or don't capture
            )                   # end group
        !x                      # /x flag means ignore white space in pattern
            or next;            # skip line if it doesn't match

        # do extra check that $school is acceptable
        $school //= 'null';     # regex gives undef if not found

        if (DEBUG) {
            print "match: $_\n";
            print "\ttype: $type\n";
            print "\tschool: $school\n";
        }
        else {
            open my $line_handle, '<', $_;
            while (<$line_handle>) {
                print "Type:$type:School:$school:File:$_\n";
            }
        }
    }
Example debug output:
    match: /home/test/abc/.date_run_dir
        type: abc
        school: null
    match: /home/test/def/.date_run_dir
        type: def
        school: null
    match: /home/test/abc/.date_file_sent.email@wolverine.cole.edu
        type: abc
        school: cole
    match: /home/test/abc/.date_file_sent.dp3.drew.net
        type: abc
        school: drew
    match: /home/test/def/.date_file_sent.email@wolverine.cole.edu
        type: def
        school: cole
    match: /home/test/def/.date_file_sent.dp3.drew.net
        type: def
        school: drew
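If you want to re-check the pattern without creating `tmp.txt` or the files it lists, one option is to wrap the same regex in a sub and feed it paths directly. This is only a sketch: the sub name `parse_path` and the `/tmp/unrelated` path are made up for illustration, while the other paths come from the debug output above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex from the script above.
# Returns (type, school) on a match, or an empty list otherwise.
sub parse_path {
    my ($path) = @_;
    my ($type, $school) = $path =~ m!
        ^ /home/test/
        (\w{3})                 # type
        /\.date_[^.]+
        (?: .+?(\w+)\.\w+$ | )  # school, if a domain is present
    !x or return;
    return ($type, $school // 'null');
}

for my $path (
    '/home/test/abc/.date_run_dir',
    '/home/test/def/.date_file_sent.dp3.drew.net',
    '/tmp/unrelated',           # made-up path that should not match
) {
    my ($type, $school) = parse_path($path)
        or do { print "no match: $path\n"; next };
    print "$path => type=$type school=$school\n";
}
```

Keeping the pattern in one sub also gives you a single place to add the extra check on `$school` later.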
Re^2: Help constructing proper regex to extract values from filenames and concurrently opening those same files to access records
by JaeDre619 (Acolyte) on Dec 11, 2010 at 16:04 UTC
by JaeDre619 (Acolyte) on Dec 12, 2010 at 00:07 UTC