zarath has asked for the wisdom of the Perl Monks concerning the following question:

Hello

Still a bit new to Perl here, but learning pretty quickly.

Please consider the following piece of code.

The context and what I'm supposed to do with this:

I have several .xml-files (here FILE) and 1 .txt-file (here QR).

Every FILE is a .xml-file about a person, but these files lack the date of birth, it is my assignment to complete these files by adding this date of birth.

This is where the file QR steps in, this contains FirstName, LastName, Email and BirthDate of every one of these people.

So with the code underneath I am trying to match FirstName, LastName and Email in both FILE and QR to extract the BirthDate that way and paste it into FILE.

This question would not be here if it worked... Apparently $firstname, $lastname and $email need to be defined in the sub fetchBD. But if you look at the main piece of code, these 3 are defined by the time this sub gets called, so I don't understand why this is an issue.

I am not used to using subroutines, so it just might be a bad choice to throw them in, it just seemed to me that is would be the easiest way of doing what I'm supposed to do. Definitely feel free to disagree here and all tips are welcome.

2 notes that are less important:

1. use of index() and substr() are to my knowledge the safest way to fetch a specific string if it is surrounded by a known constant, any tips on easier / cleaner-looking ways are of course welcome.

2. the use of a triple 'if'-block seems to me the safest way to make sure all 3 strings are matched from the same FILE and from the same QR-line. Again, any tips on easier / cleaner-looking ways are of course welcome.

use strict; # because we should use warnings; # because we should use autodie; # die if problem reading or writing a file my $base = 'D:/Some/Specific/Folder'; open (QR, '<', $oldqr); for my $file (glob qq($base/ToUpload/Staging/*)) { open FILE,'+<',$file; binmode (FILE); while (my $line = <FILE>){ my $firstname = getFN(); my $lastname = getLN(); my $email = getEM(); if ($line =~ /<birthdate>/) { my $bdate = fetchBD(); print FILE ' <birthdate>'.$bdate.'</birthdate>'; } } } close QR; sub fetchBD { while (my $line = <QR>){ my $fns = index($line,'FirstName'); $fns += 10; my $fne = index($line,' LastName'); my $fnl = $fne - $fns; my $fn = substr($line,$fns,$fnl); my $lns = index($line,'LastName'); $lns += 10; my $lne = index($line,' Email'); my $lnl = $lne - $lns; my $ln = substr($line,$lns,$lnl); my $ems = index($line,' Email '); $ems += 7; my $eme = index($line,' BirthDate'); my $eml = $eme - $ems; my $em = substr($line,$ems,$eml); if ($fn = $firstname) { if ($ln = $lastname) { if ($em = $email) { my $bdates = index($line,' BirthDate '); $bdates += 11; my $bdate = substr($line,$bdates,10); return $bdate; } } } } } sub getFN{ while (my $line = <FILE>){ if ($line =~ /<firstname>/) { my $fnse = index($line,'<firstname>'); $fnse += 11; my $fnen = index($line,'</firstname>'); my $fnlo = $fnen - $fnse; my $firstname = substr($line,$fnse,$fnlo); return $firstname; } } } sub getLN{ while (my $line = <FILE>){ if ($line =~ /<name>/) { my $lnse = index($line,'<name>'); $lnse += 6; my $lnen = index($line,'</name>'); my $lnlo = $lnen - $lnse; my $lastname = substr($line,$lnse,$lnlo); return $lastname; } } } sub getEM{ while (my $line = <FILE>){ if ($line =~ /<email>/) { my $emse = index($line,'<email>'); $emse += 7; my $emen = index($line,'</email>'); my $emlo = $emen - $emse; my $email = substr($line,$emse,$emlo); return $email; } } }

I am sure I'm missing something crucial here, but I'm having a hard time to keeping track of what I'm actually doing with this code.

Thank you for your help anyways!

Replies are listed 'Best First'.
Re: Matching specific strings (are subs a good idea?)
by Athanasius (Archbishop) on Sep 07, 2017 at 17:00 UTC

    Hello zarath,

    In Perl, a variable declared with my has lexical scope, meaning it is not visible outside the scope in which it is declared. So when you declare my $firstname inside the main for loop, that variable goes out of scope (i.e., cannot be seen) when the loop block ends. As the subroutines are outside of this scope, they can’t “see” $firstname. Actually, this is a GOOD THING, as it makes it less likely that $firstname will be accidentally overwritten in some unrelated part of the code.

    When using subroutines, it is standard practice to pass variables in like this:

    ... my $bdate = fetchBD($firstname, $lastname, $email); ... sub fetchBD { my ($firstname, $lastname, $email) = @_; while (my $line = <QR>){ ...

    This technique creates local copies of the 3 variables passed in. See perlintro#Writing-subroutines.

    But the main problem with the code shown is with the logic of the subroutines. When the while loop within the main for loop processes a line of input, it calls subroutines to extract the desired information. But within each of those subroutines, reading of that same file continues, line-by-line, in a another while loop, until the input is exhausted! In fact, the while loops in the subroutines are not needed. You should rather just pass in the input line along with the other data:

    ... my $bdate = fetchBD($line, $firstname, $lastname, $email); ... sub fetchBD { my ($line, $firstname, $lastname, $email) = @_; my $fns = index($line,'FirstName'); ...

    It will actually be a lot easier to extract the data using regular expressions than using index and substr. It will help if you give some sample data from one of the xml files and from the text file. (Also note that, as a general rule, it’s better to extract xml data using a dedicated module such as XML::LibXML.)

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Matching specific strings (are subs a good idea?)
by karlgoethebier (Abbot) on Sep 07, 2017 at 16:59 UTC

    You should consider to use something like XML::LibXML, XML::Twig or XML::LibXML::Simple for your XML files - if you are allowed.

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

Re: Matching specific strings (are subs a good idea?)
by Laurent_R (Canon) on Sep 07, 2017 at 20:21 UTC
    Hi zarath,

    in addition to what my fellow monks have already said above (with which I totally agree), your algorithm is very inefficient. Basically, for every single XML record in your XML files, you're reading sequentially the full QR file (and spend quite some effort decoding many times each of its lines). If your XML and/or QR file are large, this will take ages.

    The best way to solve this type of problem in Perl is usually to start by reading the QR file and decoding its lines only once, to store its content in a hash and to close it. Then only you read the XML file and look up for names, first name, and email in the hash to get the missing information (birth date). Hash lookup is very fast.

    There are various ways to store the data into a hash, but the simplest implementation for a beginner of the hash would probably be to concatenate first name, name and email (with separators) into a string to form the keys of the hash and to store the birth date as the value. Maybe something like this:

    ( "john|doe|john.d@hismail.com" => 20020924, "liz|schmoe|lizschmoe@hermail.uk" => 20040318, ... )
    Then, when you read the XML file, you pick up first name, name and email address, construct a string the same way you built the hash keys, and use it to lookup the hash; you'll find instantly the birth date that you need to add to the output XML files.

    With relatively large files, this will be literally orders of magnitude faster than your repeated loops through the QR file. And the code will actually be simpler.

    I can't show you this in detail right now because you haven't provided data examples. Please provide a sample of the QR file (just a few lines), and I (or some other monks) will be happy to show you how to do that. We'll also most probably be able to show you much simpler ways to to extract the relevant information from the lines of the QR file (what I called "decoding the lines" above).

Re: Matching specific strings (are subs a good idea?)
by zarath (Beadle) on Sep 08, 2017 at 07:56 UTC

    Thank you for the tips so far guys, I can definitely stop feeling stuck now :-)

    When posting this question I forgot it might actually help for getting the most specific advise if I'd add some snippets of the files used, so here they are

    what the .xml-files look like:

    <contractor customergroup="PAR"> <person> <firstname>Julie</firstname> <name>Keppen</name> <sex>F</sex> <birthdate></birthdate> <phone>0485651115</phone> <email>juliekeppen@gmail.com</email> </person> </contractor>

    What the .txt-file looks like:

    FirstName Julie LastName Keppen Email juliekeppen@gmail +.com BirthDate 1987-11-11 FirstName Amaury LastName Reinquin Email a.reinquin@out +look.com BirthDate 1991-08-24 FirstName Pierre LastName Vaucamps Email pierre.vaucamp +s@gmail.com BirthDate 1988-11-26 FirstName Stephanie-Katrien LastName Eggermont-Witpas Emai +l eggermont.st@gmail.com BirthDate 1900-01-01

    I have just changed the query to output this result in a more readable way:

    Julie;Keppen;juliekeppen@gmail.com;1987-11-11 Amaury;Reinquin;a.reinquin@outlook.com;1991-08-24 Pierre;Vaucamps;pierre.vaucamps@gmail.com;1988-11-26 Stephanie-Katrien;Eggermont-Witpas;eggermont.st@gmail.com;1900-01-01

    My gut tells me this last version might be easier when working with hashes.

    Have looked up stuff on hashes but it is not clear to me how to use them in this assignment, every time I read something on hashes, the first thing they do is store specific strings in them like this:

    %HoA = ( flintstones => [ "fred", "barney" ], jetsons => [ "george", "jane", "elroy" ], simpsons => [ "homer", "marge", "bart" ], );

    This is not very helpful to me, I'm not going to manually type in all names, may aswell manually look up the bdates and copy-paste them in the .xml then. But of course, I know I'm missing something essential in the explanation given about the hashes. Just don't know what.

      This works for the data you posted:
      #!/usr/bin/perl use warnings; use strict; use XML::LibXML; my $text_file = shift; open my $In, '<', $text_file or die $!; my $extract = join '\s+(.*?)\s*', qw( FirstName LastName Email BirthDa +te $ ); $extract = qr/$extract/; my %birthday; while (<$In>) { my ($first, $last, $email, $bdate) = /$extract/; $birthday{$email} = $bdate; } while (my $xml_file = shift) { my $dom = 'XML::LibXML'->load_xml(location => $xml_file); my $email = $dom->findvalue('/contractor/person/email'); unless (exists $birthday{$email}) { warn "No birthday for $email!\n"; next } my $person = $dom->find('/contractor/person'); my $bday = $person->[0]->addChild($dom->createElement('birthday')) +; $bday->appendText($birthday{$email}); $dom->toFile($xml_file); }

      See XML::LibXML for documentation.

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      An example using XML::Twig

      #!/usr/bin/perl use strict; use autodie; use XML::Twig; use Data::Dumper; $Data::Dumper::Terse = 1; use constant DEBUG => 1; my $base = 'D:/Some/Specific/Folder'; # birthdate lookup table my $oldqr = 'c:/temp/oldqr.txt'; my $lookup = fetch_birthdates($oldqr); print '$lookup=',Dumper $lookup if DEBUG; # process XML files #my @files = glob( $base.'/ToUpload/Staging/*' ); my @files = ('test.xml'); print '@files=',Dumper \@files if DEBUG; for my $file (@files){ add_birthdate($file,$lookup); } # build birthdate lookup table sub fetch_birthdates { my $infile = shift; my %hash = (); my $count = 0; open IN,'<',$infile; # autodie while (<IN>){ ++$count; chomp; # old format #if (my @f = /FirstName(.*)LastName(.*)Email(.*)BirthDate(.*)/i){ # s/^\s+|\s+$//g for @f; # trim spaces my @f = split ';',$_; # new format my $pk = join ';',@f[0..2]; # create lookup key if (exists $hash{$pk}){ die "ERROR Duplicate record in $infile '$pk' line $count\n"; } else { $hash{$pk} = $f[3]; } #} } close IN; print "$count records read from $infile\n"; return \%hash; } # add birth date from lookup # using firstname;name;email as key sub add_birthdate { my ($file,$lookup) = @_; my $twig = XML::Twig->new( pretty_print => 'indented' ); $twig->parsefile( $file ); my ($person) = $twig->findnodes( 'person' ); my $pk = join ";", $person->findvalue('firstname'), $person->findvalue('name'), $person->findvalue('email'); print "Reading $file : "; if (exists $lookup->{$pk}){ my $birthdate = $lookup->{$pk}; my $node = $person->first_child( 'birthdate' ); $node->set_text( $birthdate ); print "$birthdate added for '$pk'\n"; open my $out,'>',$file.'.modified'; #autodie $twig->print($out); close $out; } else { print "ERROR - no birthdate for '$pk'\n"; } }
      poj