leighgable has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

Sorry about this title, but I don't know what the correct terminology for my current woes would be. Any assistance is greatly appreciated.

I am using some subroutines to grab data out of xml files and then return some summary information, such as "total number of article elements". I am also using a unique reference ID to check for duplicate article elements.

It's important that the program keeps track of what files it's parsing, and returns the corresponding summary data with the right file.

As you can see if you run my program on the directory of sample xml data, the results correspond to the file before the one listed, and whatever the first file is, I get a double-zero result for the first iteration while the program "revs up".
I know my error must be a simple one, but I have had no luck finding the solution, maybe for lack of knowing the terminology. Any clues would be greatly appreciated.

I am new to programming, just trying to stumble my way through this project, so if you see something in my code that makes you roll your eyes, please let me know. I'm looking for helpful criticism.

All the best,

Leigh

Update: I changed the download link to a site with no registration, and posted a data snippet at the bottom, also ran my code through perltidy. Thanks for tips.

Here is the data.

#!/usr/bin/perl

# turn on perl safety features
use strict;
use warnings;

#initialize modules
use XML::Twig;
use DirHandle;

my ($dir, $filepath, @filename, @filepath_list, %company_hist, @company_list,
    %reference_hist, @reference_list);

$dir = $ARGV[0] or die "Must specify directory";
@filepath_list = get_file_list($dir);

foreach $filepath (@filepath_list) {
    my $twig = XML::Twig -> new( twig_roots => { 'article/reference' => \&get_ref,
                                                 company => \&get_code });

    $company_hist{$_}++ for @company_list;            #sort results from
    my @unique_comp = keys %company_hist;             #"get_code" sub then
    @company_list = ( sort { $company_hist{$b} <=>    #return 3 most freq
                      $company_hist{$a} } @unique_comp )[0..2];    #codes.

    #my $ref_length = scalar( @reference_list );      #take ref list and
    $reference_hist{$_}++ for @reference_list;        #eliminate duplicates
    my @unique_ref = keys %reference_hist;            #return a tally
    my $uni_count = scalar ( @unique_ref );
    my @by_date_tally = get_date(\@unique_ref);
    my $dup_count = (scalar( @reference_list ) - $uni_count);

    #my($k, $v);
    #while ( ($k,$v) = each %reference_hist ) {
    #print "$k => $v\n";
    #}

    #print "File name: ", print_file_name($filepath), "\n";
    #print "Total Dupicate Articles: $dup_count\n";
    #print "Total Articles Found: $uni_count\n";
    #print "@company_list\n";

    undef %company_hist;                              #reinitialize global
    undef @company_list;                              #vars.
    undef %reference_hist;                            #
    undef @reference_list;                            #

    $twig->parsefile($filepath);                      # purge to save mem.
    $twig->purge;
}    #end of foreach loop
exit(0);

sub get_file_list {
    $dir = shift;
    print $dir, "\n";
    my $dh = DirHandle->new($dir) or die "can't open directory";
    return sort                     # sort pathnames
           grep { -f }              # choose only files
           map  { "$dir/$_" }       # create full paths
           grep { !/^\./ }          # filter out dot files
           $dh->read();             # read all filenames
}

sub print_file_name {
    my($path, $position, $path_strip);                #take filepath and
    $path = $_[0];                                    #return filename.
    $position = rindex($path,"/") + 1;                #
    $path_strip = substr($path, $position);
    #print "For file: $path_strip\n";
    return $path_strip;
}

sub get_code {
    my $company;                                      #get company code
    my( $twig, $elt)= @_;                             #attribute and
    $company = $elt->{'att'}->{'code'};               #put into array
    push @company_list, $company;
    #return @company_list;
}

sub get_ref {
    my( $twig, $elt)= @_;                             #take reference elt
    my $ref = $elt;                                   #and return just the
    my $position = rindex($ref->text(), "/") + 1;     #reference ID string
    my $ref_strip = substr($ref->text(), $position);
    push @reference_list, $ref_strip;
}

sub get_date {
    my $ref;
    #my @refs = @_;
    foreach $ref (@_) {
        print @$ref, "\t";
    }
    print "\n\n";
}

And here is some sample data. Normally it's in several files in a directory.
<article>
  <accessionno>MTPW000020090731e57v004mr</accessionno>
  <reference>distdoc:archive/ArchiveDoc::Article/MTPW000020090731e57v004mr</reference>
  <baselanguage>EN</baselanguage>
  <copyright>(c) 2009 M2 Communications, Ltd. All Rights Reserved. </copyright>
  <headline>
    <paragraph display="Proportional" truncation="None" lang="EN">Anadys Pharmaceuticals, Inc (NASDAQ:ANDS) is the Highest Percentage Gainers Among NASDAQ Stocks During Morning Trading Hours; Microsoft Corporation (NASDAQ:<hlt>MSFT</hlt>) And Orthofix International NV (NASDAQ:OFIX) Round Out Top Three Percentage Gainers During Morning Trading Hours</paragraph>
  </headline>
  <publicationdate>
    <date>2009-07-31</date>
  </publicationdate>
  <sourcename>M2 Presswire</sourcename>
  <company code="oficks">
    <name>Orthofix International N.V.</name>
    <newsmentions>0</newsmentions>
    <newshits>0</newshits>
  </company>
  <company code="scrptg">
    <name>Anadys Pharmaceuticals Inc</name>
    <newsmentions>0</newsmentions>
    <newshits>0</newshits>
  </company>
  <company code="mcrost">
    <name>Microsoft Corporation</name>
    <newsmentions>0</newsmentions>
    <newshits>0</newshits>
  </company>
  <industry code="i3302">
    <name>Computers/Electronics</name>
  </industry>
  <industry code="i3302021">
    <name>Applications Software</name>
  </industry>
  <industry code="i257">
    <name>Pharmaceuticals</name>
  </industry>
  <industry code="i330202">
    <name>Software</name>
  </industry>
  <industry code="icomp">
    <name>Computing</name>
  </industry>
  <industry code="i3302020">
    <name>Systems Software</name>
  </industry>
  <industry code="i372">
    <name>Medical Equipment/Supplies</name>
  </industry>
  <industry code="i951">
    <name>Health Care</name>
  </industry>
  <industry code="iphmed">
    <name>Medical/Surgical Instruments/Apparatus/Devices</name>
  </industry>
  <region code="usa">
    <name>United States</name>
  </region>
  <region code="namz">
    <name>North American Countries/Regions</name>
  </region>
  <newssubject code="c42" position="0">
    <name>Labor/Personnel Issues</name>
  </newssubject>
  <newssubject code="ghepat" position="0">
    <name>Hepatitis</name>
  </newssubject>
  <newssubject code="mstock" position="0">
    <name>Stock Exchanges</name>
  </newssubject>
  <newssubject code="npress" position="0">
    <name>Press Release</name>
  </newssubject>
  <newssubject code="ccat" position="0">
    <name>Corporate/Industrial News</name>
  </newssubject>
  <newssubject code="gcat" position="0">
    <name>Political/General News</name>
  </newssubject>
  <newssubject code="ghea" position="0">
    <name>Health</name>
  </newssubject>
  <newssubject code="gmed" position="0">
    <name>Medical Conditions</name>
  </newssubject>
  <newssubject code="m11" position="0">
    <name>Equity Markets</name>
  </newssubject>
  <newssubject code="mcat" position="0">
    <name>Commodity/Financial Market News</name>
  </newssubject>
  <newssubject code="ncat" position="0">
    <name>Content Types</name>
  </newssubject>
  <newssubject code="nfact" position="0">
    <name>Factiva Filters</name>
  </newssubject>
  <newssubject code="nfce" position="0">
    <name>FC&amp;E Exclusion Filter</name>
  </newssubject>
  <newssubject code="nfcpin" position="0">
    <name>FC&amp;E Industry News Filter</name>
  </newssubject>
  <sourcecode>MTPW</sourcecode>
  <tailparagraphs> </tailparagraphs>
  <contact>Liquid Tycoon | e-mail: info@LiquidTycoon.com | Tel: +1 214 556 6798 </contact>
  <logo source="http://logos.factiva.com.ezproxy.insead.edu" image="mtpwLogo.gif"></logo>
  <wordcount>679</wordcount>
</article>

Re: Returning the right results, but at the wrong time from subroutines
by moritz (Cardinal) on Sep 06, 2009 at 22:08 UTC
    From a quick glance over your code I didn't find your problem, but two things I noticed:

    First, you use code like this:

    my @variable;
    for (... ) {
        ...
        undef @variable;
    }

    It is much saner to declare @variable inside the for loop instead, so it gets cleaned up automatically when the loop body is exited:

    for (...) {
        my @variable;
        ...
        # no manual cleanup necessary.
    }

    So you keep the scope of your variables small, you can't forget to clean up, and if you exit the current iteration with next or last it is still cleaned up.

    Update: I just noticed that you access these variables from subroutines too. Then you also have to declare this subroutine inside the loop - which is a good idea anyway, because then you don't get confusing "action at a distance". So the start of your loop might look like this:

    foreach $filepath (get_file_list($dir)) {
        my (%company_hist, @company_list, %reference_hist, @reference_list);
        my $twig = XML::Twig -> new( twig_roots => {
            company => sub { push @company_list, $_[1]->{'att'}->{'code'} },
            ...
        } );
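
To see the scoping idea in isolation, here is a minimal sketch that uses only core Perl. The @batches data and the $collect callback are made up for illustration; the callback stands in for a twig handler. Because @seen_items is declared inside the loop, each iteration gets a fresh array and the anonymous sub closes over that iteration's copy:

```perl
use strict;
use warnings;

# Hypothetical input: each inner list stands in for the items found in one file.
my @batches = ( [qw(a b a)], [qw(c)] );

my @summaries;
for my $batch (@batches) {
    my @seen_items;                                  # fresh for every iteration

    # The anonymous sub captures this iteration's @seen_items.
    my $collect = sub { push @seen_items, $_[0] };

    $collect->($_) for @$batch;                      # stand-in for the parser calling a handler

    push @summaries, scalar @seen_items;             # no manual undef needed afterwards
}

print "@summaries\n";
```

Nothing leaks from one iteration into the next, so there is nothing to reset by hand.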

    The second thing I noticed was some commented out debugging code:

    #my($k, $v);
    #while ( ($k,$v) = each %reference_hist ) {
    #print "$k => $v\n";
    #}

    For debugging, Data::Dumper (a core module) is very helpful; you can just write

    use Data::Dumper;
    ...
    print Dumper \%reference_hist;

    to get good debugging output for just about any data structure.
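
As a small self-contained sketch of that suggestion: the hash contents below are invented (one ID borrowed from the sample data, one placeholder), and setting $Data::Dumper::Sortkeys is an optional extra that just makes the output order stable:

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical tally standing in for %reference_hist from the original post.
my %reference_hist = (
    'MTPW000020090731e57v004mr' => 2,    # seen twice, i.e. one duplicate
    'EXAMPLE0001'               => 1,    # placeholder ID, not from real data
);

$Data::Dumper::Sortkeys = 1;             # optional: sort keys for stable output

# Pass a reference so the whole hash is dumped as one structure.
my $dump = Dumper( \%reference_hist );
print $dump;
```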

    I hope this is useful to you. And it's great to see a self-styled beginner using strict and warnings!

    Perl 6 - links to (nearly) everything that is Perl 6.
      Wow!

      Thanks for these comments. I could sense that my "action at a distance" solution was probably not kosher, and I really like how you call an anonymous function in twig_roots and assign to the array from there. Really, thanks to both of you for commenting. I took the advice from your colleague below and updated my post with a link to a sample data file that is downloadable without registering, downloaded perltidy, ran my code through it, and reposted that, and added a snippet of data from one of the files I am parsing.

      Much thanks to both of you.
Re: Returning the right results, but at the wrong time from subroutines
by toolic (Bishop) on Sep 06, 2009 at 23:03 UTC
    Also completely unrelated to your problem... you can eliminate your print_file_name sub, and use the basename function from the File::Basename module instead. Less code for you to maintain, and it's more portable.
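
A minimal sketch of that replacement, assuming a made-up path (sample_data/file1.xml is only an illustration, not one of the real files):

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical path, shaped like the "$dir/$_" strings built by get_file_list.
my $filepath = 'sample_data/file1.xml';

# basename strips the directory part, like the rindex/substr pair in print_file_name.
my $name = basename($filepath);
print "File name: $name\n";
```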

    Regarding your problem... if it is possible for you to post a very small sample of your XML code here, please do so, as that would allow any of us to easily duplicate your problem. I followed your link, but it asked me to create some kind of account. That's too much of a commitment for me :)

    Also, your code layout is a little difficult to follow. I used perltidy to clean it up a bit. You could do that also, and re-post, if you are so inclined.

    Update: I believe you should call $twig->parsefile() immediately after you call the twig constructor (new()), not near the end of your foreach loop.

      Thanks for the comments. I have updated my post to include an easier download of the sample data directory, and posted a snippet of the aforementioned xml.
Re: Returning the right results, but at the wrong time from subroutines
by Jenda (Abbot) on Sep 07, 2009 at 23:12 UTC

    Looks like within the loop you first create the XML::Twig parser, then use the global variables containing the extracted data, then you clear those variables and THEN you parse the XML file and fill in the globals.

    That doesn't look right. And explains the behaviour. For the first iteration the globals are empty, for subsequent iterations they contain the data extracted from the previous file.

    You want to move the ->parsefile() and ->purge() calls immediately under the my $twig = XML::Twig -> new(...);.
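
A rough sketch of that ordering, under a few assumptions: the two documents in %docs are invented stand-ins for the files in the directory, parse() on a string is used instead of parsefile() only so the example is self-contained, and the per-file variables are declared inside the loop. XML::Twig is not a core module and must be installed.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical in-memory documents; the real program would loop over file paths.
my %docs = (
    'a.xml' => '<articles>'
             . '<article><reference>x/Article/ID1</reference><company code="aaa"/></article>'
             . '<article><reference>x/Article/ID1</reference><company code="aaa"/></article>'
             . '</articles>',
    'b.xml' => '<articles>'
             . '<article><reference>x/Article/ID2</reference><company code="bbb"/></article>'
             . '</articles>',
);

my %summary;
for my $name ( sort keys %docs ) {
    my ( @reference_list, @company_list );    # fresh for every file

    my $twig = XML::Twig->new(
        twig_roots => {
            'article/reference' => sub {
                my ( $t, $elt ) = @_;
                my $text = $elt->text;
                # keep only the ID after the last slash
                push @reference_list, substr( $text, rindex( $text, '/' ) + 1 );
            },
            company => sub { push @company_list, $_[1]->att('code') },
        },
    );

    # 1. Parse first, so the handlers fill the lists for THIS document.
    $twig->parse( $docs{$name} );    # with real files: $twig->parsefile($filepath)
    $twig->purge;

    # 2. Only then summarise what was collected.
    my %reference_hist;
    $reference_hist{$_}++ for @reference_list;
    my $uni_count = keys %reference_hist;
    my $dup_count = @reference_list - $uni_count;

    $summary{$name} = { unique => $uni_count, duplicates => $dup_count };
    print "$name: $uni_count unique, $dup_count duplicate\n";
}
```

With the parse step ahead of the tally, each line of output describes the document named on that line rather than the previous one.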

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.