Hello Monks,

Sorry about this title, but I don't know what the correct terminology for my current woes would be. Any assistance is greatly appreciated.

I am using some subroutines to grab data out of XML files and then return summary information, such as the total number of article elements. I am also using a unique reference ID to check for duplicate article elements.

It's important that the program keeps track of what files it's parsing, and returns the corresponding summary data with the right file.

As you can see if you run my program on the directory of sample XML data, the results correspond to the file before the one listed. Whatever the first file is, I get a double-zero result for the first iteration while the program "revs up".
I know my error must be a simple one, but I have had no luck finding the solution, maybe for lack of knowing the terminology. Any clues would be greatly appreciated.

I am new to programming, just trying to stumble my way through this project, so if you see something in my code that makes you roll your eyes, please let me know. I'm looking for helpful criticism.

All the best,

Leigh

Update: I changed the download link to a site with no registration, posted a data snippet at the bottom, and ran my code through perltidy. Thanks for the tips.

Here is the code.

#!/usr/bin/perl

# turn on perl safety features
use strict;
use warnings;

#initialize modules
use XML::Twig;
use DirHandle;

my ($dir, $filepath, @filename, @filepath_list, %company_hist,
    @company_list, %reference_hist, @reference_list);

$dir = $ARGV[0] or die "Must specify directory";
@filepath_list = get_file_list($dir);

foreach $filepath (@filepath_list) {
    my $twig = XML::Twig -> new(
        twig_roots => {
            'article/reference' => \&get_ref,
            company             => \&get_code
        });

    $company_hist{$_}++ for @company_list;            #sort results from
    my @unique_comp = keys %company_hist;             #"get_code" sub then
    @company_list = ( sort { $company_hist{$b} <=>    #return 3 most freq
                             $company_hist{$a} } @unique_comp )[0..2];  #codes.

    #my $ref_length = scalar( @reference_list );      #take ref list and
    $reference_hist{$_}++ for @reference_list;        #eliminate duplicates
    my @unique_ref = keys %reference_hist;            #return a tally
    my $uni_count = scalar ( @unique_ref );
    my @by_date_tally = get_date(\@unique_ref);
    my $dup_count = (scalar( @reference_list ) - $uni_count);

    #my($k, $v);
    #while ( ($k,$v) = each %reference_hist ) {
    #print "$k => $v\n";
    #}

    #print "File name: ", print_file_name($filepath), "\n";
    #print "Total Dupicate Articles: $dup_count\n";
    #print "Total Articles Found: $uni_count\n";
    #print "@company_list\n";

    undef %company_hist;                              #reinitialize global
    undef @company_list;                              #vars.
    undef %reference_hist;                            #
    undef @reference_list;                            #

    $twig->parsefile($filepath);                      # purge to save mem.
    $twig->purge;
}    #end of foreach loop

exit(0);

sub get_file_list {
    $dir = shift;
    print $dir, "\n";
    my $dh = DirHandle->new($dir) or die "can't open directory";
    return sort                  # sort pathnames
        grep { -f }              # choose only files
        map  { "$dir/$_" }       # create full paths
        grep { !/^\./ }          # filter out dot files
        $dh->read();             # read all filenames
}

sub print_file_name {
    my($path, $position, $path_strip);                #take filepath and
    $path = $_[0];                                    #return filename.
    $position = rindex($path,"/") + 1;                #
    $path_strip = substr($path, $position);
    #print "For file: $path_strip\n";
    return $path_strip;
}

sub get_code {
    my $company;                                      #get company code
    my( $twig, $elt)= @_;                             #attribute and
    $company = $elt->{'att'}->{'code'};               #put into array
    push @company_list, $company;
    #return @company_list;
}

sub get_ref {
    my( $twig, $elt)= @_;                             #take reference elt
    my $ref = $elt;                                   #and return just the
    my $position = rindex($ref->text(), "/") + 1;     #reference ID string
    my $ref_strip = substr($ref->text(), $position);
    push @reference_list, $ref_strip;
}

sub get_date {
    my $ref;
    #my @refs = @_;
    foreach $ref (@_) {
        print @$ref, "\t";
    }
    print "\n\n";
}

And here is some sample data. Normally it's in several files in a directory.
<article>
  <accessionno>MTPW000020090731e57v004mr</accessionno>
  <reference>distdoc:archive/ArchiveDoc::Article/MTPW000020090731e57v004mr</reference>
  <baselanguage>EN</baselanguage>
  <copyright>(c) 2009 M2 Communications, Ltd. All Rights Reserved. </copyright>
  <headline>
    <paragraph display="Proportional" truncation="None" lang="EN">Anadys Pharmaceuticals, Inc (NASDAQ:ANDS) is the Highest Percentage Gainers Among NASDAQ Stocks During Morning Trading Hours; Microsoft Corporation (NASDAQ:<hlt>MSFT</hlt>) And Orthofix International NV (NASDAQ:OFIX) Round Out Top Three Percentage Gainers During Morning Trading Hours</paragraph>
  </headline>
  <publicationdate>
    <date>2009-07-31</date>
  </publicationdate>
  <sourcename>M2 Presswire</sourcename>
  <company code="oficks">
    <name>Orthofix International N.V.</name>
    <newsmentions>0</newsmentions>
    <newshits>0</newshits>
  </company>
  <company code="scrptg">
    <name>Anadys Pharmaceuticals Inc</name>
    <newsmentions>0</newsmentions>
    <newshits>0</newshits>
  </company>
  <company code="mcrost">
    <name>Microsoft Corporation</name>
    <newsmentions>0</newsmentions>
    <newshits>0</newshits>
  </company>
  <industry code="i3302">
    <name>Computers/Electronics</name>
  </industry>
  <industry code="i3302021">
    <name>Applications Software</name>
  </industry>
  <industry code="i257">
    <name>Pharmaceuticals</name>
  </industry>
  <industry code="i330202">
    <name>Software</name>
  </industry>
  <industry code="icomp">
    <name>Computing</name>
  </industry>
  <industry code="i3302020">
    <name>Systems Software</name>
  </industry>
  <industry code="i372">
    <name>Medical Equipment/Supplies</name>
  </industry>
  <industry code="i951">
    <name>Health Care</name>
  </industry>
  <industry code="iphmed">
    <name>Medical/Surgical Instruments/Apparatus/Devices</name>
  </industry>
  <region code="usa">
    <name>United States</name>
  </region>
  <region code="namz">
    <name>North American Countries/Regions</name>
  </region>
  <newssubject code="c42" position="0">
    <name>Labor/Personnel Issues</name>
  </newssubject>
  <newssubject code="ghepat" position="0">
    <name>Hepatitis</name>
  </newssubject>
  <newssubject code="mstock" position="0">
    <name>Stock Exchanges</name>
  </newssubject>
  <newssubject code="npress" position="0">
    <name>Press Release</name>
  </newssubject>
  <newssubject code="ccat" position="0">
    <name>Corporate/Industrial News</name>
  </newssubject>
  <newssubject code="gcat" position="0">
    <name>Political/General News</name>
  </newssubject>
  <newssubject code="ghea" position="0">
    <name>Health</name>
  </newssubject>
  <newssubject code="gmed" position="0">
    <name>Medical Conditions</name>
  </newssubject>
  <newssubject code="m11" position="0">
    <name>Equity Markets</name>
  </newssubject>
  <newssubject code="mcat" position="0">
    <name>Commodity/Financial Market News</name>
  </newssubject>
  <newssubject code="ncat" position="0">
    <name>Content Types</name>
  </newssubject>
  <newssubject code="nfact" position="0">
    <name>Factiva Filters</name>
  </newssubject>
  <newssubject code="nfce" position="0">
    <name>FC&amp;E Exclusion Filter</name>
  </newssubject>
  <newssubject code="nfcpin" position="0">
    <name>FC&amp;E Industry News Filter</name>
  </newssubject>
  <sourcecode>MTPW</sourcecode>
  <tailparagraphs> </tailparagraphs>
  <contact>Liquid Tycoon | e-mail: info@LiquidTycoon.com | Tel: +1 214 556 6798 </contact>
  <logo source="http://logos.factiva.com.ezproxy.insead.edu" image="mtpwLogo.gif"></logo>
  <wordcount>679</wordcount>
</article>

In reply to Returning the right results, but at the wrong time from subroutines by leighgable
