Sorry about this title, but I don't know what the correct terminology for my current woes would be. Any assistance is greatly appreciated.
I am using some subroutines to grab data out of XML files and then return some summary information, such as the total number of article elements. I am also using a unique reference ID to check for duplicate article elements.
It's important that the program keeps track of what files it's parsing, and returns the corresponding summary data with the right file.
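To make the duplicate check concrete, here is a minimal sketch of the tally I am after. The reference IDs and the `%seen` hash are made up for illustration and are not taken from my real data; the idea is just to count each ID in a hash and compare the number of distinct keys with the length of the list.

```perl
use strict;
use warnings;

# Hypothetical reference IDs, as they might be collected from one file.
my @reference_list = qw(A1 B2 A1 C3 B2 A1);

# Count how often each ID occurs.
my %seen;
$seen{$_}++ for @reference_list;

# Distinct IDs are the hash keys; everything beyond that is a duplicate.
my $uni_count = keys %seen;
my $dup_count = @reference_list - $uni_count;

print "Total Articles Found: $uni_count\n";
print "Total Duplicate Articles: $dup_count\n";
```

With these six made-up IDs there are three distinct ones, so three entries count as duplicates.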
As you can see if you run my program on the directory of sample XML data, the results correspond to the file before the one listed, and whatever the first file is, I get a double-zero result for the first iteration while the program "revs up".
I know my error must be a simple one, but I have had no luck finding the solution, maybe for lack of knowing the terminology. Any clues would be greatly appreciated.
I am new to programming, just trying to stumble my way through this project, so if you see something in my code that makes you roll your eyes, please let me know. I'm looking for helpful criticism.
All the best,
Leigh
Update: I changed the download link to a site with no registration, posted a data snippet at the bottom, and ran my code through perltidy. Thanks for the tips.
Here is the data.

#!/usr/bin/perl
# turn on perl safety features
use strict;
use warnings;
#initialize modules
use XML::Twig;
use DirHandle;
my (
    $dir,          $filepath,     @filename,       @filepath_list,
    %company_hist, @company_list, %reference_hist, @reference_list
);
$dir = $ARGV[0] or die "Must specify directory";
@filepath_list = get_file_list($dir);
foreach $filepath (@filepath_list)
{
    my $twig = XML::Twig->new(
        twig_roots => {
            'article/reference' => \&get_ref,
            company             => \&get_code
        }
    );

    # sort results from the "get_code" sub, then keep the 3 most frequent codes
    $company_hist{$_}++ for @company_list;
    my @unique_comp = keys %company_hist;
    @company_list =
      ( sort { $company_hist{$b} <=> $company_hist{$a} } @unique_comp )[ 0 .. 2 ];

    # take the ref list, eliminate duplicates, and return a tally
    #my $ref_length = scalar( @reference_list );
    $reference_hist{$_}++ for @reference_list;
    my @unique_ref    = keys %reference_hist;
    my $uni_count     = scalar(@unique_ref);
    my @by_date_tally = get_date( \@unique_ref );
    my $dup_count     = ( scalar(@reference_list) - $uni_count );

    #my($k, $v);
    #while ( ($k,$v) = each %reference_hist ) {
    #print "$k => $v\n";
    #}
    #print "File name: ", print_file_name($filepath), "\n";
    #print "Total Duplicate Articles: $dup_count\n";
    #print "Total Articles Found: $uni_count\n";
    #print "@company_list\n";

    # reinitialize global vars
    undef %company_hist;
    undef @company_list;
    undef %reference_hist;
    undef @reference_list;

    $twig->parsefile($filepath);
    $twig->purge;    # purge to save mem.
}    #end of foreach loop
exit(0);
sub get_file_list
{
    $dir = shift;
    print $dir, "\n";
    my $dh = DirHandle->new($dir) or die "can't open directory";
    return sort             # sort pathnames
      grep { -f }           # choose only files
      map  { "$dir/$_" }    # create full paths
      grep { !/^\./ }       # filter out dot files
      $dh->read();          # read all filenames
}
sub print_file_name
{
    # take filepath and return filename
    my ( $path, $position, $path_strip );
    $path       = $_[0];
    $position   = rindex( $path, "/" ) + 1;
    $path_strip = substr( $path, $position );

    #print "For file: $path_strip\n";
    return $path_strip;
}
sub get_code
{
    # get company code attribute and put it into the array
    my $company;
    my ( $twig, $elt ) = @_;
    $company = $elt->{'att'}->{'code'};
    push @company_list, $company;

    #return @company_list;
}
sub get_ref
{
    # take reference elt and return just the reference ID string
    my ( $twig, $elt ) = @_;
    my $ref       = $elt;
    my $position  = rindex( $ref->text(), "/" ) + 1;
    my $ref_strip = substr( $ref->text(), $position );
    push @reference_list, $ref_strip;
}
sub get_date
{
    my $ref;

    #my @refs = @_;
    foreach $ref (@_) {
        print @$ref, "\t";
    }
    print "\n\n";
}
And here is some sample data. Normally it's in several files in a directory.
<article>
<accessionno>MTPW000020090731e57v004mr</accessionno>
<reference>distdoc:archive/ArchiveDoc::Article/MTPW000020090731e57v004mr</reference>
<baselanguage>EN</baselanguage>
<copyright>(c) 2009 M2 Communications, Ltd. All Rights Reserved. </copyright>
<headline>
<paragraph display="Proportional" truncation="None" lang="EN">Anadys Pharmaceuticals, Inc (NASDAQ:ANDS) is the Highest Percentage Gainers Among NASDAQ Stocks During Morning Trading Hours; Microsoft Corporation (NASDAQ:<hlt>MSFT</hlt>) And Orthofix International NV (NASDAQ:OFIX) Round Out Top Three Percentage Gainers During Morning Trading Hours</paragraph>
</headline>
<publicationdate>
<date>2009-07-31</date>
</publicationdate>
<sourcename>M2 Presswire</sourcename>
<company code="oficks">
<name>Orthofix International N.V.</name>
<newsmentions>0</newsmentions>
<newshits>0</newshits>
</company>
<company code="scrptg">
<name>Anadys Pharmaceuticals Inc</name>
<newsmentions>0</newsmentions>
<newshits>0</newshits>
</company>
<company code="mcrost">
<name>Microsoft Corporation</name>
<newsmentions>0</newsmentions>
<newshits>0</newshits>
</company>
<industry code="i3302">
<name>Computers/Electronics</name>
</industry>
<industry code="i3302021">
<name>Applications Software</name>
</industry>
<industry code="i257">
<name>Pharmaceuticals</name>
</industry>
<industry code="i330202">
<name>Software</name>
</industry>
<industry code="icomp">
<name>Computing</name>
</industry>
<industry code="i3302020">
<name>Systems Software</name>
</industry>
<industry code="i372">
<name>Medical Equipment/Supplies</name>
</industry>
<industry code="i951">
<name>Health Care</name>
</industry>
<industry code="iphmed">
<name>Medical/Surgical Instruments/Apparatus/Devices</name>
</industry>
<region code="usa">
<name>United States</name>
</region>
<region code="namz">
<name>North American Countries/Regions</name>
</region>
<newssubject code="c42" position="0">
<name>Labor/Personnel Issues</name>
</newssubject>
<newssubject code="ghepat" position="0">
<name>Hepatitis</name>
</newssubject>
<newssubject code="mstock" position="0">
<name>Stock Exchanges</name>
</newssubject>
<newssubject code="npress" position="0">
<name>Press Release</name>
</newssubject>
<newssubject code="ccat" position="0">
<name>Corporate/Industrial News</name>
</newssubject>
<newssubject code="gcat" position="0">
<name>Political/General News</name>
</newssubject>
<newssubject code="ghea" position="0">
<name>Health</name>
</newssubject>
<newssubject code="gmed" position="0">
<name>Medical Conditions</name>
</newssubject>
<newssubject code="m11" position="0">
<name>Equity Markets</name>
</newssubject>
<newssubject code="mcat" position="0">
<name>Commodity/Financial Market News</name>
</newssubject>
<newssubject code="ncat" position="0">
<name>Content Types</name>
</newssubject>
<newssubject code="nfact" position="0">
<name>Factiva Filters</name>
</newssubject>
<newssubject code="nfce" position="0">
<name>FC&amp;E Exclusion Filter</name>
</newssubject>
<newssubject code="nfcpin" position="0">
<name>FC&amp;E Industry News Filter</name>
</newssubject>
<sourcecode>MTPW</sourcecode>
<tailparagraphs>
</tailparagraphs>
<contact>Liquid Tycoon
| e-mail: info@LiquidTycoon.com
| Tel: +1 214 556 6798 </contact>
<logo source="http://logos.factiva.com.ezproxy.insead.edu" image="mtpwLogo.gif"></logo>
<wordcount>679</wordcount>
</article>