PerlMonks  

Is there a faster Way to parse

by mmittiga17 (Scribe)
on Sep 24, 2008 at 18:33 UTC ( [id://713485]=perlquestion )

mmittiga17 has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's threshold of quality. You may see it by logging in.

Replies are listed 'Best First'.
Re: Is there a faster Way to parse
by jethro (Monsignor) on Sep 24, 2008 at 19:25 UTC

    Whenever you have a loop searching for something, you should ask yourself whether a hash couldn't do it better.

    In this case you want to collect lines with the same ID. So you might do this using a Hash of Arrays:

    my %usernums;    # usernum => [ all lines that start with that usernum ]
    while( my $line = <> ){
        next unless $line;
        chomp $line;
        next if ($line =~ /EQUITY SRFDYNAM/);
        my $USERNUM = (split /\s+/, $line)[0];
        # push(@list, $USERNUM);
        push @{$usernums{$USERNUM}}, $line
    }

    Now all the lines with the same usernum are grouped together, so looping through them gets easy:

    foreach my $usernum ( keys %usernums ) {
        foreach my $line ( @{$usernums{$usernum}} ) {
            $USERID = substr ...
            ...
        }
        print OUT ...
    }

    The time-consuming part of your old code was that you had to search all 500k lines for every usernum. That is roughly 500k*100k accesses to @LINES if each usernum had on average 5 lines in the file. The new method makes just 100k*5 accesses to %usernums, or one per line.
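
A minimal, self-contained sketch of the hash-of-arrays grouping described above. The sample lines and the ID values are made up for illustration; only the technique comes from the reply.

```perl
use strict;
use warnings;

# Hypothetical sample records: an ID in the first column, then a record type.
my @lines = (
    'A001 CA0 first record',
    'A002 CA0 first record',
    'A001 CD0 second record',
    'A002 CD1 second record',
    'A001 CX4 third record',
);

# One pass over the input: file every line under its ID (hash of arrays).
my %usernums;
for my $line (@lines) {
    my $usernum = (split ' ', $line)[0];
    push @{ $usernums{$usernum} }, $line;
}

# Each ID now reaches its own lines directly, with no rescan of @lines.
for my $usernum (sort keys %usernums) {
    my $count = @{ $usernums{$usernum} };
    print "$usernum: $count lines\n";
}
```

Each line is touched once while building the hash and once while reporting, which is where the saving over a nested scan comes from.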

Re: Is there a faster Way to parse
by moritz (Cardinal) on Sep 24, 2008 at 18:39 UTC
    Sure, just use a hash. But since that has been suggested multiple times it's probably useless to tell you again.
Re: Is there a faster Way to parse
by jwkrahn (Abbot) on Sep 24, 2008 at 20:20 UTC

    You could try it like this:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my %template = (
        CA0 => { headers => [ 'userid', 'rectype', 'asset' ], format => 'x12 A A3 x35 A' },
        CD0 => { headers => [ 'ides' ], format => 'x18 A89' },
        CX4 => { headers => [ 'extk' ], format => 'x18 A16' },
        CD1 => { headers => [ 'ctry', 'sect', 'typ2', 'sicd', 'igc' ], format => 'x18 A3 A3 A2 A4 A2' },
        );

    open OUT, '>', 'IDC_EQ.CSV' or die "Cannot open 'IDC_EQ.CSV' $!";

    my %data;
    while ( <> ) {
        next if !/\S/ or /EQUITY SRFDYNAM/;
        my ( $item, $userid ) = unpack 'A12 A4', $_;
        @{ $data{ $item } }{ @{ $template{ $userid }{ headers } } } = unpack $template{ $userid }{ format }, $_
            if exists $template{ $userid };
        if ( keys %{ $data{ $item } } == 10 ) {
            print OUT join( ',', $item, @{ $data{ $item } }{ qw/userid rectype asset ides extk ctry sect typ2 sicd igc/ } ), "\n";
            delete $data{ $item };
            }
        }

    __END__
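
For readers unfamiliar with unpack templates, here is a small sketch of how the 'A12 A4' and 'x18 A16' formats above pick fields out of a fixed-width line. The record is hypothetical and is built with sprintf so the columns are exact; the real file layout may differ.

```perl
use strict;
use warnings;

# Hypothetical fixed-width record: a 12-column key, a 4-column record type,
# 2 filler columns, then a 16-column payload field.
my $line = sprintf '%-12s%-4s%-2s%-16s', 'A02545142', 'CX4', '', 'GRHVF';

# 'A12 A4' reads the first 12 and the next 4 columns; 'A' strips trailing spaces.
my ( $item, $type ) = unpack 'A12 A4', $line;

# 'x18' skips 18 columns, then 'A16' reads the 16-column field that follows.
my ($extk) = unpack 'x18 A16', $line;

print "$item|$type|$extk\n";    # prints A02545142|CX4|GRHVF
```

A single unpack call extracts every field of a record at once, which replaces a series of separate substr calls.
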
Re: Is there a faster Way to parse
by GrandFather (Saint) on Sep 24, 2008 at 22:12 UTC

    I strongly recommend that you use strictures (use strict; use warnings;). Providing a sample that actually runs and gives the output you expect would help a lot if you want an accurate solution.

    That aside, the following code makes one pass through the file to build a hash keyed by (assumed) user id and (assumed) line reference. It then runs through the hash to generate each report line - one line per user id. Replace the ellipsis with the sample data.

    use strict;
    use warnings;

    my $data = <<DATA;
    ...
    DATA

    my %userData;

    open my $inData, '<', \$data;
    while (<$inData>) {
        chomp;
        next if /EQUITY SRFDYNAM/;
        my ($userNum, $ref, $tail) = split ' ', $_, 3;
        next unless defined $tail;
        $userData{$userNum}{$ref} = $tail;
    }
    close $inData;

    for my $userNum (sort keys %userData) {
        my $line = $userData{$userNum}{CA0};
        my $USERID = substr $line, 0, 1;
        my $RECTYPE = substr $line, 1, 2;
        my $ASSET = substr $line, 39, 1;

        $line = $userData{$userNum}{CD0};
        my $IDES = substr $line, 2, 10;

        $line = $userData{$userNum}{CX4};
        my $EXTK = substr $line, 2, 16;

        $line = $userData{$userNum}{CD1};
        my $CTRY = substr $line, 2, 3;
        my $SECT = substr $line, 1, 3;
        my $TYP2 = substr $line, 4, 2;
        my $SICD = substr $line, 6, 4;
        my $IGC = substr $line, 10, 2;

        print "$userNum,$USERID,$RECTYPE,$ASSET,$IDES,$EXTK,$CTRY,$SECT,$TYP2,$SICD,$IGC\n";
    }

    Prints:

    A02545142,0,12, ,GENERALI H,GRHVF ,OUT,1OU,TF,C ,00
    A03987103,0,12, ,AAP IMPLAN,APIPF ,OUT,1OU,TF,C ,38
    A05345110,0,12, ,AT & S AUS,AAHKF ,OUT,1OU,TF,C ,00

    Perl reduces RSI - it saves typing
      Sorry for the late response. This worked great, and I was able not only to learn from it but also to apply it to many other scripts I needed to write. Thanks so much for your help.
Re: Is there a faster Way to parse
by dsheroh (Monsignor) on Sep 25, 2008 at 02:26 UTC
    - If you're at all concerned about performance, then why are you repeatedly evaluating the same regex?
    foreach $line (@LINES){
        if ($line =~ /^$item/){
            if ($line =~ /CA0 /) {
                $USERID = substr $line, 12, 1;
                $RECTYPE = substr $line, 13, 2;
                $ASSET = substr $line, 51, 1;
            }
            $IDES = substr $line, 18, 89 if ($line =~ /CD0 /);
            $EXTK = substr $line, 18, 16 if ($line =~ /CX4 /);
            if ($line =~ /CD1 /) {
                $CTRY = substr $line, 18, 3;
                $SECT = substr $line, 21, 3;
                $TYP2 = substr $line, 24, 2;
                $SICD = substr $line, 26, 4;
                $IGC = substr $line, 30, 2;
            }
        }
    }
    is equivalent and should run a bit faster since it only runs the CA0 and CD1 regexes once each.

    - In the sample data, the CA0/CD0/CX4/CD1 identifiers are always found at the same position within each line when present. Using substr and testing equality instead of running regexes would be a good deal faster. It would also avoid false positives if those sequences happen to appear elsewhere on the line. If their position isn't actually fixed in the real data, using index to check for their presence would also be faster than a regex, although not as fast as substr.

    - In the sample data, the four codes you're looking for appear to be mutually exclusive. You can probably improve performance a bit by using if... elsif... elsif... and ordering them so that the most common is checked first, then the next-most-common, and so on. Once one is matched, this would prevent the rest from being checked and avoid unnecessary work.
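
The last two points can be combined into one rough sketch: read the record type from a fixed position with substr, compare it with eq, and chain the tests with elsif. The offset, the sample records, and the order of the tests are all assumptions for illustration; the real offset and the most-common-first order have to come from the actual data.

```perl
use strict;
use warnings;

# Assumption: the record type occupies 3 columns starting at offset 12.
my $TYPE_OFFSET = 12;

# Hypothetical records. The second one mentions "CD1 " in its payload,
# which an unanchored regex such as /CD1 / would also match.
my @lines = (
    sprintf( '%-12s%-6s%s', 'A001', 'CA0', 'payload' ),
    sprintf( '%-12s%-6s%s', 'A001', 'CD0', 'see CD1 for details' ),
    sprintf( '%-12s%-6s%s', 'A001', 'CD1', 'payload' ),
);

my %seen;
for my $line (@lines) {
    # One substr per line, then plain string comparisons; the chain stops
    # at the first match, so later tests are skipped.
    my $type = substr $line, $TYPE_OFFSET, 3;
    if    ( $type eq 'CA0' ) { $seen{CA0}++ }
    elsif ( $type eq 'CD0' ) { $seen{CD0}++ }
    elsif ( $type eq 'CX4' ) { $seen{CX4}++ }
    elsif ( $type eq 'CD1' ) { $seen{CD1}++ }
}

print "$_: $seen{$_}\n" for sort keys %seen;
```

Here the second record counts only as CD0, even though its payload contains "CD1 ", because only the fixed type column is inspected.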

Node Type: perlquestion [id://713485]
Approved by Hercynium