parsing question

smist has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: parsing question
by BrowserUk (Patriarch) on Dec 07, 2006 at 04:01 UTC

As the file is in several distinct sections each with it's own format, first break the file into those sections. As it's also a fairly small file, slurp it and split it on the section separator:

## Slurp the file and break into sections
my @sections = split '-{103}', do{ local $/; <> };
close *ARGV;
[download]

Once you have the sections separated, you can treat each one differently.

Rather than having to count all the spaces and manually construct the unpack formats for dealing with them--which is a PIA and they also might change--notice that each section of stats is preceeded by a header line, and that the left hand edge of the column headers forms a left edge limit for the data in the columns.

Also notice that although some of the column titles are multiple words, every title is preceeded by at least two spaces, whilst the multi-word titles themselves contain only single spaces.

That information allows you to use the header lines to construct the unpack formats programmically. The following subroutine takes a header line, uses it to discover the column boundaries and then uses those to construct a format:

sub buildFmt {
    my $templ = shift;
    my @cols;
    push @cols, $-[ 0 ] while $templ =~ m[(?<=\s\s)(?=\S)]gc;
    my $p = 0;
    my $fmt = '';
    $fmt .= 'A' . ( $_ - $p ) . ' ' and $p = $_ for @cols;
    return $fmt;
}
[download]

This can be reused for all the columnised (sub)sections in the file.

By way of example, this is how to use it break down the four subsections of the 'PLAYER GAME STATISTICS':

## Section 2
## Break the section into lines
my @section2 = split "\n", $sections[ 1 ];

## discard header lines;
shift @section2 for 1 .. 3; 

## Away

## Construct the format from the header
my $fmt = buildFmt( shift @section2 ); 

## Use it to parse the Away player game stats
my @awayStats;
push @awayStats, [ unpack $fmt, shift @section2 ] while $section2[ 0 ]
+ =~ m[\S];
print "@$_" for @awayStats;

## Discard blank lines
shift @section2 while $section2[ 0 ] !~ m[\S];

## Away totals ... same two steps again
$fmt = buildFmt( shift @section2 );
my @awayTotals = unpack $fmt, shift @section2;
print "@awayTotals";

## Discard blank lines
shift @section2 while $section2[ 0 ] !~ m[\S];

## home  ... and again
$fmt = buildFmt( shift @section2 ); ## They could vary.
my @homeStats;
push @homeStats, [ unpack $fmt, shift @section2 ] while $section2[ 0 ]
+ =~ m[\S];
print "@$_" for @homeStats;

## Discard blank lines
shift @section2 while $section2[ 0 ] !~ m[\S];

## Home totals ... and again
$fmt = buildFmt( shift @section2 );
my @homeTotals = unpack $fmt, shift @section2;
print "@homeTotals";
[download]

Parsing the other sections (with columnised data), is just a repeat of the above.

The code all together as far as I've taken it:

Read more... (2 kB)

The output from section 2 debugging lines:

c:\test>588245 boxscore_340251.txt
4 | Drevitch, Scott |  | 1 |  +1 | 4 |  |  |  |  |  |  |  |  |  |  |  
+|  |
8 | Mann, Chris |  | 1 |  +1 | 2 | 4 | 2 |  |  |  |  |  |  |  |  |  | 
+ |
9 | Anderson, Erik | 1 |  |  +1 | 4 |  |  |  |  |  |  |  |  |  |  |  |
+  |
11 | Lefebvre, Marc |  |  |  -1 |  |  |  |  |  |  |  |  |  |  |  |  | 
+ |
14 | Scott, Mark |  | 1 |   0 | 5 |  |  |  |  |  |  |  |  |  |  |  |  
+|
17 | Wray, Scott |  |  |   0 | 10 | 4 | 2 |  |  |  |  |  |  |  |  |  |
+  |
18 | Lazarev, Yevgeny | 1 |  |   0 | 1 |  |  |  |  |  |  |  |  |  |  |
+  |  |
20 | Miller, Derek |  |  |  -2 | 1 |  |  |  |  |  |  |  |  |  |  |  | 
+ |
21 | Fitzpatrick, Chans |  | 1 |   0 |  | 2 | 1 |  |  |  |  |  |  |  |
+  |  |  |
22 | Cullaton, Brent |  | 1 |  +1 | 4 |  |  |  |  |  |  |  |  |  |  | 
+ |  |
23 | Kotsopoulos, Tommy |  | 1 |  -1 | 4 | 5 |  | 1 |  |  |  |  |  |  
+|  |  |  |
24 | Morelli, David | 1 |  |   0 | 2 |  |  |  |  |  |  |  |  |  |  |  
+|  |
27 | Littlejohn, Frank | 1 | 1 |  +1 | 3 | 6 | 3 |  |  |  |  |  | 1 | 
+ |  |  |  |
28 | Lyke, R.C. |  |  |  +1 |  |  |  |  |  |  |  |  |  |  |  |  |  |
29 | Gajda, Tyson |  |  |   0 |  |  |  |  |  |  |  |  |  |  |  |  |  |
30 | Tebbs, Kris |  |  |   0 |  |  |  |  |  |  |  |  |  |  |  |  |  |
32 | Tidball, Curtis |  |  |   0 |  |  |  |  |  |  |  |  |  |  |  |  |
+  |
44 | Lupul, Dale |  |  |  -1 | 5 |  |  |  |  |  |  |  |  |  |  |  |  |
   TOTALS | 5 | 7 |  +1 | 46 | 23 | 9 | 1 |  |  |  |  | 1 |  |  | 1 | 
+1 |
2 | Lupandin, Andrei |  | 3 |  +1 | 1 |  |  |  |  |  |  |  |  |  |  | 
+ |  |
4 | Currie, Brent |  |  |  -1 |  | 2 | 1 |  |  |  |  |  |  |  |  |  | 
+ |
5 | Yoder, Jami |  |  |  -1 |  | 2 | 1 |  |  |  |  |  |  |  |  |  |  |
7 | Radoslovich, Matt | 1 |  |  -1 | 2 |  |  |  |  |  |  |  |  |  |  |
+  |  |
8 | Pilkington, Brett |  | 1 |  +1 | 4 |  |  |  |  |  |  |  |  |  |  |
+  |  |
9 | Granbois, Travis |  |  |   0 |  |  |  |  |  |  |  |  |  |  |  |  |
+  |
12 | Nadeau, Patrick |  |  |  -2 | 1 |  |  |  |  |  |  |  |  |  |  |  
+|  |
13 | Parsons, Don | 2 | 1 |  +2 | 4 |  |  |  |  |  |  |  | 1 |  |  |  
+|  |
14 | Starke, Sean |  |  |   0 | 3 |  |  |  |  |  |  |  |  |  |  |  |  
+|
17 | Stewart, Blake |  | 1 |  -2 | 1 |  |  |  |  |  |  |  |  |  |  |  
+|  |
18 | Chwedoruk, Justin |  |  |   0 | 3 |  |  |  |  |  |  |  |  |  |  |
+  |  |
19 | Woollard, Chad | 1 | 1 |  +2 | 3 | 2 | 1 |  |  |  |  |  |  |  |  
+|  |  |
20 | Durdin, Sergei |  |  |  -2 | 2 |  |  |  |  |  |  |  |  |  |  |  |
+  |
24 | Harloff, Nick |  |  |   0 | 3 | 4 | 2 |  |  |  |  |  |  |  |  |  
+|  |
25 | Stauffacher, Luke |  |  |   0 | 4 | 7 | 1 | 1 |  |  |  |  |  |  |
+  |  |  |
27 | Wathier, Mathieu |  |  |  +1 | 3 |  |  |  |  |  |  |  |  |  |  | 
+ |  |
41 | Sikich, Zach |  | 1 |   0 |  |  |  |  |  |  |  |  |  |  |  |  |  
+|
83 | Tapp, Jason |  |  |   0 |  |  |  |  |  |  |  |  |  |  |  |  |  |
   TOTALS | 4 | 8 |  -2 | 34 | 17 | 6 | 1 |  |  |  |  | 1 |  |  |  |  
+|
[download]

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

[reply]
[d/l]
[select]

Re^2: parsing question

by smist (Acolyte) on Dec 07, 2006 at 13:44 UTC

my @sections = split '-{103}', do{ local $/; <> };
[download]

[reply]
[d/l]

Re^3: parsing question

by joeface (Pilgrim) on Dec 07, 2006 at 16:20 UTC

This sets the default line-separator (held in "$/", normally set to "\n") to undef, so that whatever comes in from STDIN (<>) will get slurped up, split on the string "-{103}", and assigned to @sections.

A short example... put the following into a script named "test.pl":

#!/usr/bin/perl -w

use strict;

my @sections = split 'FOO', do {local $/; <>};
close *ARGV;

print join("\n", @sections);
[download]

Then, put the following into a file called "test.txt":

FOObarFOObarFOObarFOObarFOObar
FOObarFOObarFOObarFOObarFOObar
FOObarFOObarFOObarFOObarFOObar
[download]

Then, run the following at the command-line:

cat test.txt | ./test.pl
[download]

[reply]
[d/l]
[select]

Re^4: parsing question

by smist (Acolyte) on Dec 07, 2006 at 17:42 UTC

Re: parsing question
by ikegami (Patriarch) on Dec 07, 2006 at 02:54 UTC

unpack

my $format = join '', map { "A$_" }
   3,  # Player number
   21, # Player name
   4,  # G
   4,  # A
   6,  # +/-
   5,  # S
   6,  # PIM
   6,  # MIN
   6,  # MAJ
   5,  # 10
   5,  # GM
   5,  # GR
   5,  # MT
   6,  # PPG
   6,  # SHG
   6,  # GWG
   6,  # OTG
   6,  # UAG
   6,  # ENG
   5;  # FG

while (<$fh>) {
   next unless /^\d+\s/;              # Skip headers.
   my @record = unpack($format, $_);  # Parse data.
   $record[4] =~ s/^\s+//;            # Fix "+/-" formatting.
   $sth->execute(@record);            # Add data to the database.
}
[download]

[reply]
[d/l]
[select]

Re: parsing question
by grep (Monsignor) on Dec 07, 2006 at 03:16 UTC

unpack


use strict;
use warnings;
use Data::Dumper;

my @fields = qw/num name G  A  plus S  PIM MIN MAJ ten GM GR MT PPG SH
+G GWG OTG UAG ENG FG/;
my $pattern = "A3   A21  A4 A5 A5   A5 A5  A5  A5  A5  A5 A5 A5 A5  A5
+  A5  A5  A5  A5  A5";
my @data;

while (my $line = <DATA>) {
    chomp($line);
    next if ($line !~ /^\d+ /);
    my %hash;
    @hash{@fields} = unpack($pattern,$line); # unpack the data accordi
+ng to the pattern
    push(@data,\%hash);
} 
print Dumper \@data;

## UPDATE: Reformated code
[download]

grep

XP matters not. Look at me. Judge me by my XP, do you?

[reply]
[d/l]

Re^2: parsing question

by ikegami (Patriarch) on Dec 07, 2006 at 06:54 UTC

First off you need to make sure you have permission to use the copyrighted material you linked to

Copyright doesn't protect statistics. It does copy documents containing statistics (such as the linked text file), but we're not making a copy of the text file[*].

There could be contractual (licensing, Terms of Service) issues, but no Copyright issues.

* — Well, technically the file is copied numerous times (e.g. from the TCP stream into RAM), but Copyright allows those copies since they are necessary steps to viewing the file.

[reply]

Re: parsing question
by wjw (Priest) on Dec 07, 2006 at 03:36 UTC

...the majority is always wrong, and always the last to know about it...

[reply]

Re^2: parsing question

by smist (Acolyte) on Dec 07, 2006 at 10:47 UTC

I need the player stats, including the goalies. I don't need the team stats. What I'm working on is creating a very simple fantasy hockey league for the league my local team is in. It will be a free league to play in for the fans.

[reply]

[OT]Re: parsing question
by japhy (Canon) on Dec 07, 2006 at 05:08 UTC

Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

[reply]

Re: [OT]Re: parsing question

by smist (Acolyte) on Dec 07, 2006 at 10:43 UTC

GM is Game Misconduct, I believe. I'm not sure of the other two. I'm not using all of the stats.

[reply]