Hi, bobdabuilda
You've given this much thought, and I think your pseudocode is on target.
The orders are only separated by a blank line, but they all start with the "Order ID:" text, so looking at using that as the separator.
The "Order ID:" as record separator makes sense.
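As a minimal sketch of that idea on its own (the `$report` string and its contents are made up for illustration, not taken from your file), setting `$/` to 'Order ID:' makes each read return one whole order:

```perl
use strict;
use warnings;

# Hypothetical miniature report: a page header followed by two orders
my $report = "Page header\n\nOrder ID:PO-1 first\n\nOrder ID:PO-2 second\n";

my @records;
{
    # Localize $/ so the custom separator only applies inside this block
    local $/ = 'Order ID:';

    # Read the string through an in-memory filehandle
    open my $fh, '<', \$report or die "Unable to open string: $!";
    @records = <$fh>;
    close $fh;

    # chomp strips the current $/, i.e. the trailing 'Order ID:'
    chomp @records;
}

# Element 0 is whatever came before the first 'Order ID:' (the page header)
my $header = shift @records;

print scalar(@records), " orders\n";
print "starts with: ", ( split ' ', $_ )[0], "\n" for @records;
```

This sketch chomps the separator away; if you want to match on "Order ID:" later, you can prepend it to each record instead.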
The page header should be automatically filtered out by the regex the way it stands anyway... I think.
You're correct.
I've taken the liberty to implement an interpretation of this. It does use two loops, but the outer loop is a for loop that iterates over an array of Order records:
use strict;
use warnings;
use Data::Dumper;

# Place a filename into $recordsFile to read Orders from that file
# else the Orders below __DATA__ will be used for demo purposes
my $recordsFile = '';

my @records;
my $recSeparator = 'Order ID:';

# Orders will initially be array elements 1 .. n in @records; element 0
# is initially the first page header
{
    # Set the record separator
    local $/ = $recSeparator;

    # If there's a file name, try to read from that file
    if ($recordsFile) {
        open my $fh, '<', $recordsFile or die $!;
        @records = <$fh>;
        close $fh;
    }
    else {
        @records = <DATA>;
    }
}

# Remove the first page header
shift @records;

# Add Order ID: back into each record for later matching
$_ = "$recSeparator$_" for @records;

# Iterate through each record (Order)
for my $record (@records) {
    my %hash;

    # Treat the record string like a file, opening it for reading
    open my $sh, '<', \$record or die "Unable to open record string: $!";

    # Read the string like a file, one line at a time now
    while (<$sh>) {
        $hash{orderID}        //= do { /Order ID:(\S+)/;          $1 };
        $hash{fiscalCycle}    //= do { /cycle:(\d+)/;             $1 };
        $hash{vendorID}       //= do { /Vendor ID:(\S+)/;         $1 };
        $hash{requisitionNum} //= do { /\s+(\d+).+requisition/;   $1 };
        $hash{copies}         //= do { /copies:(\d+)/;            $1 };
        $hash{title}          //= do { /Title:(.+)/;              $1 };
        $hash{'ISBN/ISSN'}    //= do { m{ISBN/ISSN:(\S+)};        $1 };

        # Distributions started?
        if (/Distribution--/) {

            # Save the current record separator
            my $oldRecSeparator = $/;

            # Set a new record separator
            local $/ = 'Distribution--';

            # Read the string like a file, a distribution 'chunk' at a time
            while (<$sh>) {
                my %tempHash;
                ( $tempHash{holdingCode} )  = /code:(\S+)/;
                ( $tempHash{copies} )       = /copies:(\d+)/;
                ( $tempHash{dateReceived} ) = /received:(\S+)/;
                ( $tempHash{dateLoaded} )   = /loaded:(\S+)/;
                push @{ $hash{distribution} }, \%tempHash;
            }

            # Restore the old record separator
            $/ = $oldRecSeparator;
        }
    }

    # Work with the filled-in %hash by sending a reference to it to a subroutine
    # This is a complete record
    writeToSpreadSheet( \%hash );
    print Dumper \%hash;

    # Done 'reading' the string
    close $sh;
}

# Printing in a subroutine's not a good idea, but done here only to show
# how to access the hash
sub writeToSpreadSheet {
    my ($hashReference) = @_;

    # The $$ notation dereferences the hash reference
    print $$hashReference{vendorID}, "\n";

    # The @{} notation dereferences the array reference; the arrow operator
    # dereferences to get the hash value
    for my $distribution ( @{ $$hashReference{distribution} } ) {
        print $distribution->{holdingCode}, "\n";
    }

    print "\n";
}

__DATA__
List of Distributions                    Produced Tuesday, 9 October, 2012 at 1:38 PM

Order ID:PO-9999    fiscal cycle:21112
  Vendor ID:VEND99    order type:SUBSCRIPT
  15) requisition number:
      copies:9    call number:XX(9999999.999)
      ISBN/ISSN:9999-999X
      Title:Item title here.
      ISSN:9999-999X
      Publication info:More text here about stuff
      Distribution--
          packing list:STUFF-I-DONT-NEED-999
          holding code:CODEINFO1    copies:1
          date received:27/6/2012    date loaded:27/6/2012
      Distribution--
          packing list:STUFF-I-DONT-NEED-999
          holding code:CODEINFO3    copies:2
          date received:27/9/2012    date loaded:27/6/2012
      Distribution--
          packing list:STUFF-I-DONT-NEED-999
          holding code:CODEINFO2    copies:1
          date received:25/8/2012    date loaded:27/6/2012

List of Distributions                    Produced Tuesday, 9 October, 2012 at 1:38 PM

Order ID:PO-1111    fiscal cycle:21112
  Vendor ID:VEND11    order type:SUBSCRIPT
  15) requisition number:
      copies:417    call number:XX(11111111.111)
      ISBN/ISSN:1111-111X
      Title:Item title here.
      ISSN:9999-999X
      Publication info:More text here about stuff
      Distribution--
          packing list:STUFF-I-DONT-NEED-111
          holding code:CODEINFO9    copies:5
          date received:11/6/2012    date loaded:12/6/2012
      Distribution--
          packing list:STUFF-I-DONT-NEED-111
          holding code:CODEINFO8    copies:4
          date received:11/9/2012    date loaded:12/6/2012
      Distribution--
          packing list:STUFF-I-DONT-NEED-111
          holding code:CODEINFO7    copies:3
          date received:11/8/2012    date loaded:12/6/2012
      Distribution--
          packing list:STUFF-I-DONT-NEED-111
          holding code:CODEINFO6    copies:2
          date received:11/8/2012    date loaded:12/6/2012
Output
VEND99
CODEINFO1
CODEINFO3
CODEINFO2

$VAR1 = {
          'vendorID' => 'VEND99',
          'copies' => '9',
          'fiscalCycle' => '21112',
          'distribution' => [
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '27/6/2012',
                                'copies' => '1',
                                'holdingCode' => 'CODEINFO1'
                              },
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '27/9/2012',
                                'copies' => '2',
                                'holdingCode' => 'CODEINFO3'
                              },
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '25/8/2012',
                                'copies' => '1',
                                'holdingCode' => 'CODEINFO2'
                              }
                            ],
          'ISBN/ISSN' => '9999-999X',
          'title' => 'Item title here.',
          'orderID' => 'PO-9999',
          'requisitionNum' => '15'
        };
VEND11
CODEINFO9
CODEINFO8
CODEINFO7
CODEINFO6

$VAR1 = {
          'vendorID' => 'VEND11',
          'copies' => '417',
          'fiscalCycle' => '21112',
          'distribution' => [
                              {
                                'dateLoaded' => '12/6/2012',
                                'dateReceived' => '11/6/2012',
                                'copies' => '5',
                                'holdingCode' => 'CODEINFO9'
                              },
                              {
                                'dateLoaded' => '12/6/2012',
                                'dateReceived' => '11/9/2012',
                                'copies' => '4',
                                'holdingCode' => 'CODEINFO8'
                              },
                              {
                                'dateLoaded' => '12/6/2012',
                                'dateReceived' => '11/8/2012',
                                'copies' => '3',
                                'holdingCode' => 'CODEINFO7'
                              },
                              {
                                'dateLoaded' => '12/6/2012',
                                'dateReceived' => '11/8/2012',
                                'copies' => '2',
                                'holdingCode' => 'CODEINFO6'
                              }
                            ],
          'ISBN/ISSN' => '1111-111X',
          'title' => 'Item title here.',
          'requisitionNum' => '15',
          'orderID' => 'PO-1111'
        };
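The part most likely to look odd is the inner loop, which changes the record separator partway through reading the same handle. Here is that one trick stripped down, assuming a made-up record string (`$record`, `id:` and `code:` are illustrative names only, not from your data):

```perl
use strict;
use warnings;

# Hypothetical record: one header line, then two distribution chunks
my $record = "id:A1\nDistribution--\ncode:X copies:1\nDistribution--\ncode:Y copies:2\n";

open my $sh, '<', \$record or die "Unable to open record string: $!";

my ( $id, @codes );
while ( my $line = <$sh> ) {

    # Header fields are picked up line by line
    ($id) = $line =~ /id:(\S+)/ unless defined $id;

    if ( $line =~ /Distribution--/ ) {

        # From here on, read up to each 'Distribution--' marker instead of "\n";
        # local undoes the change when this block ends
        local $/ = 'Distribution--';
        while ( my $chunk = <$sh> ) {
            my ($code) = $chunk =~ /code:(\S+)/;
            push @codes, $code if defined $code;
        }
    }
}
close $sh;

print "$id: @codes\n";    # prints "A1: X Y"
```

Because the inner loop reads the handle to its end, the outer loop finds nothing left and finishes on its next read.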
I've included a subroutine, and a call to it, to show how to access the hash one record at a time.
The code is commented, to assist with understanding it.
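If you'd rather have one row per distribution before writing to the spreadsheet, something along these lines might be a starting point. It's only a sketch: `flattenOrder` is a hypothetical helper, and the sample `%order` just mimics the shape of the Dumper output above.

```perl
use strict;
use warnings;

# Hypothetical record in the same shape as the hash built per Order
my %order = (
    orderID      => 'PO-9999',
    vendorID     => 'VEND99',
    distribution => [
        { holdingCode => 'CODEINFO1', copies => 1 },
        { holdingCode => 'CODEINFO3', copies => 2 },
    ],
);

# flattenOrder (hypothetical name): returns one array ref per distribution,
# repeating the order-level fields on every row
sub flattenOrder {
    my ($order) = @_;
    my @rows;
    for my $dist ( @{ $order->{distribution} // [] } ) {
        push @rows,
          [ $order->{orderID}, $order->{vendorID}, $dist->{holdingCode}, $dist->{copies} ];
    }
    return @rows;
}

my @rows = flattenOrder( \%order );
print join( ',', @$_ ), "\n" for @rows;
```

Each row could then be handed to whatever spreadsheet module you end up using.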
Let me know if you have any questions about this...
Enjoy!
In reply to Re^7: How best to strip text from a file?
by Kenosis
in thread How best to strip text from a file?
by bobdabuilda