Re^4: How best to strip text from a file?

Re^5: How best to strip text from a file? by Kenosis (Priest) on Nov 07, 2012 at 23:31 UTC
Re^6: How best to strip text from a file? by bobdabuilda (Beadle) on Nov 08, 2012 at 01:29 UTC
Re^7: How best to strip text from a file? by Kenosis (Priest) on Nov 08, 2012 at 05:05 UTC
Re^8: How best to strip text from a file? by bobdabuilda (Beadle) on Nov 09, 2012 at 04:55 UTC
Re^9: How best to strip text from a file? by Kenosis (Priest) on Nov 09, 2012 at 06:02 UTC

You're most welcome, bobdabuilda!

...there are usually numerous Orders containing the multiple distributions...

Suspected so. What separates these Orders? One option is to set the record separator ($/) to the text that separates Orders, and then do the matching on each Order.
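A minimal sketch of that idea, assuming the whole report is already in a string and that every Order begins with the text "Order ID:". The helper name `split_orders` and the tiny sample report are made up for illustration and are not from your script:

```perl
use strict;
use warnings;

# Hypothetical helper: split report text into one string per Order.
sub split_orders {
    my ($text) = @_;

    # Read the string through an in-memory filehandle
    open my $fh, '<', \$text or die "Cannot open in-memory string: $!";

    # Each read now returns everything up to and including 'Order ID:'
    local $/ = 'Order ID:';
    my @chunks = <$fh>;
    close $fh;

    # chomp honors $/, so this strips the trailing 'Order ID:' from each chunk
    chomp @chunks;

    # The first chunk is whatever preceded the first Order (the page header)
    shift @chunks;

    # Put the separator back at the front of each Order
    return map { "Order ID:$_" } @chunks;
}

my $report = "Page header\nOrder ID:PO-1 copies:2\n\nOrder ID:PO-2 copies:5\n";
my @orders = split_orders($report);
print scalar(@orders), " orders found\n";
print "First order: $orders[0]";
```

Each element of `@orders` can then be matched on its own, without worrying about where the next Order starts.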


Yes, that's what I've been looking at (and trying to do). The orders are only separated by a blank line, but they all start with the "Order ID:" text, so I'm looking at using that as the separator.

The report also spans multiple pages, including a header on each page, which complicates things just that little bit more also... but I'll worry about that later, once I have the logic for the full order sorted. The page header should be automatically filtered out by the regex the way it stands anyway... I think.

One thing I *could* do with a suggestion on, is how to handle breaking out of the loop at the end of each Order. About the only way I can think of to know to stop processing distributions, is to look for the start of the next Order record. In order to do that, though, the line containing data I want has to be read in at the "end" of the loop for the previous Order... and then back up at the start of the loop, it reads the next line of the file in, dropping the previous one, which contains (some of) the data I'm after.

Probably easier to show you what I mean in pseudocode to give a better idea :

while <DATA> {
  if (start of record) {
    get order details
    while (not a new order) {
      get distribution details into a hash
    }
    print order details and distributions to Excel
  }
}

So, from the above, the issue I am having is the two While loops... the second one "eats" the order info of any Orders following the first. I'm sure I could put some post-While processing there to trap the data before it loops to the next line... but that just seems a bit... uncouth, for want of a better word. Can't help thinking it should be more elegant (not to mention less likely to fail) than that.


Hi, bobdabuilda

You've given this much thought, and I think your pseudocode is on target.

The orders are only separated by a blank line, but they all start with the "Order ID:" text, so looking at using that as the separator.

The "Order ID:" as record separator makes sense.
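As an aside, your single-pass pseudocode can also be made to work without the inner loop eating the next Order's first line: keep one loop, and treat each "Order ID:" line as the signal to finish the previous Order and start a new one. A rough sketch, assuming only the order ID and holding codes are wanted; `parse_orders`, the `holdingCodes` key, and the cut-down sample report are hypothetical names and data of mine:

```perl
use strict;
use warnings;

# Hypothetical helper: parse the report in a single pass.
# A new "Order ID:" line closes the previous Order, so no line is lost.
sub parse_orders {
    my ($text) = @_;
    my ( @orders, $current );

    for my $line ( split /\n/, $text ) {
        if ( $line =~ /Order ID:(\S+)/ ) {

            # Finish the previous Order before starting the next one
            push @orders, $current if $current;
            $current = { orderID => $1, holdingCodes => [] };
        }
        elsif ( $current and $line =~ /holding code:(\S+)/ ) {
            push @{ $current->{holdingCodes} }, $1;
        }
    }

    # The last Order has no successor to close it, so save it here
    push @orders, $current if $current;
    return @orders;
}

my $report = join "\n",
    'Order ID:PO-9999   fiscal cycle:21112',
    'Distribution--',
    'holding code:CODEINFO1   copies:1',
    'Order ID:PO-1111   fiscal cycle:21112',
    'Distribution--',
    'holding code:CODEINFO9   copies:5';

for my $order ( parse_orders($report) ) {
    print "$order->{orderID}: @{ $order->{holdingCodes} }\n";
}
```

The one piece of "post-loop" work left is saving the final Order, since nothing follows it to trigger the flush.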

The page header should be automatically filtered out by the regex the way it stands anyway... I think.

You're correct.

I've taken the liberty of implementing an interpretation of this. It does use two loops, but the outer loop is a for loop that iterates over an array of Order records:

use strict;
use warnings;
use Data::Dumper;

# Place a filename into $recordsFile to read Orders from that file
#  else the Orders below __DATA__ will be used for demo purposes
my $recordsFile = '';

my @records;
my $recSeparator = 'Order ID:';

# Orders will initially be array elements 1 .. n in @records; element 0
# is initially the first page header
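{
    # Set the record separator
    local $/ = $recSeparator;

    # If there's a file name, try to read from that file
    if ($recordsFile) {
        open my $fh, '<', $recordsFile
          or die "Unable to open '$recordsFile': $!";
        @records = <$fh>;
        close $fh;
    }
    else {
        @records = <DATA>;
    }

    # Each chunk ends with the separator itself; chomp (which honors $/)
    # strips that trailing 'Order ID:'
    chomp @records;
}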

# Remove the first page header
shift @records;

# Add Order ID: back into each record for later matching
$_ = "$recSeparator$_" for @records;

# Iterate through each record (Order)
for my $record (@records) {
    my %hash;

    # Treat the record string like a file, opening it for reading
    open my $sh, '<', \$record or die "Unable to open record string: $!";

    # Read the string like a file, one line at a time now
    while (<$sh>) {
        $hash{orderID}        //= do { /Order ID:(\S+)/;        $1 };
        $hash{fiscalCycle}    //= do { /cycle:(\d+)/;           $1 };
        $hash{vendorID}       //= do { /Vendor ID:(\S+)/;       $1 };
        $hash{requisitionNum} //= do { /\s+(\d+).+requisition/; $1 };
        $hash{copies}         //= do { /copies:(\d+)/;          $1 };
        $hash{title}          //= do { /Title:(.+)/;            $1 };
        $hash{'ISBN/ISSN'}    //= do { m{ISBN/ISSN:(\S+)};      $1 };

        # Distributions started?
        if (/Distribution--/) {

            # Set a new record separator; local restores the old one
            # automatically when this block exits
            local $/ = 'Distribution--';

            # Read the string like a file, a distribution 'chunk' at a time
            while (<$sh>) {
                my %tempHash;

                ( $tempHash{holdingCode} )  = /code:(\S+)/;
                ( $tempHash{copies} )       = /copies:(\d+)/;
                ( $tempHash{dateReceived} ) = /received:(\S+)/;
                ( $tempHash{dateLoaded} )   = /loaded:(\S+)/;

                push @{ $hash{distribution} }, \%tempHash;
            }
        }
    }

    # Work with the filled-in %hash by sending a reference to it to a subroutine
    # This is a complete record
    writeToSpreadSheet( \%hash );
    
    print Dumper \%hash;

    # Done 'reading' the string
    close $sh;
}


# Printing in a subroutine's not a good idea, but done here only to show how to access the hash
sub writeToSpreadSheet {
    my ($hashReference) = @_;

    # The $$ notation dereferences the hash reference
    print $$hashReference{vendorID}, "\n";

    # The @{} notation dereferences the array reference; the arrow operator
    # dereferences to get the hash value
    for my $distribution ( @{ $$hashReference{distribution} } ) {
        print $distribution->{holdingCode}, "\n";
    }

    print "\n";
}

__DATA__
                             List of Distributions

                  Produced Tuesday, 9 October, 2012 at 1:38 PM



       Order ID:PO-9999                  fiscal cycle:21112
      Vendor ID:VEND99                     order type:SUBSCRIPT
    15)   requisition number:                      copies:9    
                call number:XX(9999999.999)                          
                  ISBN/ISSN:9999-999X           
         Title:Item title here.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO1                   copies:1
           date received:27/6/2012                             date loaded:27/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO3                    copies:2
           date received:27/9/2012                             date loaded:27/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO2                     copies:1
           date received:25/8/2012                             date loaded:27/6/2012

                              List of Distributions

                  Produced Tuesday, 9 October, 2012 at 1:38 PM



       Order ID:PO-1111                  fiscal cycle:21112
      Vendor ID:VEND11                     order type:SUBSCRIPT
    15)   requisition number:                      copies:417    
                call number:XX(11111111.111)                          
                  ISBN/ISSN:1111-111X           
         Title:Item title here.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO9                   copies:5
           date received:11/6/2012                             date loaded:12/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO8                    copies:4
           date received:11/9/2012                             date loaded:12/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO7                     copies:3
           date received:11/8/2012                             date loaded:12/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO6                     copies:2
           date received:11/8/2012                             date loaded:12/6/2012

Output

VEND99
CODEINFO1
CODEINFO3
CODEINFO2

$VAR1 = {
          'vendorID' => 'VEND99',
          'copies' => '9',
          'fiscalCycle' => '21112',
          'distribution' => [
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '27/6/2012',
                                'copies' => '1',
                                'holdingCode' => 'CODEINFO1'
                              },
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '27/9/2012',
                                'copies' => '2',
                                'holdingCode' => 'CODEINFO3'
                              },
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '25/8/2012',
                                'copies' => '1',
                                'holdingCode' => 'CODEINFO2'
                              }
                            ],
          'ISBN/ISSN' => '9999-999X',
          'title' => 'Item title here.',
          'orderID' => 'PO-9999',
          'requisitionNum' => '15'
        };
VEND11
CODEINFO9
CODEINFO8
CODEINFO7
CODEINFO6

$VAR1 = {
          'vendorID' => 'VEND11',
          'copies' => '417',
          'fiscalCycle' => '21112',
          'distribution' => [
                              {
                                'dateLoaded' => '12/6/2012',
                                'dateReceived' => '11/6/2012',
                                'copies' => '5',
                                'holdingCode' => 'CODEINFO9'
                              },
                              {
                                'dateLoaded' => '12/6/2012',
                                'dateReceived' => '11/9/2012',
                                'copies' => '4',
                                'holdingCode' => 'CODEINFO8'
                              },
                              {
                                'dateLoaded' => '12/6/2012',
                                'dateReceived' => '11/8/2012',
                                'copies' => '3',
                                'holdingCode' => 'CODEINFO7'
                              },
                              {
                                'dateLoaded' => '12/6/2012',
                                'dateReceived' => '11/8/2012',
                                'copies' => '2',
                                'holdingCode' => 'CODEINFO6'
                              }
                            ],
          'ISBN/ISSN' => '1111-111X',
          'title' => 'Item title here.',
          'requisitionNum' => '15',
          'orderID' => 'PO-1111'
        };

I've included a subroutine, and a call to it, to show how to access the hash one record at a time.
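If it helps, here is one more small sketch of walking that nested structure, this time with the arrow notation throughout. It assumes a hash shaped like the Dumper output above; `total_distributed_copies` is a hypothetical helper of mine, and the sample hash is trimmed to the keys it needs:

```perl
use strict;
use warnings;

# Hypothetical helper: add up the copies across all distributions of one Order.
sub total_distributed_copies {
    my ($order) = @_;
    my $total = 0;

    # An Order without distributions has no 'distribution' key,
    # so fall back to an empty array reference
    for my $dist ( @{ $order->{distribution} // [] } ) {
        $total += $dist->{copies} // 0;
    }
    return $total;
}

my %order = (
    orderID      => 'PO-9999',
    distribution => [
        { holdingCode => 'CODEINFO1', copies => 1 },
        { holdingCode => 'CODEINFO3', copies => 2 },
        { holdingCode => 'CODEINFO2', copies => 1 },
    ],
);

print "$order{orderID}: ", total_distributed_copies( \%order ),
  " copies distributed\n";
```

`$order->{distribution}` and `$$order{distribution}` mean the same thing; which one to use is largely a matter of taste.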

The code is commented, to assist with understanding it.

Let me know if you have any questions about this...

Enjoy!
