Re: Split file based on tag

I could simply tell you how to quickfix your script. Instead, I will show you how to separate your problem into smaller reusable chunks.

First we are going to define a class which holds the data of a single record in the input file:

package MyDataRecord;

# Represents a record of MyData

use Moose;
use Moose::Util::TypeConstraints;

# -DateParse enables additional Date formats
use Class::Date qw(-DateParse);

# load carp for better warnings
use Carp        qw(cluck confess);

# create a subtype for Dates
subtype 'MyDate'  => as class_type('Class::Date');

# coerce string input, depending on the format
coerce  'MyDate'  => from 'Str' => via {
    /^\d{6}$/ ?
    Class::Date->new([unpack "A4A2A2","20$_"]) :
    Class::Date->new($_)
};

# data-members
has it      => ( isa => 'Str',    is => 'rw', required => 1 );
has edition => ( isa => 'Str',    is => 'rw', required => 1 );
has tag     => ( isa => 'Str',    is => 'rw', required => 1 );
has body    => ( isa => 'Str',    is => 'rw', required => 1 );
has date    => ( isa => 'MyDate', is => 'rw', required => 1, coerce =>
+ 1 );
has tdate   => ( isa => 'MyDate', is => 'rw', required => 1, coerce =>
+ 1 );

# check date integrity on construction
sub BUILD {
        my ( $self, $params ) = @_;
        cluck  "date and tdate should be identical\n"
            unless $self->tdate->ymd eq $self->date->ymd;
}

1;
[download]

Next we write a Parser class which can read our file and return an array of hashes.

package MyDataParser;

# parses a file or handle of MyData

use Moose;
use Moose::Util::TypeConstraints;

use IO::File;

subtype 'MyFile' => as class_type('IO::File');
coerce  'MyFile' => from 'Str' => via { IO::File->new($_) };

has file => ( isa => 'MyFile', is => 'rw', coerce => 1 );

sub parse
{
    my $self    = shift;
    my $file    = $self->file;
    my $re_tag  = qr/ ^ \{ ([A-Z]+) \} $ /x;
    my @records = ();
    my %record;
    my $tag;

    # iterate the file
    while ( <$file> )
    {
        # We found a tag
        if ( /$re_tag/ )
        {
            # save the new tag name because we use the value often
            my $ntag = lc $1;

            unless ( keys %record ) {
                # create the first record (only happens on the first l
+ine)
                %record = ( $ntag => '' );
            } else {
                # otherwise cut off the last newline of input
                chomp $record{$tag} if $tag;
                # initialize an empty value for the tag
                $record{$ntag} = '';
                # we found a new record
                if ( $ntag eq 'it' )
                {
                    # save the data as a record object
                    push @records, MyDataRecord->new(%record);
                }
            }
            # remember current tag for the next iterations
            $tag = $ntag;
            next;
        }
        # not a tag, so add the data to the current tags value
        $record{$tag} .= $_;
    }
    # save the last record, since there is no final {IT}
    push @records, MyDataRecord->new(%record);

    return \@records;
}
[download]

Now we created a piece of code that whe can use in every skript we want to, not having to think about the internals of the input file anymore. Regarding the original problem such a script would look like.

package main;

use warnings;
use strict;

use Template;
use MyDataParser;

# get the filename from the console
my $file = shift @ARGV || die "Error. You must specify a filename.\n";

# get the data from our file parser
print "parsing $file\n";
my $records = MyDataParser->new( file => $file )->parse;

# template for XML output
# use Template::Plugin::XML::Escape to ensure valid XML
my $ixemel  = q[
    <?xml version="1.0" encoding="UTF-8"?>
    [% USE XML.Escape %]
    <record>
        <it>[% record.it %]</it>
        <date>[% record.date.strftime('%y%m%d') %]</date>
        <tdate>[% record.tdate.strftime('%A, %B %d, %Y') %]</tdate>
        <edition>[% record.edition %]</edition>
        <tag>[% record.tag %]</tag>
        <body>[% record.body | xml_escape %]</body>
    <record>
];

# iterate the records
for my $record ( @$records )
{
    # create a template object
    my $tt = Template->new();

    # and process it( process->(template, data, filename) );
    print "writing ", $record->tag, "\n";

    die $tt->error(), "\n"
        unless $tt->process( \$ixemel, {record=>$record}, $record->tag
+ );
}

print "done\n";
[download]

Note the use of Templating for the XML output. This does us 2 favours here. First it makes the xml easier to maintain and the Template module does the work of writing our output files for us.

holli

You can lead your users to water, but alas, you cannot drown them.

Comment on Re: Split file based on tag Select or Download Code

Replies are listed 'Best First'.
Re^2: Split file based on tag by Anonymous Monk on Nov 28, 2009 at 03:29 UTC
`sub xml_output { my ($output, $tag, $fh) = @_; if($output) { if($output =~ m/<IT>(.)/) { if($fh) { print $fh "</ROOT>\n"; close($fh); } open($fh, '>', "$1.xml") or die "$1.xml: $!"; print $fh "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<RO +OT>\n"; } $output =~ s/\s(?=(<.+>\|<.+\/>\|<\/.+>\|<\/.+><.+>))//g; print $fh "$output</$tag>\n"; } return($fh); } # End of sub sroutine` [download] Please tell me how can i change in the this subroutine to split the file until it finds {IT} tag and filename as the {TAG} value. Please help me with this	[reply] [d/l]
Re^3: Split file based on tag by holli (Abbot) on Nov 28, 2009 at 13:02 UTC
Go and buy yourself a working brain. holli You can lead your users to water, but alas, you cannot drown them.	[reply] [d/l]