sxmwb has asked for the wisdom of the Perl Monks concerning the following question:

I have a set of data where each field is separated by a '|' and the end of record is CRLF or LF. The problem I have run into is that the original application, that I have no control over, allows fields to contain CRLFs or LFs. I thought one approach was to count '|' and validate the number of field per row read in and if there were not enough chomp the CRLF or LF and join the next record to that record. This approach might work but was wondering what other possibilities you someone else may have used to help clean data with this kind of problem. Here is a a sample the data:
EN|486822|||KKJSKA|L|L00219796|STR, JASON A|JASON|A|STR|||||3710 |NORT +H CANTON|OH|44720|||000|0003053964|I||| EN|486823|||YYYYYY|L|L00738657|OCID, SEAN M|SEAN|M|OCID|||||3846 Foxta +il Lane |CINCINNATI|OH|45248|||000|0009544289|I||| EN|486824|||KXXXXP|L||DSBS, ANDREW J|ANDREW|J|DSBS|||||28835 STILXXXXX +X|FARXXXXX HILLS|MI|48334|||000||I|||
Thanks for any ideas. Mike

Replies are listed 'Best First'.
Re: Clean data - where field contains a CRLF
by GrandFather (Saint) on Aug 20, 2006 at 22:51 UTC

    The following works for the sample data you have given. Note the hard wired field count and that the code will die if something bad happens.

    use strict; use warnings; use constant FIELDS => 26; my $line = ''; while (<DATA>) { s/\r//g; chomp; $line .= $_; my $fields = $line=~ tr/|//; next if FIELDS > $fields; die "Field count too great in line $." if FIELDS < $fields; my @fields = split /\|/, $line; $line = ''; print join ' ', @fields, "\n\n"; } __DATA__ EN|486822|||KKJSKA|L|L00219796|STR, JASON A|JASON|A|STR|||||3710 |NORT +H CANTON|OH|44720|||000|0003053964|I||| EN|486823|||YYYYYY|L|L00738657|OCID, SEAN M|SEAN|M|OCID|||||3846 Foxta +il Lane |CINCINNATI|OH|45248|||000|0009544289|I||| EN|486824|||KXXXXP|L||DSBS, ANDREW J|ANDREW|J|DSBS|||||28835 STILXXXXX +X|FARXXXXX HILLS|MI|48334|||000||I|||

    Prints:

    EN 486822 KKJSKA L L00219796 STR, JASON A JASON A STR 3710 NORT +H CANTON OH 44720 000 0003053964 I EN 486823 YYYYYY L L00738657 OCID, SEAN M SEAN M OCID 3846 Foxta +il Lane CINCINNATI OH 45248 000 0009544289 I EN 486824 KXXXXP L DSBS, ANDREW J ANDREW J DSBS 28835 STILXXXXX +X FARXXXXX HILLS MI 48334 000 I

    DWIM is Perl's answer to Gödel
      Minor nitpick, Grampa:
      # s/\r//g; # chomp; # expressed better (less platform dependent) as: s/[\r\n]+//g; # or, to be compulsive, use the numerics: s/[\x0a\x0d]+//g;
      According to the perl docs I've seen, chomp "removes any trailing string that corresponds to the current value of $/".

      If perl has $/ set to "\r\n", taking away the "\r" before chomping might cause the chomp to do nothing at all. (But I'm not a windows user, so I could be wrong about that.)

      Also, depending on the data and the task, it might make more sense to replace every [\r\n]+ with a space, rather than an empty string, esp. if consecutive lines will be concatenated into a single string.

        Possibly a Mac issue, but not a Windows issue. Perl's IO processing will already have converted CRLF to \n under Windows. The code I posted was tested using Windows.

        However I agree that your regex solution is likely to be better. I'd avoid the "numeric" version though. That makes it more, rather than less, sensitive to OS and character sets.

        Perl converts native line ends to \n (which may or may not be an actual new line character), and sets $/ to \n by default so it doesn't matter what the native OS line end convention is and it doesn't matter what character encoding is used - \n procesing using non-binary mode I/O should be portable with Perl.


        DWIM is Perl's answer to Gödel
Re: Clean data - where field contains a CRLF
by rir (Vicar) on Aug 21, 2006 at 03:20 UTC
    Grandfather's approach to your problem is more direct than this. Starting from your post's title, I anticipated a different question: cleaning data which should have newlines in it, e.g. address is one field directly printable.

    I doubt I have the same trailing trash on my line-ends as you do.

    Be well,
    rir

    #!/usr/bin/perl use warnings; use strict; my $separator = '|'; my $input = ""; while (<DATA>) { $input .= $_; if ($input =~ m/ ^ # start of string ( # group 1 is (?: # group 2 not memo'd is [^|]*\| # any # not-pipe then a pipe ) {26} # repeated 26 times total $ # and reaching the end of string ) /sx # while matching \n with . ) { my $record = $1; $input =~ s/^\Q$record\E\n//s; # trailing trash?? my @arr = split /\|/, $record; # process @arr } } __DATA__ EN|486822|||KKJSKA|L|L00219796|STR, JASON A|JASON|A|STR|||||3710 |NORT +H CANTON|OH|44720|||000|0003053964|I||| EN|486823|||YYYYYY|L|L00738657|OCID, SEAN M|SEAN|M|OCID|||||3846 Foxta +il Lane |CINCINNATI|OH|45248|||000|0009544289|I||| EN|486824|||KXXXXP|L||DSBS, ANDREW J|ANDREW|J|DSBS|||||28835 STILXXXXX +X|FARXXXXX HILLS|MI|48334|||000||I|||
      I think you understand where I was heading. One thought as I slept on it last night was could you split multi-line records using and capturing the EN since it is the first character of the record. Then remove the extra CRLFs except for the last one. I do not know if this is possible using the split command over multiple lines. Thoughts? Thanks Mike
        Yes, you can. Split doesn't care about newlines unless they are in its first argument.

        Be well,
        rir

Re: Clean data - where field contains a CRLF
by Dervish (Friar) on Aug 21, 2006 at 06:00 UTC
    My generic response to this would be "How do you tell when a line end character is not end of record? Is it number of fields? Some special information contained in your data? Once you know the answer to that, then the best way to solve your problem should be easy to figure out." That said, GrandFather's response is educational as always...
Re: Clean data - where field contains a CRLF
by diotalevi (Canon) on Aug 21, 2006 at 13:11 UTC

    Text::CSV_XS comes with support for newlines in fields. You might not be able to use that feature since I generally think it requires that there be some kind of quoting character and I don't see any in your data.

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: Clean data - where field contains a CRLF
by sxmwb (Pilgrim) on Aug 22, 2006 at 15:08 UTC
    Here is what I cam up with as a solution using split. I was looking at the Perl Cookbook and the ideas here. This seems to work but is a bit slow with a million records.
    #!/usr/bin/perl use strict; my @chunks; { local $/ = undef; @chunks = split(/^(?=(?:EN))/m, <>); } print "Rec: ",scalar(@chunks), " eor\n"; for (my $i=0; $i <= scalar(@chunks); $i++) { $chunks[$i] =~ s/[\r\n]+//g; if ($chunks[$i] =~ /^EN/ ) { print "$chunks[$i]\n"; } }
    The bottom bit was just to prove it worked and print out new clean records. I need to do some addtional error checking but this produces a clean file.

    Any ideas for improvement?

    sxmwb