in reply to Clean data - where field contains a CRLF

The following works for the sample data you have given. Note the hard-wired field count, and that the code dies if an accumulated record ends up with more '|' separators than expected.

use strict;
use warnings;

# Number of '|' separators in one complete record
use constant FIELDS => 26;

my $line = '';

while (<DATA>) {
    s/\r//g;
    chomp;
    $line .= $_;    # accumulate physical lines until the record is complete

    my $fields = $line =~ tr/|//;    # count the separators seen so far

    next if FIELDS > $fields;        # record still incomplete: read more
    die "Field count too great in line $." if FIELDS < $fields;

    my @fields = split /\|/, $line;
    $line = '';
    print join ' ', @fields, "\n\n";
}

__DATA__
EN|486822|||KKJSKA|L|L00219796|STR, JASON A|JASON|A|STR|||||3710 |NORTH CANTON|OH|44720|||000|0003053964|I|||
EN|486823|||YYYYYY|L|L00738657|OCID, SEAN M|SEAN|M|OCID|||||3846 Foxtail Lane |CINCINNATI|OH|45248|||000|0009544289|I|||
EN|486824|||KXXXXP|L||DSBS, ANDREW J|ANDREW|J|DSBS|||||28835 STILXXXXXX|FARXXXXX HILLS|MI|48334|||000||I|||

Prints:

EN 486822   KKJSKA L L00219796 STR, JASON A JASON A STR     3710  NORTH CANTON OH 44720   000 0003053964 I

EN 486823   YYYYYY L L00738657 OCID, SEAN M SEAN M OCID     3846 Foxtail Lane  CINCINNATI OH 45248   000 0009544289 I

EN 486824   KXXXXP L  DSBS, ANDREW J ANDREW J DSBS     28835 STILXXXXXX FARXXXXX HILLS MI 48334   000  I

DWIM is Perl's answer to Gödel

Re^2: Clean data - where field contains a CRLF
by graff (Chancellor) on Aug 21, 2006 at 01:08 UTC
    Minor nitpick, Grampa:
    # s/\r//g;
    # chomp;
    # expressed better (less platform dependent) as:
    s/[\r\n]+//g;
    # or, to be compulsive, use the numerics:
    s/[\x0a\x0d]+//g;
    According to the perl docs I've seen, chomp "removes any trailing string that corresponds to the current value of $/".

    If perl has $/ set to "\r\n", taking away the "\r" before chomping might cause the chomp to do nothing at all. (But I'm not a Windows user, so I could be wrong about that.)
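
A minimal sketch of that scenario, assuming $/ has been set to "\r\n" explicitly (an assumption made for illustration; it is not Perl's default). The helper strip_cr_then_chomp and the sample string are hypothetical and not part of the posted code. It relies only on chomp's documented return value, the number of characters removed.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every "\r", then chomp with $/ set to
# $separator. Returns the resulting text and chomp's return value
# (the number of characters it removed).
sub strip_cr_then_chomp {
    my ($text, $separator) = @_;
    local $/ = $separator;    # restored automatically when the sub returns
    $text =~ s/\r//g;
    my $removed = chomp($text);
    return ($text, $removed);
}

for my $separator ("\n", "\r\n") {
    my ($text, $removed) = strip_cr_then_chomp("3710 \r\n", $separator);
    (my $shown = $separator) =~ s/\r/\\r/g;
    $shown =~ s/\n/\\n/g;
    printf "separator %-4s chomp removed %d, newline left behind: %s\n",
        $shown, $removed, ($text =~ /\n/ ? 'yes' : 'no');
}
```

With "\n" as the separator the trailing newline goes away; with "\r\n" the string no longer ends in the separator once the "\r" has been stripped, so chomp leaves the "\n" in place.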

    Also, depending on the data and the task, it might make more sense to replace every [\r\n]+ with a space, rather than an empty string, esp. if consecutive lines will be concatenated into a single string.
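
To make the difference concrete, here is a small sketch that joins two physical lines both ways. The helper join_lines and the two sample lines are made up for illustration; they are not taken from the original data.

```perl
use strict;
use warnings;

# Hypothetical helper: concatenate physical lines into one string,
# replacing each run of CR/LF characters with $replacement.
sub join_lines {
    my ($replacement, @lines) = @_;
    my $joined = '';
    for my $line (@lines) {
        (my $clean = $line) =~ s/[\r\n]+/$replacement/g;
        $joined .= $clean;
    }
    return $joined;
}

# Invented sample: one address field broken across two lines.
my @sample = ("3846 Foxtail\r\n", "Lane|CINCINNATI\r\n");

print '[', join_lines('',  @sample), "]\n";   # words run together
print '[', join_lines(' ', @sample), "]\n";   # words stay separated
```

Replacing with an empty string yields "FoxtailLane", while replacing with a space keeps the words apart. The space variant also leaves a trailing space at the end of the joined string, which may need trimming depending on the task.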

      Possibly a Mac issue, but not a Windows issue. Perl's IO processing will already have converted CRLF to \n under Windows. The code I posted was tested using Windows.

      However I agree that your regex solution is likely to be better. I'd avoid the "numeric" version though. That makes it more, rather than less, sensitive to OS and character sets.

      Perl converts native line ends to \n (which may or may not be an actual newline character) and sets $/ to \n by default, so it doesn't matter what the native OS line-end convention is or what character encoding is used: \n processing using non-binary mode I/O should be portable with Perl.
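
A rough way to see that conversion on any platform is to request the :crlf PerlIO layer explicitly, which is assumed here to stand in for what text-mode reads do on Windows. The helper read_lines_with_layer and the sample bytes are hypothetical, so treat this as a sketch rather than a description of Perl's internals.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $bytes untranslated to a temporary file,
# then read the lines back through the given PerlIO layer.
sub read_lines_with_layer {
    my ($bytes, $layer) = @_;

    my ($out, $path) = tempfile(UNLINK => 1);
    binmode $out, ':raw';               # no line-end translation on write
    print {$out} $bytes;
    close $out or die "close failed: $!";

    open my $in, "<$layer", $path or die "open failed: $!";
    my @lines = <$in>;
    close $in;
    return @lines;
}

my $bytes = "a|b\r\nc|d\r\n";           # invented CRLF-terminated sample

for my $layer (':raw', ':crlf') {
    my @lines = read_lines_with_layer($bytes, $layer);
    my $with_cr = grep { /\r/ } @lines;
    printf "%-5s read %d lines, %d still containing \\r\n",
        $layer, scalar @lines, $with_cr;
}
```

Read through :raw, both lines keep their "\r"; read through :crlf, each CRLF pair arrives as a single "\n", so a plain chomp with the default $/ is enough.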


      DWIM is Perl's answer to Gödel