Clean data - where field contains a CRLF

sxmwb has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Clean data - where field contains a CRLF by GrandFather (Saint) on Aug 20, 2006 at 22:51 UTC
The following works for the sample data you have given. Note the hard wired field count and that the code will die if something bad happens. use strict; use warnings; use constant FIELDS => 26; my $line = ''; while (<DATA>) { s/\r//g; chomp; $line .= $_; my $fields = $line=~ tr/\|//; next if FIELDS > $fields; die "Field count too great in line $." if FIELDS < $fields; my @fields = split /\\|/, $line; $line = ''; print join ' ', @fields, "\n\n"; } __DATA__ EN\|486822\|\|\|KKJSKA\|L\|L00219796\|STR, JASON A\|JASON\|A\|STR\|\|\|\|\|3710 \|NORT +H CANTON\|OH\|44720\|\|\|000\|0003053964\|I\|\|\| EN\|486823\|\|\|YYYYYY\|L\|L00738657\|OCID, SEAN M\|SEAN\|M\|OCID\|\|\|\|\|3846 Foxta +il Lane \|CINCINNATI\|OH\|45248\|\|\|000\|0009544289\|I\|\|\| EN\|486824\|\|\|KXXXXP\|L\|\|DSBS, ANDREW J\|ANDREW\|J\|DSBS\|\|\|\|\|28835 STILXXXXX +X\|FARXXXXX HILLS\|MI\|48334\|\|\|000\|\|I\|\|\| [download] Prints: `EN 486822 KKJSKA L L00219796 STR, JASON A JASON A STR 3710 NORT +H CANTON OH 44720 000 0003053964 I EN 486823 YYYYYY L L00738657 OCID, SEAN M SEAN M OCID 3846 Foxta +il Lane CINCINNATI OH 45248 000 0009544289 I EN 486824 KXXXXP L DSBS, ANDREW J ANDREW J DSBS 28835 STILXXXXX +X FARXXXXX HILLS MI 48334 000 I` [download] DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re^2: Clean data - where field contains a CRLF by graff (Chancellor) on Aug 21, 2006 at 01:08 UTC
Minor nitpick, Grampa: `# s/\r//g; # chomp; # expressed better (less platform dependent) as: s/[\r\n]+//g; # or, to be compulsive, use the numerics: s/[\x0a\x0d]+//g;` [download] According to the perl docs I've seen, chomp "removes any trailing string that corresponds to the current value of $/". If perl has $/ set to "\r\n", taking away the "\r" before chomping might cause the chomp to do nothing at all. (But I'm not a windows user, so I could be wrong about that.) Also, depending on the data and the task, it might make more sense to replace every `[\r\n]+` with a space, rather than an empty string, esp. if consecutive lines will be concatenated into a single string.	[reply] [d/l] [select]
Re^3: Clean data - where field contains a CRLF by GrandFather (Saint) on Aug 21, 2006 at 02:37 UTC
Possibly a Mac issue, but not a Windows issue. Perl's IO processing will already have converted CRLF to \n under Windows. The code I posted was tested using Windows. However I agree that your regex solution is likely to be better. I'd avoid the "numeric" version though. That makes it more, rather than less, sensitive to OS and character sets. Perl converts native line ends to \n (which may or may not be an actual new line character), and sets $/ to \n by default so it doesn't matter what the native OS line end convention is and it doesn't matter what character encoding is used - \n procesing using non-binary mode I/O should be portable with Perl. DWIM is Perl's answer to Gödel	[reply]
Re: Clean data - where field contains a CRLF by rir (Vicar) on Aug 21, 2006 at 03:20 UTC
Grandfather's approach to your problem is more direct than this. Starting from your post's title, I anticipated a different question: cleaning data which should have newlines in it, e.g. address is one field directly printable. I doubt I have the same trailing trash on my line-ends as you do. Be well, rir #!/usr/bin/perl use warnings; use strict; my $separator = '\|'; my $input = ""; while (<DATA>) { $input .= $_; if ($input =~ m/ ^ # start of string ( # group 1 is (?: # group 2 not memo'd is [^\|]*\\| # any # not-pipe then a pipe ) {26} # repeated 26 times total $ # and reaching the end of string ) /sx # while matching \n with . ) { my $record = $1; $input =~ s/^\Q$record\E\n//s; # trailing trash?? my @arr = split /\\|/, $record; # process @arr } } __DATA__ EN\|486822\|\|\|KKJSKA\|L\|L00219796\|STR, JASON A\|JASON\|A\|STR\|\|\|\|\|3710 \|NORT +H CANTON\|OH\|44720\|\|\|000\|0003053964\|I\|\|\| EN\|486823\|\|\|YYYYYY\|L\|L00738657\|OCID, SEAN M\|SEAN\|M\|OCID\|\|\|\|\|3846 Foxta +il Lane \|CINCINNATI\|OH\|45248\|\|\|000\|0009544289\|I\|\|\| EN\|486824\|\|\|KXXXXP\|L\|\|DSBS, ANDREW J\|ANDREW\|J\|DSBS\|\|\|\|\|28835 STILXXXXX +X\|FARXXXXX HILLS\|MI\|48334\|\|\|000\|\|I\|\|\| [download]	[reply] [d/l]
Re^2: Clean data - where field contains a CRLF by sxmwb (Pilgrim) on Aug 21, 2006 at 13:18 UTC
I think you understand where I was heading. One thought as I slept on it last night was could you split multi-line records using and capturing the EN since it is the first character of the record. Then remove the extra CRLFs except for the last one. I do not know if this is possible using the split command over multiple lines. Thoughts? Thanks Mike	[reply]
Re^3: Clean data - where field contains a CRLF by rir (Vicar) on Aug 21, 2006 at 13:41 UTC
Yes, you can. `Split` doesn't care about newlines unless they are in its first argument. Be well, rir	[reply] [d/l]
Re: Clean data - where field contains a CRLF by Dervish (Friar) on Aug 21, 2006 at 06:00 UTC
My generic response to this would be "How do you tell when a line end character is not end of record? Is it number of fields? Some special information contained in your data? Once you know the answer to that, then the best way to solve your problem should be easy to figure out." That said, GrandFather's response is educational as always...	[reply]
Re: Clean data - where field contains a CRLF by diotalevi (Canon) on Aug 21, 2006 at 13:11 UTC
Text::CSV_XS comes with support for newlines in fields. You might not be able to use that feature since I generally think it requires that there be some kind of quoting character and I don't see any in your data. ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊	[reply]
Re: Clean data - where field contains a CRLF by sxmwb (Pilgrim) on Aug 22, 2006 at 15:08 UTC
Here is what I cam up with as a solution using split. I was looking at the Perl Cookbook and the ideas here. This seems to work but is a bit slow with a million records. `#!/usr/bin/perl use strict; my @chunks; { local $/ = undef; @chunks = split(/^(?=(?:EN))/m, <>); } print "Rec: ",scalar(@chunks), " eor\n"; for (my $i=0; $i <= scalar(@chunks); $i++) { $chunks[$i] =~ s/[\r\n]+//g; if ($chunks[$i] =~ /^EN/ ) { print "$chunks[$i]\n"; } }` [download] The bottom bit was just to prove it worked and print out new clean records. I need to do some addtional error checking but this produces a clean file. Any ideas for improvement? sxmwb	[reply] [d/l]