newline substitution is removing text too

by Groxx (Novice)
on Jan 21, 2007 at 04:32 UTC ( [id://595723]=perlquestion )

Groxx has asked for the wisdom of the Perl Monks concerning the following question:

This code:
    #!/usr/bin/perl
    # removenewlines
    use warnings;
    #use strict;
    $/ = undef;
    open IN, shift or die "Failed at opening given file: $!";
    $InFile = <IN>;
    $Output = $InFile;
    $Output =~ s/\n/things/g;
    print $Output . "\n";
is behaving oddly, and I'm not sure why. When it finds a newline character in my CSV file, it not only replaces the newline with "things", as it should, but also removes a fair amount of the text before it. The amount varies and doesn't seem to follow any logic: sometimes the removal starts in the middle of a word, sometimes after some other character, though with the file I have it removes the same amount at each spot every time. It seems to remove fewer than 100 characters each time (though I've only tested it with a fairly small file).
I know about a few CSV modules, but they aren't serving my needs (i.e., they aren't capturing the data correctly 100% of the time), so I'm building what I need as I need it. The main snag has been that one of the CSV files can contain newline characters within a field, so I started writing code to replace them. It wasn't working, so I wrote this to test the behavior... and now I'm stumped.
Below are two of the lines, each shown before and after:

An example of a "normal" CSV line (they all have quoted areas):

"inkjet001","Inkjet Printer",4,"prodimages/inkjetprinter.gif","prodimages/linkjetprinter.gif",95,90,130,4,0,2.02,1,1,0,0,"",,0,0,0,"This inkjet printer really packs a punch for the home user. Full color prints at photo quality. Perfect for everything from letters to the bank manager, to printing out your favourite digital family pictures.","This inkjet printer really packs a punch for the home user. Full color prints at photo quality. Perfect for everything from letters to the bank manager, to printing out your favourite digital family pictures.<br>As well as a larger image, you can use this ""Long Description"" to add extra detail or information about your products."
"inkjet001","Inkjet Printer",4,"prodimages/inkjetprinter.gif","prodimages/linkjetprinter.gif",95,90,130,4,0,2.02,1,1,0,0,"",,0,0,0,"This inkjet printer really packs a punch for the home user. Full color prints at photo quality. Perfect for everything from letters to the bank manager, to printing out your favourite digital family pictures.","This inkjet printer really packs a punch for the home user. Full color prints at photo quality. Perfect for everything from letters to the bank manager, to printing out your favourite digital family pictures.<br>As well as a larger image, you can use this ""Long Descthings

An example of a "line" with a newline:

"testproduct","Cheap Test Product",3,"prodimages/computercable.gif","prodimages/lcomputercable.gif",0.01,0.01,0,0,0,3,1,1,0,0,"",,0,0,0,"This is a cheap product for testing. Note how you can use <b>HTML <font color='#FF0000'>Markup</font></b> in product descriptions.<br>Also note that as you change the product options, the price changes automatically.","This is a cheap product for testing. Note how you can use <b>HTML <font color='#FF0000'>Markup</font></b> in product descriptions.<br>
In the long description you can go into more detail about products."
"testproduct","Cheap Test Product",3,"prodimages/computercable.gif","prodimages/lcomputercable.gif",0.01,0.01,0,0,0,3,1,1,0,0,"",,0,0,0,"This is a cheap product for testing. Note how you can use <b>HTML <font color='#FF0000'>Markup</font></b> in product descriptions.<br>Also note that as you change the product options, the price changes automatically.","This is a cheap product for testing. Note how you can use <b>HTML <font color=thingsIn the long description you can go into more detail about products."

Replies are listed 'Best First'.
Re: newline substitution is removing text too
by graff (Chancellor) on Jan 21, 2007 at 07:58 UTC
    I downloaded your script and your first text sample, ran the script on that text, and did not see the problem that you showed in the second text sample -- instead, I got the output that you were hoping to get. Ditto when running the third text sample (output was as intended, without the truncation you showed in the fourth text sample).

    So, if you're seeing those truncations, I would assume that you are not doing exactly the same thing I did -- either you are not running that exact script, or you are not using exactly those two inputs, or you are doing something else to the data in addition to running that script. Or else you are using something to view the output which is not giving you a faithful presentation of the data.

    (Have you tried using other methods to view and compare the input and the output, e.g. unix tools like "wc" or "od" or "xxd"?)
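
    If "od" or "xxd" aren't handy, roughly the same check can be sketched in Perl itself. This is only an illustration: show_invisibles is a made-up helper name and the sample string is invented, not taken from the actual file. It rewrites every character outside printable ASCII as a \xNN escape so that stray control characters become visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): make control characters visible.
sub show_invisibles {
    my ($text) = @_;
    # Replace every character outside printable ASCII with a \xNN escape.
    (my $visible = $text) =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord $1)/ge;
    return $visible;
}

# Invented sample containing a carriage return, a line feed, and a tab.
my $sample = "color='#FF0000'\r\nIn the long\tdescription";
print show_invisibles($sample), "\n";
```

    Running the real input and the real output through something like this, and redirecting the result to a file, would show whether anything unexpected sits near the spots where text seems to vanish.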

    As for CSV parsing modules not doing what you want, which particular modules have you tried, and how in particular did they fail to do the right thing for you? I would expect that Text::xSV would be pretty reliable for the kind of data you describe (handling embedded line-feeds within some fields), because that was a particular feature that the module author was intent on getting right.

    (updated to include the link to cpan)

      I don't remember the modules off-hand, but that wasn't one of them. I'll consider trying that one as well, thanks!

      As to the problem, I made it print to a file (previously just to the terminal window), and it worked fine.

      So... know anything about how OSX's bash terminal handles large lines of text, and why it might be cutting that out?

        I was actually using macosx/Terminal myself when trying your sample code, and it looked okay to me. I suppose sometimes if you are using the "more" or "less" pager and resizing the window, things can end up getting a little messy, but there's nothing in the example per se that would cause Terminal to drop or hide characters.
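
        One way to take the display out of the picture entirely is to check the arithmetic: each newline (1 character) becomes "things" (6 characters), so the output length is predictable. A minimal sketch, assuming the substitution from the original script; expected_length is a hypothetical helper and the sample text is invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: how long should the text be after s/\n/$replacement/g?
sub expected_length {
    my ($text, $replacement) = @_;
    # tr with an empty replacement list only counts; it changes nothing.
    my $newlines = ($text =~ tr/\n//);
    return length($text) + $newlines * (length($replacement) - 1);
}

my $input = "field one\nfield two\nfield three\n";   # invented sample
(my $output = $input) =~ s/\n/things/g;

printf "input=%d output=%d expected=%d\n",
    length($input), length($output), expected_length($input, 'things');
print length($output) == expected_length($input, 'things')
    ? "nothing was lost\n"
    : "some text is missing\n";
```

        If the lengths agree, the data is intact and whatever is missing on screen is a rendering matter rather than something the substitution did.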
Re: newline substitution is removing text too
by japhy (Canon) on Jan 21, 2007 at 04:37 UTC
    There are probably also carriage returns in the file as well. tr/\r//d or s/\r//g will remove them.
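
    A minimal sketch of that idea, using invented sample data and a hypothetical helper name (strip_carriage_returns). Presumably the concern is that a carriage return sent to a terminal moves the cursor back to the start of the line, so later output is drawn over earlier text and looks as if it had been deleted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a CR-free copy of the text and the number removed.
sub strip_carriage_returns {
    my ($text) = @_;
    my $count = ($text =~ tr/\r//d);   # tr///d returns how many characters it deleted
    return ($text, $count);
}

my $data = "first line\r\nsecond line\r\n";   # invented sample with CRLF endings
my ($clean, $removed) = strip_carriage_returns($data);
print "removed $removed carriage return(s)\n";

(my $flat = $clean) =~ s/\n/things/g;
print "$flat\n";
```

    A count of zero would rule carriage returns out.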

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      I just ran a /\r/ search of the file, and nothing came up. While I can see this happening, aren't those also converted to "\n" by Perl automatically, unless told to do otherwise? And even if it didn't, would they show up in other applications?

      And even then, why would it be in the middle of a line of text or HTML tag?

      Thanks for the idea, though, I hadn't thought of that possibility.

Re: newline substitution is removing text too
by rodion (Chaplain) on Jan 21, 2007 at 12:17 UTC
    It's suspicious that both deletions come at quotes (doubled double quotes in the first instance, and single quotes in the second), combined with your mentioning CSV, which usually needs to do some quote handling. Also, as graff did, I downloaded your code and example and I didn't see the problem (running on a Win2000 box, perl v5.8.8).

    Putting these together, I wonder if you might not be running the code you think you're running. If you totally rule that out, then the presence of a non-visible character other than "\r" may be the culprit, as graff suggests in his recommendation to use "od" or "xxd", or in some other way there's something going on that doesn't show up in the example text.

      I made it print to a file (previously just to the terminal window), and it worked fine.

      So... know anything about how OSX's bash terminal handles large lines of text, and why it might be cutting that out?

      And thanks for the suggestions, I hadn't heard of those programs before.

Re: newline substitution is removing text too
by NatureFocus (Scribe) on Jan 21, 2007 at 15:18 UTC

    If you edit a single character of a bad file and save it to a new file, does the new file show the problem? If it does, start hacking out parts of the file and retesting until you have nothing left but the problem.

    If the problem went away, it might be a control code or something that the editor filtered out, so do a file comparison to see what is different between the two files. If that fails, try debug (Windows, or the moral equivalent under unix) to get a hex dump of the two files and see what the difference is.
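
    The file comparison could also be sketched in Perl, under the assumption that both files fit comfortably in memory. The names slurp_raw and first_difference are made up for illustration, and the two strings below are invented stand-ins for the original and re-saved files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as raw bytes (no newline translation).
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                      # slurp mode, as in the original script
    return scalar <$fh>;
}

# Hypothetical helper: offset of the first differing character, or undef if identical.
sub first_difference {
    my ($left, $right) = @_;
    my $shorter = length($left) < length($right) ? length($left) : length($right);
    for my $i (0 .. $shorter - 1) {
        return $i if substr($left, $i, 1) ne substr($right, $i, 1);
    }
    # No mismatch in the common part: identical, or one is a prefix of the other.
    return length($left) == length($right) ? undef : $shorter;
}

# Invented stand-ins; with real files, use slurp_raw($path) for each side.
my $original = "color='#FF0000'>Markup\n";
my $edited   = "color='#FF0000'>Markup\r\n";

my $offset = first_difference($original, $edited);
print defined $offset
    ? "first difference at offset $offset\n"
    : "no difference found\n";
```

    Once the offset is known, a hex dump of a few bytes around it in each file should show what the editor changed.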

    -Eugene
      I made it print to a file (previously just to the terminal window), and it worked fine.

      So... know anything about how OSX's bash terminal handles large lines of text, and why it might be cutting that out?

      Thanks for the suggestion, though!
