Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, everyone. I'm a beginner at Perl and programming in general, so apologies if the answer to this question seems obvious to you all.

I'm trying to store each word from a txt file as an element in an array, join the elements using a dash, and print the joined string so I can check whether everything was split in the right places (so if there's a dash where I wanted the file to be split, then I know things worked properly).

This was mostly working except that it wouldn't split at newlines, and a little googling told me I might be in over my head trying to fix that, so instead I put the text of the file into a site that removed the newlines for me and then pasted the newline-free text into my txt file. After doing that, suddenly my array only had the first element, which I set as the word NULL instead of pulling it from the text, and nothing else.

Here's the relevant part of my code:

my $in=<STDIN>;
my @words;
my $joined;
$words[0] = 'NULL';
while(my $line = <STDIN>){
    @words = grep { /\S/ } split /[:.,\s]/, $line;
    my $joined = join("-", @words);
    print "$joined\n";
}

I assume something about the text I copied from the newline-removing site made it unreadable...? The original text with the newlines was copied from a website as well, but it works fine.

Basically: What might be the difference between the two texts? Is there anything I can do to get both versions of the text to work?

Re: Can read one txt file and not another?
by GrandFather (Saint) on Jul 29, 2021 at 02:59 UTC

    Why don't you show us samples of the two versions of the text? Start with a very short first text file containing a few words and new lines, then process that to produce the broken version. We can make lots of guesses about how the text files might be wrong and waste our time and yours doing so. Showing us a small sample of what you are actually dealing with will save everyone time and get you a much better answer.

    Why do you read the first line into $in then do nothing with it? Could that be related to your problem?

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

      Well, I feel silly now. Removing $in appears to have fixed the problem unless it was something seemingly unrelated I did later in the code that did it (but I doubt that). I started with some old code I wrote years ago (took a course that included some Perl and then forgot it) and changed a lot but just hadn't taken $in out.

      Thanks for the help! I was expecting something more complicated and missed a very simple solution.

        Removing $in appears to have fixed the problem unless it was something seemingly unrelated I did later in the code that did it (but I doubt that).
        If I'm understanding your post correctly, it's completely relevant.

        The file that didn't work had all the newlines removed, making the entire file a single line. Since there was only one line in the file, reading the first line into $in and throwing it away... threw away the entire file, leaving nothing for the loop to read.
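
        The effect described above can be reproduced without any files. This is only a minimal sketch: lines_seen_by_loop is a hypothetical helper, and it reads from an in-memory filehandle rather than STDIN so the two cases can be compared side by side.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper (not from the original script): mimics its reading
        # pattern on a string instead of STDIN and returns the lines the loop sees.
        sub lines_seen_by_loop {
            my ($text) = @_;
            open(my $fh, '<', \$text) or die "error: open in-memory file: $!";
            my $in = <$fh>;                 # first line is read and thrown away
            my @seen;
            while (my $line = <$fh>) {      # the loop only gets whatever is left
                chomp $line;
                push @seen, $line;
            }
            close $fh;
            return @seen;
        }

        my @multi  = lines_seen_by_loop("dog\ncat\nbird\n");
        my @single = lines_seen_by_loop("dog cat bird\n");
        print scalar(@multi),  " line(s) left from the multi-line text\n";
        print scalar(@single), " line(s) left from the single-line text\n";
        ```

        With the three-line text the loop still sees two lines; with the single-line text it sees none, which matches the symptom in the original question.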

        This isn't Stack Overflow. If you find a solution on your own, it's appreciated if you post it along with whatever data is needed to explain it. Real people spend real time on others' problems. Although it's voluntary and a labor of love for some, it is nonetheless appropriate to "wrap up" things. Beyond that, PM has the largest collection of the most complicated and interesting Perl questions and solutions in the world, and a wrap-up lets this post help someone else in the future. Right now it's lacking two things: the data others requested and the solution.
Re: Can read one txt file and not another?
by kcott (Archbishop) on Jul 29, 2021 at 04:31 UTC

    As ++GrandFather has already indicated, you're not showing us sufficient information to help you in any meaningful way. Have a read through "How do I post a question effectively?" and "Short, Self-Contained, Correct Example", then provide us with code we can run, as well as both inputs, both actual outputs, and both expected outputs. Wrap all data in <code>...</code> (as you did with your code above) so that we can see a verbatim copy of what you're looking at.

    Copying text from a webpage is highly problematic. Sequences of whitespace (e.g. newlines, tabs, spaces) will often be collapsed into a single space. I don't know how you did the copying but it could be part of your problem.

    I suspect you may have a problem at the conceptual level (that's fine, you're at the learning stage). You're reading input line-by-line; you haven't shown any change to $/; so each line (with the possible exception of the last one) will be terminated with a newline. You can get rid of the newline at the end with chomp. There should not be any embedded newlines to split on.
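
    A small sketch of that point, assuming the default $/ (a newline). chomped_lines is a hypothetical helper that reads from a string through an in-memory filehandle, which behaves like <STDIN> for this purpose.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: reads a string line by line and returns each line
    # with its trailing newline removed.
    sub chomped_lines {
        my ($text) = @_;
        open(my $fh, '<', \$text) or die "error: open in-memory file: $!";
        my @lines;
        while (my $line = <$fh>) {
            chomp $line;            # removes a trailing $/ (newline by default)
            push @lines, $line;
        }
        close $fh;
        return @lines;
    }

    for my $line (chomped_lines("dog cat\nbird\n")) {
        my $embedded = ($line =~ /\n/) ? "yes" : "no";
        print "line='$line' embedded newline: $embedded\n";
    }
    ```

    Each line arrives with at most one newline, at the very end, so after chomp there is nothing newline-related left for split to deal with.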

    "What might be the difference between the two texts?"

    Consider the following series of commands (the '$' signs in the output indicate newlines):

    $ cat > dog_cat_1
    dog
    cat
    $ cat -vet dog_cat_1
    dog$
    cat$
    $ cp dog_cat_1 dog_cat_2
    $ cat -vet dog_cat_2
    dog$
    cat$
    $ perl -pi -e 's/\n//g' dog_cat_2
    $ cat -vet dog_cat_2
    dogcat

    So the difference is that you start with two distinct entities and end up with just one. See perlrun, and its -i and -p sections for more about that perl command. If you really do need to strip out newlines, doing this yourself is probably a lot less work than passing data to/from a 3rd-party website; and you stay in control of the process.
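
    If the goal is to keep the words apart after the newlines are gone, one option is to turn each newline into a space instead of deleting it. This is a sketch under that assumption, and newlines_to_spaces is a hypothetical helper name.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: replaces every newline with a single space,
    # so "dog\ncat\n" does not collapse into "dogcat".
    sub newlines_to_spaces {
        my ($text) = @_;
        (my $flat = $text) =~ tr/\n/ /;   # copy first, then transliterate the copy
        return $flat;
    }

    print newlines_to_spaces("dog\ncat\n"), "\n";
    ```

    The result keeps a trailing space where the last newline was; splitting on whitespace afterwards ignores it.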

    — Ken

      Sorry about the vague question! The problem with reading the newline-free file is fixed now -- disappeared when I got rid of the unnecessary $in -- but I'm going to give getting rid of the newlines myself a go after reading your post.

      I definitely have a lot of conceptual gaps, so thank you so much for going into detail!

Re: Can read one txt file and not another?
by eyepopslikeamosquito (Archbishop) on Jul 29, 2021 at 06:08 UTC

    I'm trying store each word from a txt file as an element in an array, join the elements using a dash, and print the joined string so I can check whether everything was split in the right places (so if there's a dash where I wanted the file to be split, then I know things worked properly)

    Displaying with a dash for verification is hardly ideal - what if you want to allow hyphenated words, say? I also think it's more flexible to read from a test file (passed as the first argument to the program) rather than hard-wiring STDIN. You might further like to consider how to write an automated test for this.

    In case it is of use, this is how I'd go about it:

    use strict;
    use warnings;
    use Data::Dumper;

    my $fname = shift or die "usage: $0 fname\n";
    open(my $fh, '<', $fname) or die "error: open '$fname': $!";

    # Slurp file contents into string $contents
    my $contents = do { local $/; <$fh> };
    close $fh;

    # Extract what you want, for example
    my @words = $contents =~ /\w+/g;

    # ... or split on what you don't want, for example
    # my @words = split /\s+/, $contents;

    # In both approaches above you can tweak the regex to suit

    # Print out the extracted word list to verify
    for my $word (@words) {
        print "word='$word'\n";
    }

    # ... or use Data::Dumper
    print Dumper( \@words );

    # ... or write an automated test with specified input and expected output
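
    One way the automated test might look, using the core Test::More module. This is only a sketch: extract_words is a hypothetical wrapper around the /\w+/g extraction, and the inputs are made up for illustration.

    ```perl
    use strict;
    use warnings;
    use Test::More tests => 3;

    # Hypothetical wrapper around the /\w+/g extraction so it can be tested.
    sub extract_words {
        my ($contents) = @_;
        my @words = $contents =~ /\w+/g;
        return @words;
    }

    is_deeply( [ extract_words("dog cat\nbird\n") ], [qw(dog cat bird)],
        'words are found across spaces and newlines' );
    is_deeply( [ extract_words("") ], [],
        'empty input gives no words' );
    is_deeply( [ extract_words("well-known") ], [qw(well known)],
        'hyphenated word is split in two by \w+' );
    ```

    The third case documents the hyphen behaviour rather than endorsing it; if hyphenated words should stay whole, the regex and the expected list would both need to change.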

Re: Can read one txt file and not another?
by Marshall (Canon) on Jul 29, 2021 at 22:59 UTC
    First, thank you for leaving your original post alone once you figured out that: my $in=<STDIN>; was the problem. It is ok to update your post noting the change, perhaps: #my $in=<STDIN>; #UPDATE: removing this line solved problem.

    Your seemingly simple question actually brings up a number of fine points. When extracting tokens from a line, there are two basic ways: (1)split and (2)regex match global. The mantra is: "use split when you know what to throw away and use match global when you know what to keep". More in a moment...

    To backtrack a bit, "\s" in Perl lingo means any whitespace character: <FF><LF><CR><TAB><SPACE>. If you split upon "\s+", that will throw away any sequence of consecutive whitespace characters. Your code splits upon a single separator character, not a potential sequence of them, so adjacent separators such as ": " leave an empty field in between (which your grep then filters out). I suspect that [:.,\s]+ would be closer to what you really want, albeit not what you actually want (make the suggested change in the code below and run it for yourself).
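
    To make the difference concrete, here is a short sketch comparing the two patterns. split_single and split_runs are hypothetical helper names, and the sample strings are invented for the example.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helpers: the same character class, without and with '+'.
    sub split_single { my ($line) = @_; return split /[:.,\s]/,  $line; }
    sub split_runs   { my ($line) = @_; return split /[:.,\s]+/, $line; }

    my $line = "a comma: list,a,b\n";

    # Quote each field so empty fields are visible in the output.
    print join('|', map { "'$_'" } split_single($line)), "\n";
    print join('|', map { "'$_'" } split_runs($line)),   "\n";

    # Leading separators still produce an empty first field, even with '+'.
    print join('|', map { "'$_'" } split_runs("  dog cat\n")), "\n";
    ```

    The first line of output contains an empty field between 'comma' and 'list' because ":" and the following space are treated as two separate separators. The second has no empty fields. The third starts with an empty field, which is why [:.,\s]+ is closer but still not quite right.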

    Note: As you see below, I used single quotes around the "@words". In my experience this is a better way to go rather than separating tokens with "-". Mileage varies.

    In Perl you will see (a) split ' ',$line and (b) split /\s+/,$line. That ' ', like many things in Perl, is a short-cut that essentially means "do a split on /\s+/, but throw away blank spaces at the beginning of the line". It does not mean to split upon a single literal ' ' character. Splitting lines upon spaces is the most common form of split and Perl is optimized for that.
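
    A quick sketch of the two forms on a line with leading spaces. The helper names split_awk and split_regex are hypothetical, and the input string is made up.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helpers wrapping the two forms of split discussed above.
    sub split_awk   { my ($line) = @_; return split ' ',   $line; }
    sub split_regex { my ($line) = @_; return split /\s+/, $line; }

    my $line = "   dog  cat\n";

    my @awk   = split_awk($line);     # leading whitespace is skipped
    my @regex = split_regex($line);   # leading whitespace yields an empty field

    print scalar(@awk),   " fields from split ' '\n";
    print scalar(@regex), " fields from split /\\s+/\n";
    ```

    The ' ' form returns two fields (dog, cat), while /\s+/ returns three because the leading spaces produce an empty first field. Trailing empty fields are dropped by split in both cases.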

    In this particular case, I decided to use 'match global' instead of 'split'. This avoids the problem of having to get rid of leading spaces after the split.

    Many of the files that I process have the possibility of a user interaction that may add one or more blank lines at the end of file. So I almost always skip lines "which have no data". Here is my code. Play with it. Break it. See what changes are necessary for your specific application.

    My textual description above may have some errors in it. This is tricky stuff. Run this code and see what it does.

    use strict;
    use warnings;

    while (my $line = <DATA>)
    {
        (my @words) = $line =~ /([^:.,\s]+)/g;
        # (my @words) = split /[:.,\s]+/, $line;  #TRY THIS LINE INSTEAD
        next unless @words;   # skip input lines that have no "words"
        print "\'$_\' " foreach @words;
        print "\n";
    }

    =prints: Note: that the first data line with only ':' is skipped.
    'this' 'is' 'a' 'simple' 'space' 'separated' 'line'
    'this' 'is' 'a' 'line' 'with' 'spaces' 'at' 'the' 'beginning'
    'this' 'line' 'has' 'multiple' 'spaces' 'embedded' 'in' 'it'
    'a' 'comma' 'list' 'a' 'b'
    'unconsidered' 'are' '(1)' 'item' 'lists' 'or' '(comments' 'like' 'this)'
    '$this_is_a_program_variable'
    'this' 'shows' '"a' 'quote"'
    =cut

    __DATA__
    :
    this is a simple space separated line
        this is a line with spaces at the beginning
    this line has    multiple    spaces embedded in it
    a comma: list,a,b
    unconsidered are: (1) item lists or (comments like this)
    $this_is_a_program_variable
    this shows "a quote"