seek_m has asked for the wisdom of the Perl Monks concerning the following question:

Hi friends,

I would like to read a text file, extract the lines of text between two identical patterns, and write them to a new file. How can I do that?

For example, I want to get all the data between each pair of "start" tags, as shown below:

Sample input:

start

this is a example line

start

this is a example 1 line

start

this is a example 2 line

start

this is a example 3 line

start

Sample output:

this is a example 1 line

inside a text file.

this is a example 2 line

inside a text file 2. and so on. Thanks in advance for your help.

Update:

closed...no content

Replies are listed 'Best First'.
Re: extraction of text between 2 similar patterns in a text file
by marinersk (Priest) on May 16, 2011 at 15:40 UTC

    You're looking to write a basic file splitter.

    So what have you tried thus far?

    We can help you when you get stuck on Perl, but if what you're stuck on is software design, then you should probably look for softwaredesignmonks.org

    (In other words, we won't do your homework for you, but if you're stuck doing your homework, we can help you get unstuck. "help you" is not synonymous with "do your work for you".)

    For example: You need to read in one file and write a bunch of others, right? So, surely you've written code that reads the file in, right? Show us that code.

    update: Also, please advise if you would like comments on code style, or just code functionality. I won't waste your time if you don't want to hear my opinion on "better" ways to do something you wrote which is, in its own right, functional for your needs.

      This is my code, but the problem is that I am missing a few lines between some pairs of the same pattern.
      print"enter the input directory path:\n";
      chomp($indir=<STDIN>);
      print"enter the output directory name:\n";
      chomp($outdir=<STDIN>);
      if ($indir eq $outdir)
      {
      print"you cannot have same input and ouput directory please change:\n";
      exit();
      }
      else{
      chdir ("$indir") or die "$!";
      opendir(DIR,".") or die "$!";
      my @files=readdir DIR;
      print @files;
      close DIR;
      foreach $file(@files)
      {
      unless (($file eq ".") || ($file eq "..") )
      {
      $filer="$indir/$file";
      open filein,$filer;
      while (<filein>)
      {
      if(/(\d\_)+a1/.../(\d\_)+a1/)
      {
      print;
      $var3="$outdir/$1 +to+ $2.txt";
      open filew,">>$var3" or die "cannot open $out:$!";
      print filew $_,"\n";
      }
      }
      }
      }
      print"<----------------------------------------------->\n";
      print "\t\t action done\n";
      print "\a";
      print "\a";
      print "\a";
      print"<----------------------------------------------->\n";
      print"Results could be found in $outdir as txt files with TC name\n";
      print"<**********************..............*************************>>\n";
      close $out;
      }
      I am a novice at programming and need solid basic guidance. Thanks,

        seek_m:

        Since you don't use indentation, it makes your code difficult to read. So the first thing I'd suggest is reformatting it so you can more easily see the structure. I'd next suggest using "use strict;" and "use warnings;" to help find possible problems (such as closing file handles you don't actually have open).

        Then I'd look at the bits you don't understand, and figure out what they mean. For example, what do you want the if statement on line 29 (if(/(\d\_)+a1/.../(\d\_)+a1/)) to do? Since it governs when you actually emit files, it would be a good place to start.
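One way to see what that if statement does is to run the same construct over the sample data from the question. The sketch below is only an illustration under an assumption: "start" stands in for the real pattern, and the blank lines are left out. With three dots and the same pattern on both sides, the range operator behaves like a toggle, so each matching line alternately opens and closes the range.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data from the question, blank lines omitted.
my @input = (
    'start', 'this is a example line',
    'start', 'this is a example 1 line',
    'start', 'this is a example 2 line',
    'start', 'this is a example 3 line',
    'start',
);

my @kept;
for (@input) {
    # Same shape as the test in the posted script. The first match switches
    # the range on, the next match switches it off, and only the match after
    # that switches it on again.
    if (/^start$/ ... /^start$/) {
        push @kept, $_ unless /^start$/;
    }
}

print "$_\n" for @kept;
```

Run as written, this prints "this is a example line" and "this is a example 2 line" only. Every second block falls between a closing match and the next opening match, which would be consistent with lines going missing between some pairs of the same pattern.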

        Finally, you may want to read about the perl debugger, as executing code under the debugger can be very helpful both for fixing your program and for understanding how the code operates.

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

        Excellent; some code to work with.

        Okay, some basic coding structure issues first (and a point or two on style; I'm sorry, I can't help myself). Then we address your issue.

        1. Reformatted for readability. Please download the attached CODE from the bottom of this node. I will be using those line numbers, not your original ones, so please use it to follow along.
        2. Line 3: I strongly recommend you  use strict;  always. It will save you a LOT of hassle. With that one line of code, you just hired your Perl interpreter as a debugger, and it even works for free. It will seem like a husband or wife, continually finding fault with your work. But it will almost always be right, and you will become a better Perl programmer much more quickly.
        3. Line 20: You do  opendir DIR .
          Line 23: You do  close DIR .
          You have made this mistake in previous posts.  open  and  close  are matched;  opendir  and  closedir  are matched.

          It's sort of like getting into your car through the window. You can open the window, and then close the window. Or you can break the window, and then repair the window. You don't open the window and then repair the window. These things are matched sets, and they need to be used as matched sets.

          Side note: Stealing code (especially your own) is a time-honored tradition. You get something that works, you save it, and you re-use it to save yourself time and trouble.

          However, since you made this exact same error in a previous script posted here, it seems you have re-used a broken piece of code instead of a working piece of code. You might consider adjustments to your process to reduce or eliminate the threat of this error in the future.
          Might save you some time and hassle.

        4. Line 29: You do  open filein,$filer; 
          It is common to make the bareword file handle in all caps so it stands out. Not required syntactically, but advisable for readability.
          I'd do:  open FILEIN,$filer; 
          Note: If you do this, remember you'll have to change all your references from  filein  to  FILEIN .
        5. Line 29: You open a file here but never close it.
        6. Line 36: Again, I'd uppercase the file handle.
        7. Line 36: You open a file here but never close it.
        8. Line 36: You claim the file you cannot open is called  $out , but there is no  $out . You would probably have caught this if you'd been using  use strict; .
        9. Line 50: You do  close $out; . But you never opened it. You would probably have caught this if you'd been using  use strict; .
        10. Line 32: Here is the crux of your problem, once you clean up the rest of your script. Does this regular expression even work? I'm no expert on regular expressions, but this looks funky to me (which is humorous, given that regular expressions generally look funky anyway, but I digress).

          I'd write a test script to confirm how the regular expression will function. Sample:

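          #!/usr/bin/perl -w

          use strict;

          # Placeholder value so the script compiles under strict.
          my $outdir = '.';

          my @TestInput =
          (
              'Line One',
              'Line Two',
              'Line Three',
              'Line Four',
              'Line Five',
          );

          foreach my $testLine (@TestInput)
          {
              print "-----> '$testLine'\n";
              # Match against $testLine explicitly; a bare /.../ would test $_.
              if ($testLine =~ /(\d\_)+a1/ ... $testLine =~ /(\d\_)+a1/)
              {
                  my $var3 = "$outdir/$1 +to+ $2.txt";
                  print "    Found '$var3'\n";
              }
          }

          exit;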

        With your code structure repaired,  use strict;  at your side, and the behavior of your regular expression confirmed with the test script, see if you can make the code work.

        And we'll all be here if you get stuck again. Just show us what you tried.

        The reformatted script:

        #!/usr/bin/perl -w

        use strict;

        {
            print"enter the input directory path:\n";
            chomp($indir=<STDIN>);

            print"enter the output directory name:\n";
            chomp($outdir=<STDIN>);

            if ($indir eq $outdir)
            {
                print"you cannot have same input and ouput directory please change:\n";
                exit();
            }
            else
            {
                chdir ("$indir") or die "$!";
                opendir(DIR,".") or die "$!";
                my @files=readdir DIR;
                print @files;
                close DIR;
                foreach $file(@files)
                {
                    unless (($file eq ".") || ($file eq "..") )
                    {
                        $filer="$indir/$file";
                        open filein,$filer;
                        while (<filein>)
                        {
                            if(/(\d\_)+a1/.../(\d\_)+a1/)
                            {
                                print;
                                $var3="$outdir/$1 +to+ $2.txt";
                                open filew,">>$var3" or die "cannot open $out:$!";
                                print filew $_,"\n";
                            }
                        }
                    }
                }
                print"<----------------------------------------------->\n";
                print "\t\t action done\n";
                print "\a";
                print "\a";
                print "\a";
                print"<----------------------------------------------->\n";
                print"Results could be found in $outdir as txt files with TC name\n";
                print"<**********************..............*************************>>\n";
                close $out;
            }
        }

        exit;
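The structural points in the numbered list (opendir paired with closedir, every opened file closed, every variable declared under strict) might look roughly like this once applied. This is a sketch, not a corrected version of the posted script: the helper names plain_files and copy_lines are made up for illustration, and it leaves the pattern matching out entirely. It also uses lexical file handles and three-argument open rather than the uppercase barewords suggested in the list, which is a substitution on my part.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: names of the plain files in $dir.
# The -f test also skips "." and "..", since those are directories.
sub plain_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory '$dir': $!";
    my @files = sort grep { -f "$dir/$_" } readdir $dh;
    closedir $dh;    # opendir pairs with closedir, not close
    return @files;
}

# Hypothetical helper: appends every line of $from to $to.
sub copy_lines {
    my ($from, $to) = @_;
    open(my $in,  '<',  $from) or die "Cannot read '$from': $!";
    open(my $out, '>>', $to)   or die "Cannot append to '$to': $!";
    while (my $line = <$in>) {
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot close '$to': $!";
    return;
}
```

Because the handles are lexical variables, strict complains at compile time about a misspelled or never-opened handle such as $out in the original.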

Re: extraction of text between 2 similar patterns in a text file
by Anonymous Monk on May 16, 2011 at 15:22 UTC

    Pretend the text is in Chinese (or something else you can't understand), all the writing is on notepads and that you're so drunk you can't remember what you did 10 seconds ago.
    Now, write down a set of instructions so that your imaginary self could accomplish the task of copying the desired symbols from the first notepad to the stack of blank ones.

    Replace "notepad" with "file" and the algorithm should now be easy to translate into perl.
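Written out that way, the instructions come to something like: read one line; if it is a marker, put the current notepad away; otherwise copy the line onto the current notepad, taking a fresh one if none is open. A minimal sketch of that translation follows. The sub name split_on_marker and the block_N.txt file names are assumptions made for illustration, as is the choice to skip blank lines and to keep any text that follows the final marker.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: writes each block of lines that follows a "start"
# marker in $in_path to its own file in $out_dir, and returns the paths.
sub split_on_marker {
    my ($in_path, $out_dir) = @_;
    open(my $in, '<', $in_path) or die "Cannot read '$in_path': $!";

    my $out;                # current output handle, if any
    my $seen_marker = 0;    # nothing is copied before the first marker
    my $count = 0;
    my @written;

    while (my $line = <$in>) {
        if ($line =~ /^start\s*$/) {
            # A marker line: put the current notepad away.
            close $out if $out;
            undef $out;
            $seen_marker = 1;
            next;
        }
        next unless $seen_marker;
        next unless $line =~ /\S/;    # assumption: blank lines are not wanted

        if (!$out) {
            # First real line after a marker: take a fresh notepad.
            $count++;
            my $path = "$out_dir/block_$count.txt";
            open($out, '>', $path) or die "Cannot write '$path': $!";
            push @written, $path;
        }
        print {$out} $line;
    }

    close $out if $out;
    close $in;
    return @written;
}
```

Each step only needs to know whether a notepad is currently open, which fits the "can't remember what you did 10 seconds ago" constraint.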

Re: extraction of text between 2 similar patterns in a text file
by ww (Archbishop) on May 16, 2011 at 17:26 UTC
    For starters, what does "closed...no content" in your "Update" mean? Did you despair of getting code to do the specific job? Did you find an answer that led you in the right direction? Or something else?

    Then, please clarify what your sample output is telling us: on first reading I thought you wanted one output file with all the material except the "start" lines... but, on rereading, the phrase "inside a text file 2. and so on" (belatedly) caught my eye, making me wonder if you wanted each instance of the boldfaced material to go in a separate file (and no, I didn't bother to try to answer that question by studying the code you posted: it's not easily readable and less than readily comprehensible for reasons outlined by others).

    But, the Perl slogan "There is more than one way to do it" (aka TIMTOWTDI or TIMTOADY, etc.) is definitely a truism for your problem, however one is supposed to understand your question.

    • split is one possibility
    • changing the input record separator (read about $/)
    • using a regular expression (example provided because this is *NOT* the best alternative):
      #!/usr/bin/perl
      use warnings;
      use strict;
      use 5.012;
      # 905094

      my @data = <DATA>;
      my $data;
      my $file = "f:/_wo/905094.txt";
      open ( OUT, "> $file" ) or die "Can't open FH for write, $!";

      for $data(@data) {
          chomp $data;
          if ($data !~ /start/ ) {
              # say $data;
              say OUT $data . "\t|";
          }
      }
      say "done";

      =prints to the file:
      this is a example line  |
              |
      this is a example 1 line        |
              |
      this is a example 2 line        |
              |
      this is a example 3 line        |
              |

      =cut

      __DATA__
      start
      this is a example line

      start
      this is a example 1 line

      start
      this is a example 2 line

      start
      this is a example 3 line

      start

    and there are more.
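As a rough illustration of the $/ alternative from the list above: setting the input record separator to the marker line makes each read return one whole block. The sub name blocks_between is hypothetical, and the sketch assumes every marker sits at the end of a line and is followed by a newline. Note that $/ is a plain string, so any line that merely ends in "start" would also split a block.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the non-empty blocks of text that are
# separated by $marker lines in the already-open filehandle $fh.
sub blocks_between {
    my ($fh, $marker) = @_;
    local $/ = "$marker\n";    # each read now ends at a marker line
    my @blocks;
    while (my $chunk = <$fh>) {
        chomp $chunk;                      # chomp strips the current $/ value
        $chunk =~ s/\A\s+|\s+\z//g;        # trim surrounding blank lines
        push @blocks, $chunk if length $chunk;
    }
    return @blocks;
}

# Usage with an in-memory filehandle holding part of the sample data.
my $sample = "start\n\nthis is a example line\n\nstart\n\nthis is a example 1 line\n\nstart\n";
open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!";
print "[$_]\n" for blocks_between($fh, 'start');
close $fh;
```

Using local keeps the change to $/ confined to the sub, so other reads in the program still see ordinary lines.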

    This site, including Super Search and Tutorials, can help you get started with Perl (yes, I suspect the code in your reply is cargo-culted from somewhere without much understanding). So too can numerous online tutorials (for example, those at http://learn.perl.org/, but search this site a bit to see which are well-regarded and which are trash) and, finally, being very much "old-school," I also suggest ( tada!! ) "books" such as Learning Perl.

    Addendum:

    Your title, if carefully chosen, suggests you are thinking of the "this is a example \d line" lines as the "similar patterns," perhaps because those are what you want to capture. But for your purposes, think instead of what you want to discard: instances of "start" (with or without blank lines preceding).

    It's easy enough to write a regex to deal with the "this is..." lines (matching on any digit and capturing work just fine when the regex is constructed correctly), but there's a lot less opportunity for error (in your example data) in matching on the five-letter word "start."