Sekhar Reddy has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have to merge and split files based on a number of lines (a configurable parameter). I did this using shell, but it took too much time. Note that I have thousands of files containing almost 60 million lines in total.

Using Perl, I am trying to:

1. Read each file,
2. Redirect each line to an output file, and
3. Keep a counter of the lines written; once it reaches the limit, create a new output file and redirect output to that (a rough sketch is below).
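
Roughly, the loop I have in mind looks like this (just a sketch; the line limit, input file list and output names are placeholders):

use strict;
use warnings;

my $limit  = 1000;            # configurable number of lines per output file
my @inputs = glob("*.txt");   # the input files to merge
my $count  = 0;
my $seq    = 0;

open(my $out, '>', sprintf("out_%03d.txt", $seq)) or die $!;
for my $file (@inputs) {
    open(my $in, '<', $file) or die "Can't open $file: $!";
    while (my $line = <$in>) {
        if ($count == $limit) {     # limit reached: rotate to a new output file
            close $out;
            $count = 0;
            open($out, '>', sprintf("out_%03d.txt", ++$seq)) or die $!;
        }
        print $out $line;
        $count++;
    }
    close $in;
}
close $out;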

Is there a better way of doing this than reading all the files line by line, or a Perl module that would help?

Thank you in advance for your valuable time.

Regards, Chanti

Replies are listed 'Best First'.
Re: Merge and split files based on number of lines
by tobyink (Canon) on Jan 29, 2019 at 12:42 UTC

    Not a Perl answer, but most Linux/Unix systems should have /usr/bin/split which can split files based on byte counts or lines.

    If you really need to use Perl, shelling out to /usr/bin/split would probably be quickest and easiest.
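
    For example, from Perl it could look something like this (a rough sketch; assumes a Unix-like system with cat and split on the PATH, and placeholder file names):

    use strict;
    use warnings;

    # Merge several inputs and let split cut the combined stream every $lines lines.
    my $lines  = 1000;
    my @inputs = ('file1.txt', 'file2.txt', 'file3.txt');
    my $prefix = 'out_';

    # split reads from '-' (stdin) and writes out_aa, out_ab, ... with $lines lines each
    system("cat @inputs | split -l $lines - $prefix") == 0
        or die "split pipeline failed: $?";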

Re: Merge and split files based on number of lines
by Veltro (Hermit) on Jan 29, 2019 at 14:37 UTC

    I have created a quick experiment based on your question. Not sure what it is worth, but I tried to use the 'chunk_size' facility in MCE to achieve what you want.

    Since you didn't show any data to test on, I had to make some assumptions about what it looks like and created a script to generate some data files.

    The program starts up a Perl data reader program using open3 and feeds the results to MCE's 'mce_loop_f'. The chunk size determines how many lines end up in each new file. Files are written to an 'output' subfolder.

    Again, this is just a quick experiment of mine, and maybe someone else can say something about performance, but hey, it might do the job!

    test.pl

    use strict ;
    use warnings ;

    use IPC::Open3 qw( open3 ) ;
    use MCE::Loop ;

    MCE::Loop::init {
        max_workers => 3,
        chunk_size  => 10
    } ;

    my $cmd = 'perl readfiles.pl' ;

    sub do_work {
        my @ar = @{$_[0]} ;
        my $id = $_[1] ;
        open( my $fho, ">", "output\\$id.txt" )
            or die "Can't open > output\\$id.txt: $!" ;
        foreach ( @ar ) {
            print $fho $_ ;
        }
        close($fho) ;
    }

    my $pid = open3( undef, my $filedata, undef, $cmd, @_, ) ;

    print STDERR "Starting MCE loop\n" ;
    mce_loop_f {
        my ( undef, $chunk_ref, $chunk_id ) = @_ ;
        do_work( $chunk_ref, $chunk_id ) ;
    } $filedata ;
    print STDERR "Ended\n" ;

    waitpid( $pid, 0 ) ;
    warn "Failed\n" if ( $? << 8 != 0 ) ;
    close($filedata) ;

    readfiles.pl

    use strict ;
    use warnings ;

    my @files = qw( data1.txt data2.txt data3.txt ) ;

    foreach ( @files ) {
        open( my $fh, "<", $_ ) or die "Can't open < $_: $!" ;
        while(<$fh>) {
            chomp;
            print $_ . "\n" ;
        }
        close( $fh ) ;
    }

    createdatafiles.pl

    use strict ;
    use warnings ;

    open( my $fh, ">", "data1.txt" ) or die "Can't open > data1.txt: $!" ;
    for my $i ( 1..1000 ) { print $fh "$i\n" ; }
    close $fh ;

    open( $fh, ">", "data2.txt" ) or die "Can't open > data2.txt: $!" ;
    for my $i ( 1001..2000 ) { print $fh "$i\n" ; }
    close $fh ;

    open( $fh, ">", "data3.txt" ) or die "Can't open > data3.txt: $!" ;
    for my $i ( 2001..3000 ) { print $fh "$i\n" ; }
    close $fh ;

      Below is the code I have written. If this can be tuned in a better way, please suggest.

      I also have one more challenge: my files are in multiple directories, but with GetOptions I can give only one path as the input path. Is there any way I can give multiple paths, something like myinputpath/*/myfolder/filenames*?

      SAMPLE INPUT

      file1:
      11111
      22222
      33333
      44444

      file2:
      55555
      66666
      77777
      88888

      file3:
      99999
      00000

      My output should look like this: if I give the split count as 2, the script should generate 5 files.

      Outputfile1:
      11111
      22222

      Outputfile2:
      33333
      44444

      Outputfile3:
      55555
      66666

      and so on..

      Outputfile5:
      99999
      00000

      use strict;
      use warnings;
      use Getopt::Long qw(GetOptions);
      use File::Basename;
      use POSIX qw/strftime/;

      my $inputdir;
      my $OPCO;
      my $OutPath;
      my $LogFile;
      ################
      GetOptions(
          'inputdir=s'  => \$inputdir,
          'opco=s'      => \$OPCO,
          'outputdir=s' => \$OutPath,
          'logfile=s'   => \$LogFile,
      ) or die "Usage: $0 --inputdir inputPath --opco OPCO --outputdir OUTPUT_PATH --logfile LOG_FILE \n";

      my $FILESPLITCNT = 2;
      my $log = $LogFile;
      open(my $flog, '>>:encoding(UTF-8)', $log)
          or die "ERR: Could not open file (to write): '$log' $!";

      my $finputDir = $inputdir;
      opendir(DIR, $finputDir) or die "ERR: Can't open directory $finputDir: $!";
      my @files = readdir(DIR);

      my $CDRCNT  = 0;
      my $initSeq = 0;
      my $SEQNUM  = sprintf("%03d", $initSeq);
      my $out = $OutPath."/".$OPCO."_SS_".strftime('%Y%m%d_%H%M%S', localtime)."_".$SEQNUM.".cdr";
      open(my $fout, '>>:encoding(UTF-8)', $out)
          or die "ERR: Could not open file (to write): '$out' $!";
      print STDOUT "OUTPUTFILE PATH IS:".$out."\n";

      foreach my $file (@files) {
          if ( -f $finputDir . "/" . $file) {
              my ($filename, $dirs, $suffix) = fileparse($file);
              my $finput = $finputDir . $filename;
              my $fileLength = length($filename);
              if ($fileLength < 50) {
                  #KEEP EXTRA SPACES to filename
                  #$filename .= (" " * (50 - length($filename)))
              }
              open(my $fin, '<:encoding(UTF-8)', $finput)
                  or die "Could not open file (to read) '$finput' $!";
              print $flog "Processing File:" . $finput . "\n";
              while (my $row = <$fin>) {
                  chomp $row;
                  if ($CDRCNT == $FILESPLITCNT) {
                      close $fout;
                      $CDRCNT = 0;
                      $SEQNUM = $SEQNUM + 1;
                      $out = $OutPath.$OPCO."_SS_".strftime('%Y%m%d_%H%M%S', localtime)."_".sprintf("%03d", $SEQNUM).".cdr";
                      open( $fout, '>>:encoding(UTF-8)', $out)
                          or die "ERR: Could not open file (to write): '$out' $!";
                  }
                  print $fout "$filename;$row\n";
                  $CDRCNT = $CDRCNT + 1;
              }
              undef $finput;
              close $fin;
          }
      }
      close $fout;
      close $flog;

        Hi Sekhar Reddy, I ran your program and the results already match what you described.

        Is there any way that i can give multiple paths like something like ex: myinputpath/*/myfolder/filenames*

        Actually, I would not adapt this particular kind of program to do that automatically. If you really want to, then look at the examples for readdir; there are ways to apply grep patterns to the results of readdir, as in the sketch below. But personally I think the chance that wrong input slips in is too high, and you may also need more flexibility to manipulate the input.
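
        For illustration only (a quick sketch; the directory wildcard and the filename prefix are made up), filtering readdir results with grep could look like this:

        use strict ;
        use warnings ;

        # Collect files whose names start with 'filenames' from every matching folder
        my @dirs = glob("myinputpath/*/myfolder") ;
        my @files ;
        foreach my $dir ( @dirs ) {
            opendir(my $dh, $dir) or die "ERR: Can't open directory $dir: $!";
            push @files, map { "$dir/$_" } grep { /^filenames/ && -f "$dir/$_" } readdir($dh);
            closedir($dh) ;
        }
        print "$_\n" foreach @files ;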

        What I would do in this case is create an external configuration file that contains all the files or folders the program needs to scan. To obtain these folders I would write a small helper batch file or Perl script that collects them for me. Example for Windows:

        # Note that this is also just a quick scribble and you may need to adapt this to your needs
        # Folders
        dir /b /s /a:d | perl -ne "print $_ if $_=~/^c:\\myinputpath\\.*?\\myfolder\\.*?$/" > foldersconfig.txt
        # Files
        dir /b /s | perl -ne "print $_ if $_=~/^c:\\myinputpath\\.*?\\myfolder\\filenames.*?$/" > filesconfig.txt

        Once I have obtained the config file I can open it, check whether the contents are correct, and adjust the order in which the folders or files are processed. The next thing you can do is adapt your script:

        'inputdir=s' => \$inputdir, -> 'configfile=s' => \$configfile,

        But again, this is just a quick scribble and there may be better solutions out there:

        open( my $cfg, "<", $configfile ) or die "Can't open < $configfile: $!" ;
        my @files = () ;
        while( my $f = <$cfg>) {
            chomp $f ;
            # print "f=$f\n" ;
            opendir(my $dh, $f) or die "ERR: Can't open directory $f: $!";
            push @files, map { "$f\\$_" } grep { -f "$f\\$_" } readdir($dh);
            closedir($dh) ;
        }
        foreach ( @files ) {
            print "$_\n" ;
        }
Re: Merge and split files based on number of lines
by Eily (Monsignor) on Jan 29, 2019 at 13:22 UTC

    I think you are looking for the -n (or -p) option. It will execute the Perl code (generally a one-liner with -E, but it can also work with a file) over each line of each file passed as an argument, and sets $_ to the content of the line. If you use the -n option you have to print whatever output you need yourself; with the -p option it prints the (possibly modified) content of $_ after each line. The variable $ARGV contains the name of the currently processed file.

    E.g.: perl -nE 'say "$ARGV: $_" if /Valid/' file1 file2 file3
    will print all the lines from each of the three files that contain "Valid".
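
    With -p the (possibly modified) line is printed automatically, so merging three files into one while applying a change could look like this (the substitution is only an illustration):

    perl -pE 's/Valid/VALID/' file1 file2 file3 > merged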

Re: Merge and split files based on number of lines
by GrandFather (Saint) on Jan 29, 2019 at 20:50 UTC

    In principle what you describe is fairly trivial, but I wonder if there is a bigger picture here that is worth taking into account? Unless you are doing this as an exercise, splitting the files up sounds like the start of a processing pipeline. You may get significant advantage by combining processing steps in the pipeline by avoiding multiple reads and writes to disk for the same data. We can probably help you better if you describe the complete process.

    Your problem specification isn't great. Do you want to combine files if they have fewer lines than the split count? If you aren't combining files, do you want to write a short file when you run out of lines from the input file, or do something else (if so, what)? How do you want the output files named? Do the output files go in the same folder as the input files?

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Merge and split files based on number of lines
by Marshall (Canon) on Jan 30, 2019 at 00:37 UTC
    Hi Chanti!

    Like GrandFather, I am having some trouble really understanding the overall objective/work flow. From what you are describing, it sounds like you want to send the first N lines of a file to a pipe and save the "leftover" lines (if any exist) to another file for processing at a later time?

    Physically on the disk, no matter what tools you use, this means reading the entire input file. The first N lines would be sent to the pipe for processing by another program, and TotalInputFileLines - N lines need to be written back to the disk.

    You can determine the number of bytes in the input file without reading it (this is a number that the file system already knows). But counting the lines requires reading the data and looking for line endings.
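
    For instance (just a sketch, $file is a placeholder), the size comes from a file test, while a line count means a full read:

    my $bytes = -s $file;          # no read needed, the file system already knows this
    open(my $fh, '<', $file) or die $!;
    my $count = 0;
    $count++ while <$fh>;          # must read everything to find the line endings
    close($fh);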

    My first question is: Why save totalLines-N lines back to the disk? Why not just process them now? That way you only read all of the data once and you don't have to save raw unprocessed data back to the disk.

    Another question: what percentage of the input file is typically processed? This could matter. If the percentage is "small", then it might make sense to a) determine the current byte offset, "X", b) close the input file, re-open it in binary mode, throw away the first X bytes, and copy all remaining bytes to the new file (a rough sketch follows). This would require some experimentation. But binary file operations are faster than text-mode operations because there is no searching for line endings.
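
    A rough sketch of that idea (untested; the names and the line limit are placeholders, and it assumes no translating I/O layers so tell() gives a byte offset):

    use strict;
    use warnings;

    my $n        = 1000;              # lines to hand off for processing now
    my $infile   = 'input.dat';
    my $restfile = 'leftover.dat';

    open(my $in, '<', $infile) or die "Can't open $infile: $!";
    while (<$in>) {
        # ... send the line down the pipe / process it here ...
        last if $. >= $n;
    }
    my $offset = tell($in);           # a) current byte offset "X"
    close($in);                       # b) close the input file

    open($in, '<:raw', $infile) or die "Can't reopen $infile: $!";       # re-open in binary mode
    open(my $out, '>:raw', $restfile) or die "Can't open $restfile: $!";
    seek($in, $offset, 0) or die "seek failed: $!";                      # throw away the first X bytes
    my $buf;
    while (read($in, $buf, 1024 * 1024)) {
        print {$out} $buf or die "write failed: $!";
    }
    close($in);
    close($out);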

    It could be faster if the files you are writing and the ones you are reading are on different physical disk drives.

    Any performance data or other info could help us help you. A few thousand files and 60 million lines is not particularly intimidating.

    Update with another comment: there can be some performance issues with your processing pipeline. The pipe has a finite capacity, so the sender can't spew data out any faster than the receiver can take it. There are solutions to these sorts of problems, but more info is needed.