extract line

lallison has asked for the wisdom of the Perl Monks concerning the following question:

Re: extract line by mtmcc (Hermit) on Jul 20, 2013 at 17:26 UTC
I'm not sure where the "ROCKER ARM ASSEMBLIES AND GROUPS" is coming from, but if I understand what you're trying to do, I think you might be overcomplicating it. Something like this should work:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $serialNumbers = $ARGV[0];
my $partNumbers = $ARGV[1];
my @line;
my %partNumber;

# Load the part numbers of interest as hash keys.
open (my $fileTwo, "<", $partNumbers) or die "Cannot open $partNumbers: $!";
while (<$fileTwo>) {
    chomp;
    $partNumber{$_} = 1;
}

# Print every line whose first colon-separated field is a wanted part number.
open (my $fileOne, "<", $serialNumbers) or die "Cannot open $serialNumbers: $!";
while (<$fileOne>) {
    @line = split (":", $_);
    print STDERR "$_" if exists $partNumber{$line[0]};
}
```

I hope that helps.
Re^2: extract line by lallison (Novice) on Jul 20, 2013 at 21:10 UTC
Thank you, I will give this a try. This file has over a million lines and is running very slowly. Is there a quick way to control the buffering?
Re^3: extract line by mtmcc (Hermit) on Jul 20, 2013 at 22:02 UTC
Not that I can see, I'm afraid. I think the split probably slows it down, but I don't think it's avoidable. If you do find a faster way, please let me know! Best of luck.
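One small tweak that may reduce the cost of the `split` discussed above: assigning to an array, as in `@line = split (":", $_)`, splits every field, whereas an explicit LIMIT of 2 lets `split` stop after the first colon. The following is only a minimal sketch, assuming colon-delimited lines whose first field is the part number; `first_field` is a hypothetical helper name, and any speed-up should be measured on the real data rather than assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: return only the text before the first colon.
# The LIMIT of 2 tells split to produce at most two fields, so the
# remainder of the line is not searched for further delimiters.
sub first_field {
    my ($line) = @_;
    my ($key) = split /:/, $line, 2;
    return $key;
}

print first_field("3478749:AA:1D:AAA:Descriptions:C:2\n"), "\n";
```

When the result is assigned to a list of scalars, as in `my ($part) = split /:/, $line;`, Perl already applies a limit implicitly, so the explicit LIMIT mainly matters for the array form.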
Re^2: extract line by lallison (Novice) on Jul 20, 2013 at 22:59 UTC
I'm not very familiar with `@ARGV`. Where am I giving the program my file names? I am also printing to an outfile. I tried it as written and received an "uninitialized value $partNumbers" warning, but I believe this is due to the filename.
Re^3: extract line by mtmcc (Hermit) on Jul 20, 2013 at 23:23 UTC
`$ARGV[0]` is the first argument on the command line (the name of the file with the longer 'serial numbers'), and `$ARGV[1]` is the second argument (the file containing the integers). To run it: `script.pl serialnumbers.txt integers.txt`. To print to a third file, add this somewhere before the second while loop: `open (my $output, ">", 'nameOfOutputFile.txt');` and change `print STDERR` to `print $output`. I hope that works!
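The command-line handling described above can be sketched as follows. This is an illustration under assumptions, not the code from the thread: `filter_lines` is a hypothetical helper, the output file is taken as a third argument rather than hard-coded, and the argument-count check is there so that a missing file name produces a usage message instead of an "uninitialized value" warning.

```perl
use strict;
use warnings;

# Hypothetical helper: copy the lines of $data_file whose first
# colon-separated field is listed in $keys_file into $out_file.
# Returns the number of lines written.
sub filter_lines {
    my ($data_file, $keys_file, $out_file) = @_;

    # Load the wanted part numbers as hash keys.
    open my $keys_fh, '<', $keys_file or die "Cannot open $keys_file: $!";
    my %wanted;
    while (my $key = <$keys_fh>) {
        chomp $key;
        $wanted{$key} = 1;
    }
    close $keys_fh;

    open my $in,  '<', $data_file or die "Cannot open $data_file: $!";
    open my $out, '>', $out_file  or die "Cannot open $out_file: $!";
    my $written = 0;
    while (my $line = <$in>) {
        my ($id) = split /:/, $line, 2;
        next unless defined $id && exists $wanted{$id};
        print {$out} $line;
        $written++;
    }
    close $in;
    close $out or die "Cannot close $out_file: $!";
    return $written;
}

# Run only when all three file names were given on the command line.
if (@ARGV == 3) {
    my $count = filter_lines(@ARGV);
    print "$count lines written to $ARGV[2]\n";
}
else {
    warn "usage: $0 serialnumbers.txt integers.txt output.txt\n";
}
```

It would be run as `script.pl serialnumbers.txt integers.txt output.txt`, where the third file name is whatever output file is wanted.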
Re^4: extract line by lallison (Novice) on Jul 21, 2013 at 16:20 UTC
Re^5: extract line by frozenwithjoy (Priest) on Jul 21, 2013 at 16:38 UTC
Re: extract line by frozenwithjoy (Priest) on Jul 20, 2013 at 17:15 UTC
I made a hash containing all of the part numbers of interest (file 1), then I read through file 2 and printed any lines whose part number exists in the hash of part numbers:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my %part_nums = map { $_ => 1 } qw(3478749 3633731);

while ( my $line = <DATA> ) {
    my ($part) = split /:/, $line;
    print $line if exists $part_nums{$part};
}

__DATA__
3478748:AA:1D:AAA:Descriptions:C:2
3478749:AA:1D:AAA:Descriptions:C:2
3633731:AA:3E:AAA:Descriptions:C:2
```

OUTPUT:

```
3478749:AA:1D:AAA:Descriptions:C:2
3633731:AA:3E:AAA:Descriptions:C:2
```
Re: extract line by kcott (Archbishop) on Jul 21, 2013 at 06:45 UTC
G'day lallison,

Welcome to the monastery.

This code does what you describe as being wanted:

```
$ perl -Mstrict -Mwarnings -e '
    use autodie;
    use Tie::File;

    my $re = qr{^((\d+).+$)}s;
    my %data_for_part;

    open my $f1, "<", "pm_1045452_file1.txt";
    while (<$f1>) {
        /$re/;
        $data_for_part{$2} = $1;
    }
    close $f1;

    tie my @file2, "Tie::File", "pm_1045452_file2.txt";
    print $data_for_part{$_} for @file2;
    untie @file2;
'
3478749:AA:1D:AAA:DescriptionsY:C:2
3633731:AA:3E:AAA:DescriptionsZ:C:2
```

I made a minor change to "File1" to show different `Descriptions`:

```
$ cat pm_1045452_file1.txt
3478748:AA:1D:AAA:DescriptionsX:C:2
3478749:AA:1D:AAA:DescriptionsY:C:2
3633731:AA:3E:AAA:DescriptionsZ:C:2
```

"File2" data is as you show it:

```
$ cat pm_1045452_file2.txt
3478749
3633731
```

Notes:

- You don't need to chomp any input nor add any newlines to the output.
- There are no temporary arrays to process.
- Tie::File comes standard with Perl: you won't need to install it.

"This file has over a million lines and is running very slow."

Given that you've been provided with a number of solutions, use Benchmark to determine which works best for you. (That module also comes standard with Perl.)

[Aside: The code you posted is difficult to read due to the `<code>` tag issue. You appear to have made an effort but were unsuccessful: see Writeup Formatting Tips for how, where and why to do it.]

-- Ken
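The suggestion above to use Benchmark could look something like this. It is a rough sketch only: the four extraction strategies and their names are assumptions chosen for illustration, the sample line is taken from the data shown in this thread, and timing a single in-memory line says nothing about file I/O, so real measurements should be made on the real files.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = "3478749:AA:1D:AAA:Descriptions:C:2\n";

# Hypothetical candidates: four ways to pull out the leading part number.
my %extractors = (
    split_all   => sub { my @f = split /:/, $line; $f[0] },
    split_limit => sub { my ($id) = split /:/, $line, 2; $id },
    regex       => sub { my ($id) = $line =~ /^(\d+)/; $id },
    substr_idx  => sub { substr $line, 0, index($line, ':') },
);

# A negative count asks Benchmark to run each sub for at least that
# many CPU seconds, then print a comparison table of rates.
cmpthese(-1, \%extractors);
```

The table that `cmpthese` prints lists each candidate's rate and the percentage differences between them; the numbers will vary by machine and Perl version.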
Re^2: extract line by lallison (Novice) on Jul 21, 2013 at 17:28 UTC
Are you running the file with the `cat pm_1045452_file2.txt` statement?
Re^3: extract line by kcott (Archbishop) on Jul 22, 2013 at 08:50 UTC
If you're unfamiliar with \*nix OSes, perhaps what I posted requires a little further explanation:

The actual code I ran is the "`perl -Mstrict -Mwarnings -e ' ... '`" part (see perlrun). The two lines immediately following that second single quote are the output produced by the print statement.

`cat` is a commonly used \*nix command (unrelated to Perl) that prints the contents of file(s). You can read "`$ cat pm_1045452_file1.txt`" as "Here's the contents of the file `pm_1045452_file1.txt`:". This is entirely unrelated to the Perl code; it merely shows the data the Perl code is using (which, as stated, I had slightly modified).

[In case you didn't know, "\*nix" is just an umbrella term for any UNIX-like OS.]

-- Ken
Re^2: extract line by lallison (Novice) on Jul 21, 2013 at 17:46 UTC
What should `$2` refer to?
Re^3: extract line by kcott (Archbishop) on Jul 22, 2013 at 09:09 UTC
`$1`, `$2` (and `$3`, ..., etc.) hold the data that has been captured by a regular expression. This is fairly basic stuff, so perhaps a read of the perlretut - Perl regular expressions tutorial would be useful; its section "Extracting matches" has a better, and more detailed, description than my one sentence.

-- Ken
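As a small illustration of the capture variables described above, here is the same pattern used earlier in the thread, `^((\d+).+$)` with the `/s` modifier, applied to one sample line. Capture groups are numbered by the position of their opening parenthesis, so the outer group is `$1` and the inner `(\d+)` is `$2`. The helper name `part_and_line` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return (part number, whole line) for a line that
# starts with digits, or an empty list when the pattern does not match.
sub part_and_line {
    my ($line) = @_;
    if ($line =~ /^((\d+).+$)/s) {
        # $1: outer parentheses, the entire line (newline included, due to /s)
        # $2: inner parentheses, the leading run of digits
        return ($2, $1);
    }
    return;
}

my ($part, $whole) = part_and_line("3478749:AA:1D:AAA:DescriptionsY:C:2\n");
print "part number (\$2): $part\n";
print "whole line  (\$1): $whole";
```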
Re: extract line by poj (Abbot) on Jul 20, 2013 at 18:27 UTC
If you are curious to know why your script fails, then perhaps this explains it. Essentially you are matching a file1 value with a substring extracted from the same value.

```perl
while (my $line = <FILE1>) {

  # this sets $filerecord[0] to a file1 value
  my @filerecord = $line;

  for (@goodarray){
    my $data = $_;

    # this sets $arrfields[0] only to a file2 value
    my @arrfields = $data;

    # this removes line ending from $arrfields[0]
    # which holds the file2 value
    chomp (@arrfields);

    # this trims ${$filedata}[0] which refers
    # to $filerecord[0] the file1 value
    my $filedata =\@filerecord;
    ${$filedata}[0] =~ s/^\s+|\s+$//g;

    # this trims the file2 value
    $arrfields[0] =~ s/^\s+|\s+$|-//g;
    #$arrfields[0] =~ s/-//g;

    # this sets $string to trimmed file1 value
    my $string = ${$filedata}[0];

    # this extracts number from start of file1 value
    if ($string =~ /(^\d{7,8})/ ){

      # $substr holds number from file1
      my $substr = $1;

      # this matches file1 value to the number
      # extracted from file1 value
      # so will always match
      if (index($string, $substr) !=-1){
        #print "$string\n";
        #last;
      }

      # this would work matching file1 value to file2 value
      if (index($string, $arrfields[0]) != -1){
        print "$string\n";
        last;
      }
    }
  }
}
```

poj
Re^2: extract line by lallison (Novice) on Jul 20, 2013 at 22:52 UTC
Printing the string this way still gives me all the lines from file1. I even changed your line `if (index($string, $arrfields[0]) != -1){ print "$string\n"; last` to `if (index(${$filedata}[0], $arrfields[0]) != -1){ print "${$filedata}[0]\n"; last`. It is not making the connection to grab the line with that part. It prints them all up to the number that ends the `$arrfields[0]`. This file has over 1 million lines, so I cannot use one-liners.
Re^3: extract line by poj (Abbot) on Jul 21, 2013 at 07:35 UTC
Without seeing all your code and an example of the data set that is failing to connect, it is difficult for me to explain it. However, this line `if ($string =~ /(^\d{7,8})/ )` suggests you have both 7- and 8-digit numbers, in which case using `index` will give you incorrect results. For example, `1234567` will match numbers `11234567,21234567, etc` as well as `12345670,12345671, etc`. You could use an exact match

```perl
if ($substr eq $arrfields[0]){
  print "$string\n";
  last;
}
```

but if speed is important then I suggest you use one of the hash-based solutions other monks have provided, like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# start
my $t0 = time();

my $file1 = 'file1.txt';
my $file2 = 'file2.csv';
my $outfile = 'final_lines.txt';

# run once
#testdata();

my %file2=();
open FILE2, '<',$file2 or die "Could not open $file2 $!";
while (<FILE2>){
  s/[\r\n]//g;
  $file2{$_} = 1;
}
my $dur = time() - $t0;
print "$. records read from $file2 in $dur seconds\n";
close FILE2;

$t0 = time();
open OUTFILE,'>',$outfile or die "Could not open $outfile $!";
open FILE1, '<',$file1 or die "Could not open $file1 $!";
my $count_out=0;
while (<FILE1>){
  my ($id,undef) = split /:/;
  if (exists $file2{$id}){
    print OUTFILE $_;
    ++$count_out;
  }
}
$dur = time() - $t0;
print "$. records read from $file1 in $dur seconds\n";
close FILE1;
close OUTFILE;
print "$count_out records written to $outfile\n";

# some random data
sub testdata {
  my $count;
  my @char = ('A'..'Z','a'..'z','0'..'9');
  open OUT1,'>',$file1 or die "$file1 $!";
  open OUT2,'>',$file2 or die "$file2 $!";
  for (my $i=1_000_000;$i<=99_999_999;$i+=99){
    my @chars = map{ $char[int(rand(62))] }(1..60);
    my $line = ':'.(join '',@chars);
    print OUT1 ($i + int rand(99))."$line\n";
    print OUT2 ($i + int rand(99))."\n";
    ++$count;
  }
  close OUT1;
  close OUT2;
  print "$count records created in $file1 and $file2\n";
}
```

poj
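The substring pitfall described above, where a 7-digit number is found inside an 8-digit one, can be seen in a few lines. This is a minimal sketch using the example numbers from the post; the helper names `matches_by_index` and `matches_exactly` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers contrasting the two comparisons.
sub matches_by_index { my ($id, $wanted) = @_; return index($id, $wanted) != -1 }
sub matches_exactly  { my ($id, $wanted) = @_; return $id eq $wanted }

my $wanted = '1234567';
for my $id (qw(1234567 11234567 12345670 7654321)) {
    my $by_index = matches_by_index($id, $wanted) ? 'yes' : 'no';
    my $by_eq    = matches_exactly($id, $wanted)  ? 'yes' : 'no';
    print "$id  index: $by_index  eq: $by_eq\n";
}
```

Only the first number passes the `eq` test, while `index` also accepts `11234567` and `12345670` because both contain `1234567` as a substring. A hash lookup with `exists` behaves like `eq` in this respect, since it compares whole keys.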
Re: extract line by Laurent_R (Canon) on Jul 20, 2013 at 17:54 UTC
Yep, that's the right (and standard) way: open file2, load its content into a hash (each part number as a key to the hash), close file2; then scan file1 line by line, check if the part number is in the hash, and print the line if it is. This is also very fast, since lookup in a hash is fast. It breaks only if file2 is so huge that it will not fit into memory.
Re: extract line by Loops (Curate) on Jul 20, 2013 at 18:38 UTC
There is probably a shorter one-liner than this: `perl -F: -ane 'BEGIN {open K, "<file2"; $h{0+$_}=1 for <K>} print if exists $h{$F[0]}' file1`
Re^2: extract line by rjt (Curate) on Jul 20, 2013 at 20:56 UTC
How about this? `perl -pe'open _,file2;0+$_~~[<_>]or$_=""' file1` 47 chars, versus 91 (84, removing unnecessary whitespace). This one's definitely "just for fun", though, due to now-experimental smartmatch and re-reading of file2.