question about finding strings?

SaraBetsy has asked for the wisdom of the Perl Monks concerning the following question:

Re: question about finding strings? by roboticus (Chancellor) on Nov 28, 2017 at 00:27 UTC
SaraBetsy: Neither: it only reads from the file (the '<' part of the open statement). It won't change it or add anything to it. The print statement is sending what it finds to the standard output stream, which will normally write to the console.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
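To make the point above concrete, here is a minimal sketch. The temporary file and its two lines are invented stand-ins for the real data file, and the search word `tryna` is borrowed from the other replies. It opens the file with `'<'`, prints the matching lines to standard output, and compares the file size before and after.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Throwaway sample file (hypothetical; stands in for the real data file).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "I am tryna learn Perl\nnothing to see here\n";
close $tmp_fh or die "cannot close $path: $!";

my $size_before = -s $path;

# '<' opens the file for reading only, so nothing below can alter it.
open my $in, '<', $path or die "cannot open $path: $!";
my @hits;
while ( my $line = <$in> ) {
    push @hits, $line if $line =~ /tryna/i;
}
close $in;

print @hits;                 # sent to STDOUT, which is normally the console

my $size_after = -s $path;   # the input file has the same size as before
```

Running the script as `perl script.pl > results.txt` (file names are placeholders) would send the same output to a file instead of the console, still without touching the input.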
Re: question about finding strings? by NetWallah (Canon) on Nov 28, 2017 at 05:10 UTC
The modifiers should be OUTSIDE the regex, as in: `m/ ( .{0,25} $string.{0,25} ) /gisx` And no - it will not modify your original file.

All power corrupts, but we need electricity.
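A small sketch of why the placement matters. The sample text is made up, and the "flags inside" pattern is only a guess at what such a mistake might look like, since the original code is not shown in this thread. `/g` is left off because only the first match is wanted here.

```perl
use strict;
use warnings;

my $string = 'tryna';
my $text   = "well I was TRYNA find\nsome context here";

# Flags after the closing delimiter: /i ignores case, /s lets . match "\n",
# /x makes the literal spaces inside the pattern insignificant.
my ($with_flags) = $text =~ m/ ( .{0,25} $string .{0,25} ) /isx;

# Letters typed inside the delimiters are ordinary pattern text, so this
# looks for the literal string "trynagisx" and finds nothing.
my ($flags_inside) = $text =~ m/(.{0,25}${string}gisx.{0,25})/;

print defined $with_flags   ? "matched: [$with_flags]\n"   : "no match\n";
print defined $flags_inside ? "matched: [$flags_inside]\n" : "no match\n";
```

The first pattern matches across the newline because of `/s`; the second fails because the text never contains "trynagisx".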
Re: question about finding strings (regexes and slurping files) by Discipulus (Canon) on Nov 28, 2017 at 08:21 UTC
Hello SaraBetsy and welcome to the monastery and to the wonderful world of Perl! You already got some good advice, so I just want to clarify a few things.

> should grab the 25 characters before and after it

That is not what the regex you posted is supposed to do: it grabs from 0 to 25 chars before and after the string. As already said, the `gsix` modifiers must go outside the regular expression: `m/.../gsix`

Let's use your regex to match 0-3 chars before and after the letter `X` using `/.{0,3}X.{0,3}/` against some strings:

    # regex /.{0,3}X.{0,3}/
    #
    # string    matched part
    123X123     123X123
    12X123      12X123
    1X123       1X123
    X123        X123
    X123456     X123

And now compare the different output of the `/.{3}X.{3}/` regex against the same set of strings:

    # regex /.{3}X.{3}/
    #
    # string    matched part
    123X123     123X123
    12X123      -no match-
    1X123       -no match-
    X123        -no match-
    X123456     -no match-

In fact the second version requires exactly 3 chars immediately before and after `X`, so a string with fewer than 3 chars on either side cannot match.

Now a little note about slurping files. When you do it, the whole file goes directly into memory, probably with some overhead, so 100 MB of file data will use at least 100 MB of RAM. As you will be working in bioinformatics with possibly big files, it's better to understand this early. If you process the file one line at a time, the memory consumption is minimal. The diamond operator `<>` is a powerful beast in Perl and, like many other things in Perl, it acts differently depending on the context it is used in.

    # open my $fh, '<', $file_path or die "unable to read $file_path"

    # list context: every line goes in the array
    my @all_lines = <$fh>;

    # scalar context: just next line goes into a scalar (<> acts as an iterator here)
    my $line = <$fh>;

    # so to read a file one line at time:
    while (defined( my $line = <$fh> )) {
        # process $line here
    }

See How to read in large files

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
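The two tables above can be regenerated with a short script. This is only a sketch; `matched_part` is a helper name invented here, not something from the original post.

```perl
use strict;
use warnings;

my @strings = qw(123X123 12X123 1X123 X123 X123456);

# Hypothetical helper: return the part of $str matched by $regex,
# or '-no match-' when the regex fails.
sub matched_part {
    my ($regex, $str) = @_;
    return $str =~ /($regex)/ ? $1 : '-no match-';
}

my %seen;    # string => [ result for {0,3}, result for {3} ]
for my $str (@strings) {
    my $loose = matched_part(qr/.{0,3}X.{0,3}/, $str);
    my $exact = matched_part(qr/.{3}X.{3}/,     $str);
    $seen{$str} = [ $loose, $exact ];
    printf "%-8s  %-10s  %s\n", $str, $loose, $exact;
}
```

Each output row shows the string, what `{0,3}` matched, and what `{3}` matched, so the two quantifiers can be compared side by side.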
Re^2: question about finding strings (regexes and slurping files) by ww (Archbishop) on Nov 28, 2017 at 16:18 UTC
First, EXCELLENT POINTS in Discipulus' reply above. Refresher: OP wants to capture a specific word and some words around it (no definition of why) in any lines of a moderately large text file which contain the specific word -- for some sort of corpus analysis.

Now, for another approach to the regex, we can define `$word` (to capture the 'nearby words' OP wanted) in terms of alpha-content rather than by character-counting. (As done here, we must insert single spaces between the words in the regex itself but could have included most of them in the definition of `$word`):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use 5.024;

    # 1204380

    my $string = 'tryna';
    my $word = qr /[a-z]+/i;              # any word comprised solely of letters a-z, UC or LC
                                          # (the spaces are supplied in the regex below)
    my (@slurp, $line, @found, $found);   # declare vars; bad practice to do as globals, but simpler to read

    @slurp = <DATA>;                      # read each line of __DATA__ into array @slurp

    for $line(@slurp) {                   # read thru array @slurp line by line
        if ( $line =~ /($word\s$word\s$string\s$word\s$word\s)/gix ) {
                                          # Match only if there are two $word instances before $string and a
                                          # space following the second $word after $string
            push @found, "\tmatch: $1";   # when the regex above matches, push the match (+ a visual marker) to @found
            print "full original line with a match: $line\n";
        }
    }

    for $found(@found) {
        say $found;
    }

    __DATA__
    123 abcde this sentence has foo bar tryna much too long for my taste CONTAINS MATCH
    this doesn't have the magic phrase 123456 7890 abcd3e fc.
    much too long for my taste but tryno tryna foo bar baz CONTAINS MATCH
    work was put into the tryna document which shows good work CONTAINS MATCH
    problems with our out of town and other tryna that never show up CONTAINS MATCH
    Tryna fill to gully and TRYNA upside of big Pine CONTAINS MATCH TWICE BUT ...
    ...FAILS ON Ln 15 BECUZ THERE IS NO $word NOR ANY SPACE ...
    ...PRECEEDING THE FIRST INSTANCE OF $string!
    no searchstring here
    endit

And here is the output (the full lines are redundant to OP's stated needs but are included for clarity):

    F:\PMonks\>1204380.pl
    full original line with a match: 123 abcde this sentence has foo bar tryna much too long for my taste CONTAINS MATCH
    full original line with a match: much too long for my taste but tryno tryna foo bar baz CONTAINS MATCH
    full original line with a match: work was put into the tryna document which shows good work CONTAINS MATCH
    full original line with a match: problems with our out of town and other tryna that never show up CONTAINS MATCH
    full original line with a match: Tryna fill to gully and TRYNA upside of big Pine CONTAINS MATCH TWICE BUT ...
            match: foo bar tryna much too
            match: but tryno tryna foo bar
            match: into the tryna document which
            match: and other tryna that never
            match: gully and TRYNA upside of
    F:\PMonks>

Spirit of the Monastery
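As the sample data notes, the fixed "two words before, two words after" pattern misses a search term that starts a line. One possible variation, offered as a sketch rather than as the approach used above, makes the neighbouring words optional with `{0,2}` and loops with `/g` so that every occurrence in a line is reported. The name `$context` and the use of `\b` and `\Q...\E` are additions for illustration.

```perl
use strict;
use warnings;

my $string = 'tryna';
my $word   = qr/[a-z]+/i;

# Up to two words before and up to two words after the search term.
# \b stops the term from matching inside a longer word; \Q...\E quotes
# any regex metacharacters the search term might contain.
my $context = qr/((?:$word\s+){0,2}\b\Q$string\E\b(?:\s+$word){0,2})/i;

my $line = 'Tryna fill to gully and TRYNA upside of big Pine';

my @found;
while ( $line =~ /$context/g ) {    # /g in a while loop walks every match
    push @found, $1;
}

print "\tmatch: $_\n" for @found;
```

On this sample line the loop picks up both the leading `Tryna` (with no words before it) and the later `TRYNA` with two words on each side.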
Re: question about finding strings? by ww (Archbishop) on Nov 28, 2017 at 01:31 UTC
Since my chow break broke our conversation, for starters: I erred on the CB; I don't much care for the code you snagged from Q&A, except the regex, which may be useful to you if you have long lines.

That said, the code below, modified from the same Q&A node (code example #4), will be a starting point for you. Please come back and edit your OP with a small sample of data (inside code tags). Since you didn't supply that, I've dummied up a small text and have used it in the `__DATA__` section below. That section is a stand-in for your datafile; you won't do it that way, given the size of your data files, but as an example it'll serve (I hope).

If your post-division data files are something like 69MB (as I think you mentioned), you'll need a lot of RAM or you'll need to read in groups of lines (segments) or even single lines. Alternately, Re: Array size too big? may spark some ideas about the file splitting you've already done.

Also tell us why the search term itself is inadequate; i.e., why you want some words around it.

    #!/usr/bin/perl
    use strict;                        # using strict and warnings will help you spot typos and other guff
    use warnings;
    use 5.024;

    # 1204380

    my $string = 'tryna';              # you didn't give us an example of the data so using this
    my (@slurp, $line, @found);        # declare vars; bad form to do as globals, but simpler to read

    @slurp = <DATA>;                   # read each line of __DATA__ into array @slurp

    for $line(@slurp) {                # read thru array @slurp line by line
        if( $line =~ m/($string)/gs ) {    # if the current line contains $string
            push @found, $line;            # and push it to array, @found
        }
    }
    say @found;                        # ... whose elements get printed to console, here.
                                       # you can redirect the script's output to a file or write a
                                       # few more lines here to have the script write it to a file
                                       # or, of course, you can use any one of many methods to
                                       # catch JUST the searchstring and surrounding words (why?)

    # NB: this will cause a warning about an uninitialized var in Line 5;
    # simple enuf but not immediately an issue for OP and, IMO, adding another loop
    # will just be confusing at this time.

    __DATA__
    123456 7890 abcd3e fc this sentence has for bar tryna much too long for my taste
    this doesn't have the magic phrase 123456 7890 abcd3e fc.
    much too long for my taste but tryno tryna foo bar baz bat bingo and has the magic phrase
    endit

Spirit of the Monastery
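For the real, larger files, the "single lines" option and the "write it to a file" option mentioned above could be combined roughly as follows. This is a sketch under assumptions: the two temporary files stand in for the actual input and output paths, and the three sample lines are adapted from the dummy data.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-ins for the real input and output files.
my ($in_tmp, $in_path) = tempfile(UNLINK => 1);
print {$in_tmp} "this sentence has foo bar tryna much too long\n",
                "this doesn't have the magic phrase\n",
                "but tryno tryna foo bar baz\n";
close $in_tmp or die "cannot close $in_path: $!";

my ($out_tmp, $out_path) = tempfile(UNLINK => 1);
close $out_tmp or die "cannot close $out_path: $!";

my $string = 'tryna';

open my $in,  '<', $in_path  or die "cannot read $in_path: $!";
open my $out, '>', $out_path or die "cannot write $out_path: $!";

my $count = 0;
while ( my $line = <$in> ) {                    # only one line is held in memory at a time
    next unless $line =~ /\b\Q$string\E\b/i;    # skip lines without the search term
    print {$out} $line;                         # matching lines go to the output file
    $count++;
}
close $in;
close $out or die "cannot close $out_path: $!";

print "$count matching line(s) written to $out_path\n";
```

Because nothing is accumulated in an array, memory use stays flat however large the input file is.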
Re: question about finding strings? by karlgoethebier (Abbot) on Nov 28, 2017 at 18:09 UTC
"...thoughts on how to test..."

Consider Test::More:

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use Test::More tests => 3;
    use feature qw(say);

    my $string = q(Lorem ipsum kizuaheli);

    say $string;

    $string =~ m/^.+(.{3})(ipsum)(.{3}).+$/;

    ok( $1 eq q(em ), qq(\$1: $1) );
    ok( $2 eq q(ipsum), qq(\$2: $2) );
    ok( $3 eq q( ki), qq(\$3: $3) );

    __END__

    karls-mac-mini:monk karl$ ./tryna.pl
    1..3
    Lorem ipsum kizuaheli
    ok 1 - $1: em
    ok 2 - $2: ipsum
    ok 3 - $3:  ki

The more test-driven monks may forgive me.

Best regards, Karl

«The Crux of the Biscuit is the Apostrophe»

`perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'` Help
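A possible variation on the test above, as a sketch only: wrapping the match in a small function (`context_of` is a name made up here) lets the test also cover the "no match" case, and `is_deeply` / `is` print the expected and actual values when a test fails, which `ok` with `eq` does not.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical helper: return ($before, $hit, $after) for the first match,
# or an empty list when $string does not occur with $n chars on each side.
sub context_of {
    my ($text, $string, $n) = @_;
    return $text =~ /(.{$n})(\Q$string\E)(.{$n})/s;
}

my @got = context_of('Lorem ipsum kizuaheli', 'ipsum', 3);
is_deeply( \@got, [ 'em ', 'ipsum', ' ki' ], 'three chars of context on each side' );

my @none = context_of('ipsum', 'ipsum', 3);
is( scalar @none, 0, 'no match when the context is missing' );

done_testing();
```

`done_testing()` replaces the fixed `tests => 3` plan, so tests can be added without updating a count.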
Re: question about finding strings? by SaraBetsy (Initiate) on Dec 02, 2017 at 20:00 UTC
Oh. My. Goodness. Y'all are some awesome people! Thank you so much for all the assistance! It's going to take me a bit to sit down and go through all of this, but I'm definitely making learning more about Perl my winter-break project.