Re: Finding Line numbers in a file

by kyle (Abbot)
on Apr 04, 2007 at 14:52 UTC


in reply to Finding Line numbers in a file

According to my Camel, "each time a pattern successfully matches (including the pattern in a substitution), it sets the $`, $&, and $' variables to the text left of the match, the whole match, and the text right of the match."

That sounds useful.

my $text = <<'END_OF_TEXT';
line 1 apple
banana line 2
line cherry 3
END_OF_TEXT

while ( $text =~ m/(apple|banana|cherry)/ig ) {
    my $word = $1;
    # count the newlines in the text to the left of the match
    my $prelines = ( $` =~ tr/\n// );
    printf qq{Word "%s" found on line %d\n}, $word, $prelines + 1;
}

__END__
Word "apple" found on line 1
Word "banana" found on line 2
Word "cherry" found on line 3

If you use English, the $` variable is called $PREMATCH (see perlvar, which notes that using this variable "imposes a considerable performance penalty on all regular expression matches").
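For what it's worth, Perl 5.10 and later offer a per-match alternative: with the /p flag, ${^PREMATCH} holds the same text as $` but is populated only for matches that request it. The following is a minimal sketch of the same loop under that assumption; the sample text and the @found array are made up for illustration.

```perl
use strict;
use warnings;
use 5.010;    # the /p flag and ${^PREMATCH} need Perl 5.10 or later

# Sample text (an assumption for illustration)
my $text = "line 1 apple\nbanana line 2\nline cherry 3\n";
my @found;    # collects [word, line number] pairs

while ( $text =~ m/(apple|banana|cherry)/igp ) {
    my $word = $1;
    # ${^PREMATCH} is everything left of the match, like $`
    my $pre      = ${^PREMATCH};
    my $prelines = ( $pre =~ tr/\n// );    # tr without a replacement just counts
    push @found, [ $word, $prelines + 1 ];
}

printf qq{Word "%s" found on line %d\n}, @$_ for @found;
```

This still rescans the whole prefix on every hit, so it only addresses the global penalty of $`, not the repeated counting.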

Re^2: Finding Line numbers in a file
by reasonablekeith (Deacon) on Apr 04, 2007 at 15:29 UTC
    Nice, but inefficient, and gets worse the bigger the text file is.

    Do not do this; use the other approaches. Their cost grows linearly with the size of the text file, and they do not require the entire file to be loaded into memory.
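The other replies are not reproduced here, so the following is only a guess at the kind of approach meant: a sketch that reads a filehandle one line at a time and takes the line number from $. instead of counting newlines. The helper name find_words_by_line and the in-memory sample text are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [line_number, word] pairs for every match,
# reading the handle one line at a time instead of slurping the file.
sub find_words_by_line {
    my ( $fh, $pattern ) = @_;
    my @hits;
    while ( my $line = <$fh> ) {
        # $. is the line number of the most recently read filehandle
        while ( $line =~ /($pattern)/g ) {
            push @hits, [ $., $1 ];
        }
    }
    return @hits;
}

# Demo on an in-memory filehandle so the sketch is self-contained.
my $text = "line 1 apple\nbanana line 2\nline cherry 3\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
my @hits = find_words_by_line( $fh, qr/apple|banana|cherry/i );
close $fh;

printf qq{Word "%s" found on line %d\n}, $_->[1], $_->[0] for @hits;
```

Only one line is held in memory at a time, at the cost of not being able to match patterns that span lines.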

    ---
    my name's not Keith, and I'm not reasonable.
      You are possibly right. You are just as possibly wrong. There are several things that we don't know, such as:
      • Average line length. Shorter lines mean more low-level iterations.
      • Average file length. Longer files will require more memory, but that is about all.
      • Average hit count. How often is the string found in the file?
      • Average hit placement. How often does the string end up at the beginning or the end?
      • Implementation issues. Is the string passed in already in one chunk, or do we have access to a file handle?
      There are just too many unknowns to use blanket statements as to which algorithm is best.

      But one major issue is that the special regex variables $`, $&, and $' shouldn't be used: they impose too much of a penalty. Instead, you can use @- and @+, which have no such penalty, as in the following:

      my $str = "1 one\n2 two\n3 one\n4 four\n5 one\n6 five";
      my $last_pos = 0;
      my $newlines = 1;
      while ($str =~ /(one)/g) {
          # $-[0] is the start offset of the match; count only the
          # newlines between the previous match and this one
          $newlines += substr($str, $last_pos, $-[0] - $last_pos) =~ tr/\n//;
          $last_pos = $-[0];
          print "Found on line $newlines\n";
      }
      # prints
      # Found on line 1
      # Found on line 3
      # Found on line 5


      Notice the optimization that only counts newlines from the previous match.

      my @a=qw(random brilliant braindead); print $a[rand(@a)];
      Dear kyle and reasonablekeith,
      Thanks for suggestion and warning also. This is making me think in new directions.
