Nico has asked for the wisdom of the Perl Monks concerning the following question:

Hello again everyone!

I always end up finding myself coming straight back to PerlMonks when I'm stumped because everyone is always so helpful!

Right now I'm attempting to analyze a text file. The file (example below) contains multiple "sections" divided by blank lines (extra "\n" characters). These sections contain a bunch of data I don't care about, but there are a few lines that I need to grab. Here is where my problem lies.

Example Text File

First Name: John
Last Name: Doe
Occupation: Network Administrator
Location: West Coast

First Name: Jane
Last Name: Doe
Occupation: Human Resources
Location: East Coast

First Name: James
Last Name: Doe
Occupation: Technical Support Engineer
Location: Central USA

I have been trying to use regex to search for a string, for example "Central USA" and then use that to match the "First Name" and "Last Name" lines and CAPTURE their names.

I attempted to use a regex "lookbehind" but I can't do that since my capture has to be variable in length. I believe this is because I don't know the length of the first or last name and I have to account for that. I have been attempting to use http://regexstorm.net/tester to accomplish this.

When I don't use a lookbehind, my regex search picks up the first "First Name" and "Last Name" lines in the file, regardless of whether they are near where I matched the "Location" field. This makes sense, but I want it to grab the "First Name" and "Last Name" lines that came right before "Central USA".

Should I be going at this a different way?

Example Code

if ($line =~ /First Name:\s+([A-Za-z0-9 _ ( )]*).*?Last Name:\s+([A-Za-z0-9 _ ( )]*).*?Location: Central USA/s) {
    print $line;
}

As always, any help would be greatly appreciated!

Re: Storing String from Line Before Regex Match
by toolic (Bishop) on Mar 31, 2016 at 18:32 UTC
    A different approach is to read the file as records separated by a blank line. Store the data into a hash for each record, then print out only what you need. One benefit is that this method is independent of the order of the lines of the input.
    use warnings;
    use strict;

    $/ = "\n\n";
    while (<DATA>) {
        my %data;
        for my $line (split /\n/) {
            my ($k, $v) = split /\s*:\s*/, $line;
            $data{$k} = $v;
        }
        print "$data{'First Name'} $data{'Last Name'}\n"
            if $data{Location} eq 'Central USA';
    }

    __DATA__
    First Name: John
    Last Name: Doe
    Occupation: Network Administrator
    Location: West Coast

    First Name: Jane
    Last Name: Doe
    Occupation: Human Resources
    Location: East Coast

    First Name: James
    Last Name: Doe
    Occupation: Technical Support Engineer
    Location: Central USA
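    A possible variation on the record-based idea above, offered only as a sketch: setting $/ to the empty string switches Perl to "paragraph mode", where any run of blank lines counts as a single record separator. The in-memory file, the @matches array, and the trimmed sample data are assumptions made for this illustration; they are not part of the original reply.

```perl
use strict;
use warnings;

# Assumed sample data; note the two blank lines between the records.
my $input = <<'END';
First Name: John
Last Name: Doe
Location: West Coast


First Name: James
Last Name: Doe
Location: Central USA
END

# Read the string through a filehandle so the loop looks like file I/O.
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

my @matches;
{
    local $/ = '';    # paragraph mode: one or more blank lines end a record
    while (my $record = <$fh>) {
        my %data;
        for my $line (split /\n/, $record) {
            # Limit of 2 keeps any colons inside the value intact.
            my ($k, $v) = split /\s*:\s*/, $line, 2;
            $data{$k} = $v if defined $v;
        }
        push @matches, "$data{'First Name'} $data{'Last Name'}"
            if ($data{Location} // '') eq 'Central USA';
    }
}
close $fh;

print "$_\n" for @matches;    # prints "James Doe"
```

    Localizing $/ inside a bare block keeps the change from leaking into other reads in the same program.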
Re: Storing String from Line Before Regex Match
by haukex (Archbishop) on Mar 31, 2016 at 19:23 UTC

    Hi Nico,

    I like toolic's approach better, but in the spirit of TIMTOWTDI:

    my $re = qr/
        First\ Name: \s+ (.+)\n
        Last\ Name:  \s+ (.+)\n
        (?:.+\n)*
        Location:\ Central\ USA\n
    /x;
    while ($line=~/$re/g) {
        print "<$1> <$2>\n";
    }

    Note the use of the /x modifier to make the regex more readable. Also, I removed the /s modifier, so that the dot . doesn't match newlines.

    I think this approach is probably a little less robust than splitting the input file on empty lines, but if you're sure of the formatting of the input files this should still work.

    Hope that helps,
    -- Hauke D

Re: Storing String from Line Before Regex Match
by Laurent_R (Canon) on Apr 01, 2016 at 06:35 UTC
    You might simply read your data in a loop, capture the names as you go, and use these captures only when needed (when location matches "Central USA").
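    The loop described above might look something like the following minimal sketch. The names_for_location helper and the in-memory $sample string are hypothetical, introduced only for this illustration, and the code assumes the field order shown in the question (names before Location, one field per line).

```perl
use strict;
use warnings;

# Hypothetical helper: returns "First Last" for every record whose
# Location equals $wanted.
sub names_for_location {
    my ($text, $wanted) = @_;
    my ($first, $last, @found);
    for my $line (split /\n/, $text) {
        if    ($line =~ /^First Name:\s+(.+?)\s*$/) { $first = $1 }
        elsif ($line =~ /^Last Name:\s+(.+?)\s*$/)  { $last  = $1 }
        elsif ($line =~ /^Location:\s+(.+?)\s*$/) {
            # Use the remembered names only when the location matches.
            push @found, "$first $last"
                if $1 eq $wanted && defined $first && defined $last;
        }
        elsif ($line !~ /\S/) {
            # Blank line: a new record starts, so forget the old names.
            ($first, $last) = (undef, undef);
        }
    }
    return @found;
}

# Assumed sample data in the format from the question.
my $sample = join "\n",
    'First Name: John',  'Last Name: Doe',
    'Occupation: Network Administrator', 'Location: West Coast', '',
    'First Name: Jane',  'Last Name: Doe',
    'Occupation: Human Resources', 'Location: East Coast', '',
    'First Name: James', 'Last Name: Doe',
    'Occupation: Technical Support Engineer', 'Location: Central USA', '';

print "$_\n" for names_for_location($sample, 'Central USA');    # prints "James Doe"
```

    With a real file, the same branches would sit inside a while (my $line = <$fh>) loop, so the whole file never has to be held in memory.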
Re: Storing String from Line Before Regex Match
by tybalt89 (Monsignor) on Apr 01, 2017 at 20:57 UTC

    The .* lets your match cross over section boundaries.
    $stayinsection acts just like .* but will not allow crossing over your section boundary of "\n\n".
    Just a slightly advanced regex trick :)

    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1159215
    use strict;
    use warnings;

    my $stayinsection = qr/(?:(?!\n\n).)*/s;

    $_ = do { local $/; <DATA> };
    print "<$1> <$2>\n" while
      /First Name:\s+([A-Za-z0-9 _ ( )]*)${stayinsection}Last Name:\s+([A-Za-z0-9 _ ( )]*)${stayinsection}Location: Central USA/g;

    __DATA__
    First Name: John
    Last Name: Doe
    Occupation: Network Administrator
    Location: West Coast

    First Name: Jane
    Last Name: Doe
    Occupation: Human Resources
    Location: East Coast

    First Name: James
    Last Name: Doe
    Occupation: Technical Support Engineer
    Location: Central USA

    First Name: Jane
    Last Name: Doe
    Occupation: Human Resources
    Location: East Coast

    First Name: Another
    Last Name: Doe
    Occupation: Technical Support Engineer
    Location: Central USA
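    To see the difference in isolation, here is a small side-by-side sketch. It is not part of the original post: the trimmed sample text, the simplified (\w+) capture, and the variable names $loose, $tight and $stay are assumptions for illustration. It compares a plain lazy dot under /s with the (?:(?!\n\n).)* pattern used above.

```perl
use strict;
use warnings;

# Assumed two-record sample, separated by one blank line.
my $text = join "\n",
    'First Name: John',  'Last Name: Doe', 'Location: West Coast', '',
    'First Name: James', 'Last Name: Doe', 'Location: Central USA', '';

# Plain lazy dot with /s: free to run across the blank line, so the
# match starts at the first "First Name" in the text.
my ($loose) = $text =~ /First Name:\s+(\w+).*?Location: Central USA/s;

# Tempered dot: a character is consumed only if "\n\n" does not begin
# at that position, so the match cannot leave its own section.
my $stay = qr/(?:(?!\n\n).)*/s;
my ($tight) = $text =~ /First Name:\s+(\w+)${stay}Location: Central USA/;

print "lazy dot:     $loose\n";    # John
print "tempered dot: $tight\n";    # James
```

    The /s flag is stored inside the qr// object, so the outer regex does not need to repeat it.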

      I notice the character class [A-Za-z0-9 _ ( )] with two extra space characters. Just out of idle curiosity, is this done to enhance visual presentation/readability, or for some other reason?

      Update: Also: What is the purpose of the $searchfor string?


      Give a man a fish:  <%-{-{-{-<

        I just left the OP's character class code as it was. I don't know why he had multiple spaces in there.

        I had tried several different solutions before deciding to just modify the OP's regex. The $searchfor string is left over from an early test version, it should be removed. In fact, I think I'll go do that now. Thanks for the catch.