was6guy has asked for the wisdom of the Perl Monks concerning the following question:

I need some help pulling an occurrence of a string out of a file. The string can appear at different positions in the file, and it does not occur a set number of times. I'd like to pull the string out of the file and save it to another file, ignoring duplicate occurrences. Here's what the string looks like:

authDataAlias="cell-tstc-65_DM/userQ"

The string will always be: authDataAlias="*-*-*_DM/*

Here's an example of the line containing the string:

<factories xmi:type="resources.jdbc:CMPConnectorFactory" xmi:id="CMPConnectorFactory_1195273978412" name="dataSource" authMechanismPreference="BASIC_PASSWORD" authDataAlias="cell-tstc-65_DM/userQ" connectionDefinition="ConnectionDefinition_1054132487569" cmpDatasource="DataSource_1195273954323">

As an easy first step, I can pull the line containing the string. If I knew where the string would be in the line each time I could grab it, but since its position is random, I'm lost. Could someone help me expand on this:
#!/usr/bin/perl
my $data_file = '/home/resources.xml';
my $data_out = '/home/out.log';
open DATA, "$data_file" or die "can't open $data_file $!";
open DATA_OUT, ">>$data_out";
my @array_of_data = <DATA>;
foreach my $line (@array_of_data) {
    if ($line =~ m/authDataAlias=.*-.*-.*_DM/i) {
        print DATA_OUT "$line\n";
    }
}
close (DATA);
close (DATA_OUT);

Replies are listed 'Best First'.
Re: Need Some help with finding a word in a file
by jrsimmon (Hermit) on Nov 28, 2007 at 23:57 UTC
    You're 99% of the way there already! Use () and $1 in your match to pull out the data you need. Ex:
    #!/usr/bin/perl
    my $data_file = '/home/resources.xml';
    my $data_out = '/home/out.log';
    open DATA, "$data_file" or die "can't open $data_file $!";
    open DATA_OUT, ">>$data_out";
    my @array_of_data = <DATA>;
    my $match;
    foreach my $line (@array_of_data) {
        if ($line =~ m/authDataAlias=(.*-.*-.*_DM)/i) {
            $match = $1;
            print DATA_OUT "$line\n";
        }
    }
    close (DATA);
    close (DATA_OUT);
    The special variables $1, $2, etc. are set to the text matched inside parentheses when you use parentheses to enclose part of your regex. So $1 holds the match for the first (...), $2 the second, and so forth.
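To make the capture-group behaviour concrete, here is a minimal sketch that splits the alias into two captured parts. The helper name split_alias and the shortened sample line are assumptions made for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

# split_alias is a hypothetical helper name, not part of the original script.
# In list context a successful match returns the captured groups ($1, $2, ...);
# a failed match returns an empty list.
sub split_alias {
    my ($line) = @_;
    return $line =~ m/authDataAlias="([^"]*_DM)\/([^"]*)"/;
}

# Sample line shortened from the one in the question.
my $line = '<factories name="dataSource" authDataAlias="cell-tstc-65_DM/userQ" connectionDefinition="ConnectionDefinition_1054132487569">';

# The first group is the part before the slash, the second the part after it.
my ($prefix, $user) = split_alias($line);
print "prefix: $prefix\nuser: $user\n" if defined $prefix;
```

Printing the captured values rather than $line is what limits the output to the alias itself.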
      Thank you so much. Does anyone know how to ignore duplicates?
        Can you be a little more specific about which duplicates you wish to ignore? Do you expect to find duplicate words within the same line? Within the file but not within a single line?
        I can deal with the duplicates; that's not a big deal. But I think I need help with my regex. Some of the strings have an empty value and some have a value. I need the ones that look like this:

        authDataAlias="cell-tstc-65_DM/userQ"

        I'm only concerned with: cell-tstc-65_DM/userQ

        If I use this match, I get a blank line in the output file, because one of the authDataAlias strings is set to ="":
        ($line =~ m/authDataAlias=\"([^\"]*)\"/i)
        cell-tstc-65_DM/userQ
        cell-tstc-65_DM/user1
        If I use this match instead, it ignores the empty string but returns too much of the line:
        ($line =~ m/authDataAlias=(.*-.*-.*_DM\/.*)/i)
        "cell-tstc-65_DM/userQ" connectionDefinition="ConnectionDefinition_1054132487569" cmpDatasource="DataSource_1195273954323">
        "cell-tstc-65_DM/user1" relationalResourceAdapter="builtin_rra" statementCacheSize="10" datasourceHelperClassname="com.ibm.websphere.rsadapter.DB2UniversalDataStoreHelper">
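One possible way to get both behaviours at once (skip the empty value and stop at the closing quote) is to replace each .* with a negated character class. This is only a sketch: extract_alias is a hypothetical helper name, and the sample lines are shortened versions of the ones quoted above.

```perl
use strict;
use warnings;

# extract_alias is a hypothetical helper name used only for this sketch.
# [^"]+ needs at least one character, so an empty value never matches,
# and it cannot run past the closing quote the way a greedy .* does.
sub extract_alias {
    my ($line) = @_;
    return $line =~ m/authDataAlias="([^"]+-[^"]+-[^"]+_DM\/[^"]+)"/i ? $1 : undef;
}

my @lines = (
    '<factories authDataAlias="cell-tstc-65_DM/userQ" connectionDefinition="ConnectionDefinition_1054132487569">',
    '<factories authDataAlias="" relationalResourceAdapter="builtin_rra">',
    '<factories authDataAlias="cell-tstc-65_DM/user1" statementCacheSize="10">',
);

for my $line (@lines) {
    my $alias = extract_alias($line);
    print "$alias\n" if defined $alias;   # the empty alias is skipped
}
```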
Re: Need Some help with finding a word in a file
by thundergnat (Deacon) on Nov 29, 2007 at 20:54 UTC

    If you know what is directly before the information you need, try changing the input record separator. Any time you think "unique", you most likely will want a hash.

    #!/usr/bin/perl
    use warnings;
    use strict;

    $/ = 'authDataAlias=';

    my %no_dupes;

    foreach my $line (<DATA>) {
        if ($line =~ m/^"(.*?_DM\S+)"/i) {
            $no_dupes{$1} = 0;
        }
    }

    print "$_\n" for keys %no_dupes;

    __DATA__
    <factories xmi:type="resources.jdbc:CMPConnectorFactory" xmi:id="CMPConnectorFactory_1195273978412" name="dataSource" authMechanismPreference="BASIC_PASSWORD" authDataAlias="cell-tstc-65_DM/userQ" connectionDefinition="ConnectionDefinition_1054132487569" cmpDatasource="DataSource_1195273954323"><factories xmi:type="resources.jdbc:CMPConnectorFactory" xmi:id="CMPConnectorFactory_1195273978412" name="dataSource" authMechanismPreference="BASIC_PASSWORD" authDataAlias="cell-tstc-65_DM/userQ" connectionDefinition="ConnectionDefinition_1054132487569" cmpDatasource="DataSource_1195273954323"><factories xmi:type="resources.jdbc:CMPConnectorFactory" xmi:id="CMPConnectorFactory_1195273978412" name="dataSource" authMechanismPreference="BASIC_PASSWORD" authDataAlias="cell-tstc-65_DM/userF" connectionDefinition="ConnectionDefinition_1054132487569" cmpDatasource="DataSource_1195273954323">
    <factories xmi:type="resources.jdbc:CMPConnectorFactory" xmi:id="CMPConnectorFactory_1195273978412" name="dataSource" authMechanismPreference="BASIC_PASSWORD" authDataAlias="node-tstc-65_DM/userF" connectionDefinition="ConnectionDefinition_1054132487569" cmpDatasource="DataSource_1195273954323">
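As a variation on the hash idea, the sketch below leaves $/ alone and uses a global match to find every alias on a line, while a %seen hash keeps only the first occurrence of each one in the order it appeared. The helper name unique_aliases and the shortened sample lines are assumptions for illustration.

```perl
use strict;
use warnings;

# unique_aliases is a hypothetical helper name for this sketch.
# It returns each alias once, in the order it was first seen.
sub unique_aliases {
    my (@lines) = @_;
    my (%seen, @unique);
    for my $line (@lines) {
        # With /g in list context the match returns every capture on the line,
        # so several aliases on one line are all found.
        for my $alias ($line =~ m/authDataAlias="([^"]*_DM\/[^"]+)"/gi) {
            push @unique, $alias unless $seen{$alias}++;
        }
    }
    return @unique;
}

my @sample = (
    '<factories authDataAlias="cell-tstc-65_DM/userQ"><factories authDataAlias="cell-tstc-65_DM/userQ">',
    '<factories authDataAlias=""><factories authDataAlias="cell-tstc-65_DM/userF">',
    '<factories authDataAlias="node-tstc-65_DM/userF">',
);

print "$_\n" for unique_aliases(@sample);
```

Unlike keys %no_dupes, which returns the aliases in no particular order, the @unique array preserves first-seen order, which may matter if the output file is compared between runs.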
      I think I need help with my regex. Some of the strings have an empty value and some have a value. I need the ones that look like this:

      authDataAlias="cell-tstc-65_DM/userQ"

      I'm only concerned with: cell-tstc-65_DM/userQ

      If I use this match, I get a blank line in the output file, because one of the authDataAlias strings is set to ="":
      ($line =~ m/authDataAlias=\"([^\"]*)\"/i)
      cell-tstc-65_DM/userQ
      cell-tstc-65_DM/user1
      If I use this match instead, it ignores the empty string but returns too much of the line:
      ($line =~ m/authDataAlias=(.*-.*-.*_DM\/.*)/i)
      "cell-tstc-65_DM/userQ" connectionDefinition="ConnectionDefinition_1054132487569" cmpDatasource="DataSource_1195273954323">
      "cell-tstc-65_DM/user1" relationalResourceAdapter="builtin_rra" statementCacheSize="10" datasourceHelperClassname="com.ibm.websphere.rsadapter.DB2UniversalDataStoreHelper">
        Your example above works perfectly. THANK YOU!