fieroboom has asked for the wisdom of the Perl Monks concerning the following question:

Hello wonderful monks, I've gleaned LOADS of great information here on my journey of Perl, but I have a question of my own now... I have a script that (among many other things) reads in a file with a list of words, then creates a regex string from those words. Here is the code I have now, which is working fine, but I wonder if there's a more direct way to do this, rather than filehandle -> array -> join & map array to scalar...

my $blacklist_file = 'BlacklistWords.txt';
open(BLIST, "<$blacklist_file") or die("Can't open $blacklist_file for reading!!\n\n");
my @blacklist_words = <BLIST>;
close(BLIST);
chomp(@blacklist_words);
my $blacklist_regex = join "|" => map {"(?:$_)"} @blacklist_words; # Create a regex from blacklisted words
print "blacklist regex:\n$blacklist_regex\n\n";
exit;

Here is an example of the regex string I'm after:

blacklist regex:
(?:LOL)|(?:XviD-RUBY)|(?:WEB-DL)|(?:H264)|(?:BluRay)|(?:x264)|(?:YIFY)|(?:DVDRip)|(?:MP3)|(?:ENG)|(?:DvDripaXXo)|(?:BRRiP)|(?:XviD)|(?:AbSurdiTy)|(?:WEBRip)|(?:XviDETRG)|(?:XviD-ILLUMINATI)|(?:XviDExtraTorrentRG)|(?:AC3-3LT0N)|(?:XViD-PLAYNOW)|(?:XVIDSSB)|(?:XViD-SSB)|(?:BDRip)|(?:XviD-3LT0N)|(?:KillerRG)|(?:XviD-AMIABLE)|(?:x264-AVS720)|(?:XviD-NEUTRINO)|(?:3Li)|(?:DTS)|(?:x2643Li)|(?:GAZ)|(?:XviD-AWESOMENESS)|(?:XviDSCREAM)|(?:UnKnOwN)|(?:DVDRip_XviD)|(?:AZnTX)|(?:HDTV)|(?:x264LOL)|(?:ettv)|(?:R5)|(?:x264-LOL)|(?:PROPER)|(?:x264-2HD)|(?:XviD-AFG)|(?:x264-mSD)|(?:P2PDL)|(?:x264-DHD)|(?:PublicHD)|(?:x264-MiNDTHEGAP)|(?:hdtv-lol)|(?:xvid-xor)|(?:psychodrama)|(?:hdtv_xvid-fov)|(?:repack-lol)|(?:rerip)|(?:xvid-ctu)|(?:Lo-Fi)|(?:X264-DIMENSION)|(?:_evid)|(?:TorrentDay)|(?:XviD-MOMENTUM)

Basically just a list of non-capturing groups. Of course, I suppose I could make it a single non-capturing group for a little more efficiency, but that's another subject... Anyway, the question is, am I doing this the most PERLitically correct way, or is there a better way to go from <BLIST> to $blacklist_regex? Thanks so much!
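
For reference, the single non-capturing group mentioned above might look like the following. This is only a sketch: the sample words are made up and stand in for the contents of BlacklistWords.txt.

```perl
use strict;
use warnings;

# Hypothetical sample words; the real list comes from BlacklistWords.txt.
my @blacklist_words = qw(HDTV x264 PROPER);

# One non-capturing group wrapped around the whole alternation,
# instead of one group per word.
my $blacklist_regex = '(?:' . join('|', @blacklist_words) . ')';

print "$blacklist_regex\n";
```

One practical upside of the single group is that the string can be embedded in a larger pattern (for example between two `\b` anchors) without the alternation splitting the surrounding pattern.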

Re: Best approach to creating a regex from a filehandle
by choroba (Cardinal) on May 18, 2014 at 19:15 UTC
    You can probably avoid map by using
    '(?:' . join(')|(?:', @blacklist_words) . ')'

    But, if the "words" can contain non-alphabetical characters with special meaning in regexes, you might need to map quotemeta to each word.
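
A minimal sketch of that combination, assuming the words are meant to match literally. The sample words are made up for illustration and are not from the original blacklist.

```perl
use strict;
use warnings;

# Hypothetical sample words containing regex metacharacters ('-', '.', '+').
my @blacklist_words = ('x264-LOL', 'AC3.5', 'C++');

# Escape every non-word character in each word with quotemeta,
# then wrap each escaped word in its own non-capturing group.
my $blacklist_regex =
    '(?:' . join(')|(?:', map { quotemeta($_) } @blacklist_words) . ')';

print "$blacklist_regex\n";
```

Without quotemeta, a word like `C++` would fail to compile as a regex (nested quantifiers), and the dot in `AC3.5` would match any character rather than a literal dot.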

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Best approach to creating a regex from a filehandle
by NetWallah (Canon) on May 18, 2014 at 21:06 UTC
    Re interpreting " is there a better way ..."

    If all you are checking for is the absence of a word in a blacklist, I'd suggest putting the blacklisted words into a hash and simply checking:

    if ( exists $Black_List{$candidate_word} ) {
        # complain, bail, or whatever ...
    }

    You could upper- or lower-case the candidate word to keep it in canonical form.
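
A rough sketch of the hash approach, assuming lower case is chosen as the canonical form. The sample words and candidates are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample words standing in for the contents of BlacklistWords.txt.
my @blacklist_words = qw(HDTV XviD PROPER);

# Store the lower-cased words as hash keys so lookups are case-insensitive.
my %Black_List = map { ( lc($_) => 1 ) } @blacklist_words;

for my $candidate_word (qw(hdtv Sample xvid)) {
    # Lower-case the candidate too, so both sides use the same canonical form.
    if ( exists $Black_List{ lc $candidate_word } ) {
        print "$candidate_word: blacklisted\n";
    }
    else {
        print "$candidate_word: ok\n";
    }
}
```

Note that a hash lookup only matches whole words exactly; unlike the regex, it will not find a blacklisted word inside a longer string.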

            What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?
                  -Larry Wall, 1992

Re: Best approach to creating a regex from a filehandle
by toolic (Bishop) on May 18, 2014 at 19:15 UTC
    There is no need for the array:
    my $blacklist_regex = join "|" => map {"(?:$_)"} map { chomp; $_ } <BLIST>;

      Unless you're sure that the input file will only contain safe characters, you should call quotemeta on $_ inside the map.
      Also, what's the benefit of chaining two maps like that, instead of combining them into one?

      Edit: Oops, I only just now noticed that choroba already mentioned quotemeta in his answer. Sorry for the redundancy.

        You're right... it can be simplified using a single map
        my $blacklist_regex = join "|" => map { chomp; "(?:$_)"} <BLIST>;
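
Putting the suggestions in this subthread together, a single map could also apply quotemeta. The following is a sketch only: it uses an in-memory filehandle with made-up sample data in place of BlacklistWords.txt, and a lexical filehandle `$blist` instead of the bareword `BLIST`.

```perl
use strict;
use warnings;

# In-memory filehandle standing in for BlacklistWords.txt (sample data is made up).
my $sample = "WEB-DL\nx264\nLo-Fi\n";
open(my $blist, '<', \$sample) or die "Can't open in-memory file: $!";

# One map does all the per-line work: strip the newline, escape
# metacharacters, and wrap the word in a non-capturing group.
my $blacklist_regex =
    join '|', map { chomp; '(?:' . quotemeta($_) . ')' } <$blist>;
close($blist);

print "blacklist regex:\n$blacklist_regex\n";

# Compile once (case-insensitively here, which is an assumption) and test a name.
my $re   = qr/$blacklist_regex/i;
my $name = 'Some.Movie.2014.web-dl.mkv';
print "$name is blacklisted\n" if $name =~ $re;
```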
Re: Best approach to creating a regex from a filehandle
by fieroboom (Novice) on May 21, 2014 at 12:12 UTC

    Perfect, I knew there was a simpler way! Toolic, I really like your second example; elegant, but still readable (at least in my mind, anyway). Thanks so much guys!

    EDIT: By the way, this is my first post here; if it's necessary for me to somehow mark this as "solved", I'll be happy to do so.