Remove blank lines from REGEX output

Deep_Plaid has asked for the wisdom of the Perl Monks concerning the following question:

Please forgive my ignorance - I'm a bit rusty. I am using regex to parse a data file. I am taking an exclusion approach (i.e., removing lines I don't want), but no matter what I do, the output contains blank lines for the lines excluded. I have tried chomp (which gets rid of all eol, which I don't want) and trimming (both commented out below), but to no avail. My exclusion criteria is working, but that may be part of the problem. Here's the code:

use strict;
use diagnostics;
use warnings;

# Define label variable

my $build = $ARGV[0];

die "\nScript aborted. Study code is missing as an argument.\n\nUsage: $0 {STUDYCODE}\n" if @ARGV == 0;
      
open (LABELS, 'labels.txt') || die "Can't open labels.txt \n$!\n";
while (<LABELS>)
    {
    my $data_line = $_;
    # 
    if ($data_line =~ /$build/)
        {
        # Remove lines beginning with "bl", "BL" or "_" and releases with a label with xx.yy.zzz.nnn(n)
        $data_line =~ s/^[BbLl].*|^_.*|^.*\d\d\d\.\d\d\d.*//g;
        #chomp ($data_line); #Remove end of lines
        #$data_line =~ tr/\n//s; # Still have blank lines I need to remove
        print "$data_line";    
        }
    }
close LABELS;
exit;

Here is the input data I have in the file labels.txt:

_STUDYABCD1234_1.00
_STUDYABCD1234_1.00.5678
STUDYABCD1234_1.00
STUDYABCD1234_1.00.000
p_STUDYABCD1234_1.00.000
p_STUDYABCD1234_1.00.000.5678
bl_STUDYABCD1234_1.00.000
bl_STUDYABCD1234_1.00.000.5678
BL_STUDYABCD1234_1.00.000
BL_STUDYABCD1234_1.00.000.5678

There are ten lines in the input file, and the output also contains 10 lines, three of which show properly, and 7 blank lines that are the excluded lines. Any help is appreciated. Thanks.


Replies are listed 'Best First'.
Re: Remove blank lines from REGEX output by Kenosis (Priest) on Feb 05, 2014 at 20:40 UTC
If I'm understanding the specs in a comment and your regex, perhaps the following will be helpful:

use strict;
use warnings;

# Remove (exclude) lines beginning with "bl", "BL" or "_" and releases with a label with xx.yy.zzz.nnn(n)
while (<DATA>) {
    print unless /^(?:bl|_)|\.\d{3}\.\d{3,}$/i;
}

__DATA__
_STUDYABCD1234_1.00
_STUDYABCD1234_1.00.5678
STUDYABCD1234_1.00
STUDYABCD1234_1.00.000
p_STUDYABCD1234_1.00.000
p_STUDYABCD1234_1.00.000.5678
bl_STUDYABCD1234_1.00.000
bl_STUDYABCD1234_1.00.000.5678
BL_STUDYABCD1234_1.00.000
BL_STUDYABCD1234_1.00.000.5678

Output:

STUDYABCD1234_1.00
STUDYABCD1234_1.00.000
p_STUDYABCD1234_1.00.000

The regex:

/^(?:bl|_)|\.\d{3}\.\d{3,}$/i
 ^   ^    ^^ ^    ^ ^     ^ ^
 |   |    || |    | |     | |
 |   |    || |    | |     | + - Case-insensitive
 |   |    || |    | |     + - From the end of the string
 |   |    || |    | + - Three or more digits
 |   |    || |    + - A decimal point
 |   |    || + - Three digits
 |   |    |+ - A decimal point
 |   |    + - OR
 |   + - Match either "bl" or "_"
 + - From the beginning of the string

The script just skips (excludes) those lines that you don't want, instead of making them blank.
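A side note on why the alternation `(?:bl|_)` is more than a restyling of the original `[BbLl]`: a character class matches a single character from the set, so `^[BbLl]` would also exclude any line that merely starts with `L` or `b`. A minimal sketch, assuming two made-up labels (`Lab_...` and `beta_...`) that do not appear in the original data:

```perl
use strict;
use warnings;

# Hypothetical labels, invented for illustration; the last two start with
# "L" or "b" but not with "bl".
my @labels = qw(bl_STUDY_1.00 BL_STUDY_1.00 Lab_STUDY_1.00 beta_STUDY_1.00);

# Character class: exactly one character out of B, b, L, l.
my @by_class = grep { /^[BbLl]/ } @labels;

# Alternation-style prefix with /i: the two characters "bl" in any case.
my @by_prefix = grep { /^bl/i } @labels;

print "class:  @by_class\n";     # all four labels match
print "prefix: @by_prefix\n";    # only bl_... and BL_... match
```

With the input shown in the question the two patterns happen to agree, so the difference only shows up once other labels appear in the file.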
Re^2: Remove blank lines from REGEX output by Deep_Plaid (Acolyte) on Feb 05, 2014 at 21:08 UTC
Thanks for your fast and thoughtful reply, Kenosis. I really appreciate the way you broke down the regex line with an explanation - it was very helpful. I've added this code and it's working. I'm still not quite sure why my original attempts didn't work - I couldn't get them to work without .*, thinking I needed a wildcard to account for the rest of the string, but it makes sense that this approach was leaving in \n. Thanks again.
Re^3: Remove blank lines from REGEX output by Kenosis (Priest) on Feb 05, 2014 at 21:16 UTC
You're most welcome, Deep_Plaid! Am glad it worked for you.
Re: Remove blank lines from REGEX output by Anonymous Monk on Feb 05, 2014 at 20:35 UTC
The final `.*` doesn't match the `\n` at the end of `$data_line`, so that doesn't get removed (perlre). One approach you can use is to add the `/s` modifier to the regex, so that `.` matches the newline. Obviously TMTOWTDI - personally I'd just make the printing of the lines conditional: `print $data_line unless $data_line=~/^[BbLl].*|^_.*|^.*\d\d\d\.\d\d\d.*/;`
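To see the leftover newline directly, here is a small sketch that runs the substitution from the question on one of its input lines, once without and once with `/s`. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = "bl_STUDYABCD1234_1.00.000\n";

# Without /s, "." stops at the newline, so the "\n" survives the substitution.
(my $without_s = $line) =~ s/^[BbLl].*|^_.*|^.*\d\d\d\.\d\d\d.*//g;

# With /s, "." also matches "\n", so the whole line is consumed.
(my $with_s = $line) =~ s/^[BbLl].*|^_.*|^.*\d\d\d\.\d\d\d.*//gs;

printf "without /s: %d character(s) left\n", length($without_s);   # 1, the "\n"
printf "with /s:    %d character(s) left\n", length($with_s);      # 0
```

That single remaining `"\n"` is what gets printed as a blank line for every excluded label.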
Re^2: Remove blank lines from REGEX output by Deep_Plaid (Acolyte) on Feb 05, 2014 at 21:02 UTC
Thanks for your help and fast reply! That does make sense.
Re: Remove blank lines from REGEX output by sundialsvc4 (Abbot) on Feb 05, 2014 at 21:13 UTC
FYI, if you only want such a loop to consider certain lines within the file, a rather handy “Perl-ism” would be a line that looks like this: `next unless $data_line =~ ...whatever... ;` ... followed by whatever else you might need in the loop. The loop will immediately proceed to the `next` line, `unless` the line matches the particular condition, and thus you avoid having to write a rather large and bulky `if`-statement that encompasses the rest of the `while` block. Plus, it reads easily ... it’s a very human-natural way of saying it.
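As a rough sketch of how that might look for the loop in the question: the helper name `wanted_labels`, the `OTHERSTUDY_1.00` line and the use of `\Q...\E` around the build code are assumptions made for this example, and the exclusion pattern is the one Kenosis posted earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines for $build that survive the exclusions.
sub wanted_labels {
    my ($build, @lines) = @_;
    my @kept;
    for my $data_line (@lines) {
        # Not this study: move on to the next line.
        next unless $data_line =~ /\Q$build\E/;
        # Excluded prefix or four-part release label: move on as well.
        next if $data_line =~ /^(?:bl|_)|\.\d{3}\.\d{3,}$/i;
        # Only lines we actually want reach this point.
        push @kept, $data_line;
    }
    return @kept;
}

# A few lines from the question plus one invented line for another study.
my @lines = map { "$_\n" } qw(
    _STUDYABCD1234_1.00
    STUDYABCD1234_1.00
    p_STUDYABCD1234_1.00.000.5678
    bl_STUDYABCD1234_1.00.000
    OTHERSTUDY_1.00
);

print wanted_labels('STUDYABCD1234', @lines);   # prints STUDYABCD1234_1.00
```

Each `next` handles one reason for skipping a line, so the body of the loop never needs to be wrapped in an `if` block.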
Re^2: Remove blank lines from REGEX output by Deep_Plaid (Acolyte) on Feb 05, 2014 at 21:32 UTC
Thanks, Sundial. I prefer more human readable approaches.
Re^3: Remove blank lines from REGEX output by Laurent_R (Canon) on Feb 05, 2014 at 22:28 UTC
I think that you are wrong on that. The `next` approach is very human readable and a very efficient way to build a decision tree in many cases. Suppose that you have a set of business rules specifying which lines of a file you want to process and which you want to discard. You can do it this way:

while (<$IN>) {
    chomp;
    next if /^#/;                      # discard line, it is a comment (starts with #)
    next if /^\s*$/;                   # discard line, contains only spaces
    next if length($_) < $min_length;  # line is too short
    next if /^REM/;                    # another form of comment
    next unless /^.{3}\d{4}/;          # lines of interest have 4 digits from position 4 to 7
    # now the real processing
    ...
}

This is much cleaner and much more readable than a long series of nested `if ... elsif ... elsif ...` tests. It is also often quite efficient, because as soon as you discard a line for one reason, none of the subsequent tests has to run (of course, it will be more efficient if you are able to put the most common causes for exclusion first and the rare ones last). There are other ways of achieving similar results. For example, you could have:

while (<$IN>) {
    chomp;
    next if /^#/ or /^\s*$/ or length($_) < $min_useful_length or /^REM/
        or not /^.{3}\d{4}/;
    ...

This is more concise, and any condition evaluating to TRUE will also short-circuit the subsequent conditions, so the performance will be similar, but it removes the opportunity to document the business rules that led to exclusion. I might use either of the two techniques, depending on the situation, but if the business rules are somewhat complicated or numerous, I prefer the first one.
Re^4: Remove blank lines from REGEX output by AnomalousMonk (Archbishop) on Feb 06, 2014 at 00:22 UTC
Re^5: Remove blank lines from REGEX output by Laurent_R (Canon) on Feb 06, 2014 at 11:02 UTC

