Perl - Remove duplicate based on substring and check on delimiters

bopibopi has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have the following input file:

1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx

Delimited with x, and the column of interest is 0-7, see below. I'm trying to write a script that reads each line, counts the x's and compares the count against a set number; if the count is != the set number, I want the line written to fh1 (output.control). Then it'll check a specific substring on each line and print only the first line encountered for each substring value. (Remove duplicates but maintain order.)

The code I have so far is:

#!/usr/bin/perl

use strict;
# use warnings qw/ all FATAL /;

my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;



open(my $fh1, ">>", "outputcontrol.txt");
open(my $fh2, ">>", "outputoutput.txt");

while ( <> ) {
   
    my $count = ($_ =~ y/x//);
    print  "$count \n";
    # print $_;
    
    if ( $count != $delim_amnt_per_line )
    {
        print fh1 $_;
    }
    

    my ($prefix) = substr $_, 0, 7;
    next if $seen{$prefix}++;

    print $fh2;
}

My problem is that it doesn't print anything to either file, whereas if it were just print and I redirected the script's output from the command line, it would output as normal. Can someone help me?

EDIT: I think I've located the problem. Neither of the produced files had write permission. They were just set on read, is there a way to change this from inside the code?


Replies are listed 'Best First'.
Re: Perl - Remove duplicate based on substring and check on delimiters by haukex (Archbishop) on May 18, 2016 at 21:18 UTC
Hi bopibopi,

Your code has a couple of issues: You don't check your open calls for errors (`open(...) or die $!;`), you seem to have a typo on the line `print fh1 $_;` (should be `$fh1`), and ~~there's a closing brace missing (a copy/paste mistake I assume)~~ (apparently fixed by ninja edit... It is uncool to update a node in a way that renders replies confusing or meaningless).

Also, `print $fh2;` prints the filehandle to standard output; if you want to print the current line to `$fh2` you have to be explicit: `print $fh2 $_;`

"They were just set on read, is there a way to change this from inside the code?"

I'd recommend you don't, because write-protection is supposed to be exactly that! Someone someday (including you) might set a file to read-only for a good reason and the script would clobber it anyways. I recommend you output a descriptive error message instead, e.g. `die "I can't write to $filename\n" unless -w $filename;` (see -X). But if you must ("just enough rope" and all that), there's chmod.

Hope this helps,
-- Hauke D

P.S. Just saw stevieb and linuxer were a little faster than me on several points :-)
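Putting those fixes together, here is a minimal sketch of how the corrected script might look. It is not the original poster's final code: `filter_lines` is a hypothetical helper, the expected count of 6 is an assumption based on the regular sample lines (the question's script used 5), and the temporary directory is only there so the demo does not touch real files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# filter_lines is a hypothetical helper: it returns the lines whose 'x'
# count differs from $expected, plus the first line seen for each
# 7-character prefix (input order preserved).
sub filter_lines {
    my ($expected, @lines) = @_;
    my (%seen, @control, @unique);
    for my $line (@lines) {
        my $count = ($line =~ tr/x//);      # tr/// returns the match count
        push @control, $line if $count != $expected;

        my $prefix = substr $line, 0, 7;
        next if $seen{$prefix}++;           # this prefix was already kept
        push @unique, $line;
    }
    return (\@control, \@unique);
}

# Sample lines from the question, newline-terminated as <> would deliver them.
my @input = map { "$_\n" } qw(
    1212123x534534534534xx4545454x232322xx
    0901001x876879878787xx0909918x212245xx
    1212123x534534534534xx4545454x232323xx
    1212133x534534534534xx4549454x232322xx
    4352342xx23232xxx345545x45454x23232xxx
);

# Assumption: 6 is the expected delimiter count for a well-formed line.
my ($control, $unique) = filter_lines(6, @input);

# Assumption: a temporary directory stands in for the real output location.
my $dir = tempdir(CLEANUP => 1);
my %output = (
    "$dir/outputcontrol.txt" => $control,
    "$dir/outputoutput.txt"  => $unique,
);

for my $file (sort keys %output) {
    open(my $fh, '>>', $file) or die "Cannot open $file for appending: $!";
    print {$fh} @{ $output{$file} };        # explicit handle and explicit data
    close($fh) or die "Cannot close $file: $!";
}

print "control lines: ", scalar(@$control), "\n";
print "unique lines:  ", scalar(@$unique), "\n";
```

As in the question's script, a line with the wrong delimiter count is still passed on to the duplicate check; add a `next` after the `push @control` if such lines should be excluded from the main output.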
Re: Perl - Remove duplicate based on substring and check on delimiters by stevieb (Canon) on May 18, 2016 at 21:07 UTC
Always, always check to ensure your file actually opened properly:

open my $fh1, ">>", "outputcontrol.txt" or die $!;
open my $fh2, ">>", "outputoutput.txt" or die $!;

I don't know if that's the issue, but it's definitely the first thing to try.
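One detail worth spelling out when open is written without parentheses: the check has to use the low-precedence `or`, not `||`. The sketch below tries to illustrate the difference; the path is a deliberately nonexistent one inside a temporary directory, and the variable names are made up for the demo.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# A path whose parent directory does not exist, so open must fail.
my $dir     = tempdir(CLEANUP => 1);
my $missing = "$dir/no/such/dir/out.txt";

# 'or' binds more loosely than the argument list, so it tests the
# return value of open and the die fires.
my $died_with_or = !eval {
    open my $fh, '>>', $missing or die "open failed: $!\n";
    1;
};

# '||' binds more tightly than the comma, so it only tests the filename
# string (which is true); open fails silently and die never runs.
my $died_with_pipes = !eval {
    open my $fh, '>>', $missing || die "open failed: $!\n";
    1;
};

print "with or: ", ($died_with_or    ? "died" : "no error reported"), "\n";
print "with ||: ", ($died_with_pipes ? "died" : "no error reported"), "\n";
```

Writing `open(...) or die ...` with explicit parentheses, as in the other replies, sidesteps the question entirely.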
Re: Perl - Remove duplicate based on substring and check on delimiters by linuxer (Curate) on May 18, 2016 at 21:08 UTC
You can use chmod to change a file's permissions if you have sufficient permissions to chmod the file.

You should check open's success, so you know immediately whether open succeeded and can react accordingly.

open(my $handle, '>>', $filename) or die "open($filename,w+) failed: $!";

edit: fixed typo
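If changing the permissions from inside the script really is wanted, a sketch along these lines might do it. `make_owner_writable` is a hypothetical helper, and the throwaway temp file only exists to make the example self-contained; the caveats about overriding write protection mentioned elsewhere in the thread still apply.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# make_owner_writable is a hypothetical helper: it adds the owner-write bit
# to the file's existing permissions and leaves the other bits untouched.
sub make_owner_writable {
    my ($filename) = @_;
    my @st = stat($filename) or die "stat($filename) failed: $!";
    my $mode = $st[2] & 07777;              # strip the file-type bits
    return $mode if $mode & 0200;           # already owner-writable
    chmod($mode | 0200, $filename) == 1     # chmod returns the number of files changed
        or die "chmod($filename) failed: $!";
    return $mode | 0200;
}

# Demo on a throwaway file that starts out read-only.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close($tmp_fh) or die "close failed: $!";
chmod(0444, $filename) == 1 or die "chmod($filename) failed: $!";

my $new_mode = make_owner_writable($filename);

open(my $handle, '>>', $filename) or die "open($filename, >>) failed: $!";
print {$handle} "appended after chmod\n";
close($handle) or die "close($filename) failed: $!";

printf "mode is now %04o\n", $new_mode;
```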
Re: Perl - Remove duplicate based on substring and check on delimiters by Marshall (Canon) on May 18, 2016 at 23:39 UTC
Another way without using substr (which is actually seldom used in Perl) is to use split, like a simple CSV file would be parsed, except with 'x' instead of ','.

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

while (my $line = <DATA>)
{
    chomp $line;
    print "line = $line\n";
    # A limit of -1 keeps trailing empty fields; the list assignment in
    # scalar context yields the number of fields split returned.
    my $tokens = (my $first, my @rest) = split 'x', $line, -1;
    print "num tokens is: $tokens\n";
    print Dumper $first, \@rest;
    print "\n";
}

=prints
line = 1212123x534534534534xx4545454x232322xx
num tokens is: 7
$VAR1 = '1212123';
$VAR2 = [
          '534534534534',
          '',
          '4545454',
          '232322',
          '',
          ''
        ];

line = 0901001x876879878787xx0909918x212245xx
num tokens is: 7
$VAR1 = '0901001';
$VAR2 = [
          '876879878787',
          '',
          '0909918',
          '212245',
          '',
          ''
        ];

line = 1212123x534534534534xx4545454x232323xx
num tokens is: 7
$VAR1 = '1212123';
$VAR2 = [
          '534534534534',
          '',
          '4545454',
          '232323',
          '',
          ''
        ];

line = 1212133x534534534534xx4549454x232322xx
num tokens is: 7
$VAR1 = '1212133';
$VAR2 = [
          '534534534534',
          '',
          '4549454',
          '232322',
          '',
          ''
        ];

line = 4352342xx23232xxx345545x45454x23232xxx
num tokens is: 11
$VAR1 = '4352342';
$VAR2 = [
          '',
          '23232',
          '',
          '',
          '345545',
          '45454',
          '23232',
          '',
          '',
          ''
        ];

=cut

__DATA__
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
Re^2: Perl - Remove duplicate based on substring and check on delimiters by AnomalousMonk (Archbishop) on May 19, 2016 at 01:54 UTC
That gives an off-by-one `$tokens` value (it's actually counting the stuff "around" the tokens (update: and it requires creation of an otherwise unused array to hold most of that stuff)), but that's easy to fix:

c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
"my $t = 'x';
 ;;
 for my $line (qw(
   1212123x534534534534xx4545454x232322xx
   0901001x876879878787xx0909918x212245xx
   1212123x534534534534xx4545454x232323xx
   1212133x534534534534xx4549454x232322xx
   4352342xx23232xxx345545x45454x23232xxx
   )) {
   my $tokens = my ($first, @rest) = split $t, $line, -1;
   $tokens -= 1;
   print qq{'$line': num '$t' tokens is: $tokens};
   dd ($first, \@rest);
   }
 "
'1212123x534534534534xx4545454x232322xx': num 'x' tokens is: 6
(1212123, [534534534534, "", 4545454, 232322, "", ""])
'0901001x876879878787xx0909918x212245xx': num 'x' tokens is: 6
("0901001", [876879878787, "", "0909918", 212245, "", ""])
'1212123x534534534534xx4545454x232323xx': num 'x' tokens is: 6
(1212123, [534534534534, "", 4545454, 232323, "", ""])
'1212133x534534534534xx4549454x232322xx': num 'x' tokens is: 6
(1212133, [534534534534, "", 4549454, 232322, "", ""])
'4352342xx23232xxx345545x45454x23232xxx': num 'x' tokens is: 10
(
  4352342,
  ["", 23232, "", "", 345545, 45454, 23232, "", "", ""],
)

(But I don't really see anything wrong with using good old `tr///` for counting and poor old substr for fixed-field extraction.)

Update: This gets rid of `@rest` and the `$tokens -= 1;` statement for all you one-liner addicts out there:

my $tokens = (my ($first) = split $t, $line, -1) - 1;

Give a man a fish: `<%-{-{-{-<`
Re^3: Perl - Remove duplicate based on substring and check on delimiters by Marshall (Canon) on May 19, 2016 at 02:52 UTC
I think we are splitting hairs here. I count $first as the first token, you don't. Or you figure that the final empty token shouldn't be counted? Either way not a significant problem in my mind.

Yes, tr is the fastest and best way to do a simple count of the x's. And yes, substr is the fastest way to get a fixed-length thing at the beginning. The reason that I demo'd split was to show: a) how to get a non-fixed-length thing at the beginning, and b) how to access some of the other variable-length fields "between the x's". I'm sure that they have some meaning.

Update: I almost never use the -1 limit on split. I saw an opportunity to play with this and remind myself of how it worked. Once I had done that, I impulsively posted my "play". Wasn't meant to be "earth shattering" stuff, just an example of a not so common usage that is often forgotten.
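For anyone else who has forgotten what the -1 limit does, here is a small sketch comparing the three counts on the first sample line. The sub names are made up for the illustration; the point is that split drops trailing empty fields unless a negative limit is given, so only the -1 form agrees with `tr`.

```perl
use strict;
use warnings;

# Count the delimiters directly; tr/// returns the number of matches.
sub count_with_tr {
    my ($line) = @_;
    return ($line =~ tr/x//);
}

# Count the delimiters as (number of fields - 1). With no limit, split
# discards trailing empty fields; with a limit of -1 it keeps them.
sub count_with_split {
    my ($line, $limit) = @_;
    my @fields = defined $limit
        ? split(/x/, $line, $limit)
        : split(/x/, $line);
    return scalar(@fields) - 1;
}

my $line = '1212123x534534534534xx4545454x232322xx';

printf "tr: %d, split with -1: %d, split without limit: %d\n",
    count_with_tr($line),
    count_with_split($line, -1),
    count_with_split($line);
# prints "tr: 6, split with -1: 6, split without limit: 4"
```

The fields are collected into an array on purpose: assigning split to a list of scalars without an explicit limit makes Perl pick an implicit limit, which would skew the count.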
Re^4: Perl - Remove duplicate based on substring and check on delimiters by AnomalousMonk (Archbishop) on May 19, 2016 at 03:16 UTC
Re^2: Perl - Remove duplicate based on substring and check on delimiters by johngg (Canon) on May 19, 2016 at 11:17 UTC
"without using substr (which is actually seldom used in Perl)"

Surely, you jest?!?!

Cheers,
JohnGG
Re^3: Perl - Remove duplicate based on substring and check on delimiters by Marshall (Canon) on May 19, 2016 at 23:27 UTC
Sorry for the controversy - not my intent. I should've said something different or omitted that entirely.

I use Perl often to process all kinds of text reports. By far and away, the most common tools that I use are: a) split and b) match global combined with c) regex. In my typical application, speed doesn't matter, but flexibility does. It is very seldom that I encounter a fixed-column report where substr would be appropriate. That doesn't mean that I don't use substr, just that in my personal experience, with the types of text reports that I process, it doesn't come up. Mileage Varies!

Processing a binary header, say like that found in a .WAV file, is a whole different critter; substr is definitely the right tool for that job. I am talking about text reports.

Just yesterday, a file that I've been processing since 2011 changed its format. Oops. The same info is there, but it got moved around. The 2016 format is different and I have no control over that change. But this change was easy for me to adapt to and was something like this: `(split ' ',$line)[1,7,3]` to `(split ' ',$line)[1,4,-2]`. If I had used substr(), then this would have been a bigger deal.

Changing something that has been working for 5 years comes up all the time. Such is the nature of using ad hoc methods to parse reports that you have no control over.
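To make the list-slice idiom concrete, here is a small sketch. The two report lines and the `pick_fields` helper are invented for the illustration (the real report format is not shown in the thread); only the index lists `[1,7,3]` and `[1,4,-2]` come from the post above.

```perl
use strict;
use warnings;

# pick_fields is a hypothetical helper: split on whitespace and return the
# fields at the given positions. split ' ' skips leading whitespace, and a
# negative index counts from the end of the field list.
sub pick_fields {
    my ($line, @indexes) = @_;
    return (split ' ', $line)[@indexes];
}

# Invented lines carrying the same three values in two different layouts.
my $old_line = '  id42 alice a 17.5 b c d 2016-05-18';
my $new_line = 'id42 alice a b 2016-05-18 c 17.5 d';

my @old = pick_fields($old_line, 1, 7, 3);
my @new = pick_fields($new_line, 1, 4, -2);

print "old layout: @old\n";   # prints "old layout: alice 2016-05-18 17.5"
print "new layout: @new\n";   # prints "new layout: alice 2016-05-18 17.5"
```

Adapting to the new layout is a matter of changing the index list, and whitespace-delimited fields can change width without affecting the result, which a fixed-offset substr would not tolerate.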
