in reply to Perl - Remove duplicate based on substring and check on delimiters
Another way, without using substr (which is actually seldom used in Perl), is to use split, much as a simple CSV file would be parsed, except with 'x' instead of ',' as the delimiter.
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
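while (my $line = <DATA>)
{
    chomp $line;
    print "line = $line\n";
    # Limit of -1 keeps trailing empty fields; the list assignment in
    # scalar context yields the number of fields returned by split.
    my $tokens = (my $first, my @rest) = split 'x', $line, -1;
    print "num tokens is: $tokens\n";
    print Dumper $first, \@rest;
    print "\n";
}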
=prints
line = 1212123x534534534534xx4545454x232322xx
num tokens is: 7
$VAR1 = '1212123';
$VAR2 = [
'534534534534',
'',
'4545454',
'232322',
'',
''
];
line = 0901001x876879878787xx0909918x212245xx
num tokens is: 7
$VAR1 = '0901001';
$VAR2 = [
'876879878787',
'',
'0909918',
'212245',
'',
''
];
line = 1212123x534534534534xx4545454x232323xx
num tokens is: 7
$VAR1 = '1212123';
$VAR2 = [
'534534534534',
'',
'4545454',
'232323',
'',
''
];
line = 1212133x534534534534xx4549454x232322xx
num tokens is: 7
$VAR1 = '1212133';
$VAR2 = [
'534534534534',
'',
'4549454',
'232322',
'',
''
];
line = 4352342xx23232xxx345545x45454x23232xxx
num tokens is: 11
$VAR1 = '4352342';
$VAR2 = [
'',
'23232',
'',
'',
'345545',
'45454',
'23232',
'',
'',
''
];
=cut
__DATA__
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
Re^2: Perl - Remove duplicate based on substring and check on delimiters
by AnomalousMonk (Archbishop) on May 19, 2016 at 01:54 UTC
That gives an off-by-one $tokens value (it's actually counting the stuff "around" the tokens (update: and it requires creation of an otherwise unused array to hold most of that stuff)), but that's easy to fix:
c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
"my $t = 'x';
;;
for my $line (qw(
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
)) {
my $tokens = my ($first, @rest) = split $t, $line, -1;
$tokens -= 1;
print qq{'$line': num '$t' tokens is: $tokens};
dd ($first, \@rest);
}
"
'1212123x534534534534xx4545454x232322xx': num 'x' tokens is: 6
(1212123, [534534534534, "", 4545454, 232322, "", ""])
'0901001x876879878787xx0909918x212245xx': num 'x' tokens is: 6
("0901001", [876879878787, "", "0909918", 212245, "", ""])
'1212123x534534534534xx4545454x232323xx': num 'x' tokens is: 6
(1212123, [534534534534, "", 4545454, 232323, "", ""])
'1212133x534534534534xx4549454x232322xx': num 'x' tokens is: 6
(1212133, [534534534534, "", 4549454, 232322, "", ""])
'4352342xx23232xxx345545x45454x23232xxx': num 'x' tokens is: 10
(
4352342,
["", 23232, "", "", 345545, 45454, 23232, "", "", ""],
)
(But I don't really see anything wrong with using good old tr/// for counting and poor old substr for fixed-field extraction.)
Update: This gets rid of @rest and the $tokens -= 1; statement for all you one-liner addicts out there:
my $tokens = (my ($first) = split $t, $line, -1) - 1;
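For comparison, the tr/// and substr approach mentioned above might look something like the following. This is only a minimal sketch: the sub names count_x and leading_field are made up for illustration, and the default field width of 7 is an assumption read off the sample data, not something stated by the OP.

```perl
use strict;
use warnings;

# Count the 'x' delimiters. With an empty replacement list and no /d
# flag, tr/// leaves the string unchanged and just returns the count.
sub count_x {
    my ($line) = @_;
    return ($line =~ tr/x//);
}

# Extract a fixed-width leading field. The default width of 7 is an
# assumption based on the sample lines in this thread.
sub leading_field {
    my ($line, $width) = @_;
    $width = 7 unless defined $width;
    return substr $line, 0, $width;
}

for my $line (qw(
    1212123x534534534534xx4545454x232322xx
    4352342xx23232xxx345545x45454x23232xxx
)) {
    printf "%s: first=%s, x count=%d\n",
        $line, leading_field($line), count_x($line);
}
```

For the two sample lines this reports counts of 6 and 10, the same numbers as the corrected split version above.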
Give a man a fish: <%-{-{-{-<
I think we are splitting hairs here. I count $first as the first token, you don't. Or you figure that the final empty token shouldn't be counted? Either way not a significant problem in my mind.
Yes, tr is the fastest and best way to do a simple count of the x's. And yes, substr is the fastest way to get a fixed length thing at the beginning. The reason that I demo'd split was to show: a) how to get a non-fixed-length thing at the beginning, and b) how to access the other fields between the x's, whatever their lengths. I'm sure that they have some meaning.
Update: I almost never use the -1 limit on split. I saw an opportunity to play with this and remind myself of how it worked. Once I had done that, I impulsively posted my "play". Wasn't meant to be "earth shattering" stuff, just an example of a not so common usage that is often forgotten.
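As a reminder of what the -1 limit changes, here is a small sketch comparing it with the default behavior. The helper name fields and its $keep_trailing flag are hypothetical, introduced only for this illustration.

```perl
use strict;
use warnings;

# Split $line on the literal delimiter $delim.
# With $keep_trailing true, a limit of -1 is passed so trailing empty
# fields are preserved; otherwise split drops them (its default).
sub fields {
    my ($line, $delim, $keep_trailing) = @_;
    return $keep_trailing
        ? split(/\Q$delim\E/, $line, -1)
        : split(/\Q$delim\E/, $line);
}

my $line    = '1212123x534534534534xx4545454x232322xx';
my @all     = fields($line, 'x', 1);
my @trimmed = fields($line, 'x', 0);

printf "with -1: %d fields, default: %d fields\n",
    scalar @all, scalar @trimmed;
```

On this sample line the -1 version returns 7 fields (matching the count in the original post), while the default returns 5 because the two trailing empty fields are discarded. Empty fields in the middle are kept either way.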
... do a simple count of the x's. ... get a fixed length thing at the beginning.
But I understood that to be what the OP was asking for, at least as a starting point for a larger application. (bopibopi actually seemed to have the counting and extracting part under control, and was asking for help with the subsequent pieces.) Using split may be a good example of doing something slightly different. We may not be so much splitting hairs here as comparing apples and oranges — or perhaps tangerines and oranges since we're not really all that far apart.
Update:
... the -1 limit on split ... an example of a not so common usage ...
As someone addicted to "not so common usage" myself, I can sympathize. (But I have it under control; I haven't used uncommonly in ages!)
Give a man a fish: <%-{-{-{-<
Re^2: Perl - Remove duplicate based on substring and check on delimiters
by johngg (Canon) on May 19, 2016 at 11:17 UTC
without using substr (which is actually seldom used in Perl)
Surely, you jest?!?!
Sorry for the controversy - not my intent. I should've said something
different or omitted that entirely.
I use Perl often to process all kinds of text reports. Far and away,
the most common tools that I use are: a) split and b) global match (m//g) combined with
c) regex. In my typical application, speed doesn't matter, but flexibility does.
It is very seldom that I encounter a fixed column report where substr would
be appropriate.
That doesn't mean that I don't use substr, just that
in my personal experience, with the types of text reports that I process,
it doesn't come up. Mileage varies! Processing a binary header, say like
that found in a .WAV file, is a whole different critter; substr is definitely
the right tool for that job. I am talking about text reports.
Just yesterday, a file that I've been processing since 2011 changed its format.
Oops. The same info is there, but it got moved around. The 2016 format
is different and I have no control over that change. But this change was
easy for me to adapt to and was something like this:
(split ' ',$line)[1,7,3] to (split ' ',$line)[1,4,-2]. If I had
used substr(), then this would have been a bigger deal. Changing something
that has been working for 5 years comes up all the time. Such is the nature
of using ad hoc methods to parse reports that you have no control over.
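To make the list-slice idea concrete, here is a rough sketch. The report line is invented for illustration (it is not from my actual file), and pick_columns is a hypothetical helper name.

```perl
use strict;
use warnings;

# Pick whitespace-separated columns by index. Negative indices count
# from the end of the list, so -2 is the second-to-last column.
sub pick_columns {
    my ($line, @indices) = @_;
    return (split ' ', $line)[@indices];
}

# Hypothetical report line, made up for this example.
my $line   = 'id42  alpha beta   gamma delta epsilon zeta';
my @picked = pick_columns($line, 1, 4, -2);
print "@picked\n";    # prints "alpha delta epsilon"
```

When the columns move, only the index list changes. A negative index also keeps working if extra columns get inserted in the middle, as long as the wanted field stays the same distance from the end.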