Ovid has asked for the wisdom of the Perl Monks concerning the following question:
People often tell me that Perl resembles line noise and they don't want to use it as a result. I've written a small script to analyze some SQL statements and I've written "line noise" and am soliciting advice on how to clean the code.
I have a sample of SQL from the MS-SQL profiler. I need to identify all unique "INSERT INTO Photo..." statements. My thought was to read the file, select lines that have the approriate sql and break them into a three part array with the middle element containing the field names. Then I push a reference to that array onto another array. Next, I sort that array on the field names and print unique SQL inserts (tracking them by stuffing the first instance of each combination of field names in a hash). Here's the code:
use warnings;
use strict;
my ( @results, %fields );
while(<>){
s/\0//g;
if (/(?<!from )INSERT INTO Photo/) {
my @s = ( /([^(]+)(\([^)]+\))(.*)/ );
push @results, \@s;
}
}
my @final = sort{ $a->[1] cmp $b->[1] } @results;
open OUT, ">results.txt" or die $!;
foreach( @final ) {
my $fieldNames = $_->[1];
if ( ! exists $fields{ $fieldNames } ) {
$fields{ $fieldNames } = 1;
print OUT @{ $_ }
}
}
close OUT;
Here's some sample input (heavily edited because the SQL statements are HUGE):
exec new_int_id 'documents', 'did'
go
INSERT documents ( did, DocID, Template ) VALUES ( 3093, '000000000000
+000000003093', 'Photo' )
go
SELECT * FROM documents WHERE did=3093
go
INSERT INTO Photo (AccessLevelID, ANRNumber, ColorID, ContactSheetID)
+VALUES (1, '225058', 1, 4)
go
exec new_int_id 'Photo', 'PhotoID'
go
exec new_str_id 'documents', 'DocId'
go
INSERT INTO Photo (QualityID, Remark, UniqueID, VolNum) VALUES (1, 'St
+. Nicolaasfeest -v. Moorsel', '211357', '1')
go
My code works fine, but it's a great example of the "line noise" problem. Unfortuanately, I really can't work my mind around the problem of cleaning the code to the point where it's comprehensible to someone with a low level of Perl knowledge. I could break the regex out with the /x modifier, but that's only a partial solution. Any advice would be appreciated.
Cheers,
Ovid
Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.
Re: Trying to avoid line noise (brain cramp)
by Tyke (Pilgrim) on Mar 09, 2001 at 17:18 UTC
|
This isn't going to be much help, but regular
expressions are always going to look like line
noise to someone who doesn't grok them.
If you're writing for someone with little Perl
knowledge, then all you can do is
- Write clean tight code. (which you have)
- Document it (which you were about to), even
if it's on a line by line basis.
_rant_on_
I'm personally getting to the stage where I am
p**d off about pandering to developpers who want to
use a programming language without making an effort
to learn, not just the syntax, but also the idioms.
If regular expressions are part of the language, then
use them as they were meant to be used. If conditional
statement modifiers are there, then what's the
point of pretending that they aren't?
_rant_off_
Just about the only thing that you can do is to teach
the perl-deprived; and one of the most immediate ways to do
that is to fully document the code.
Just my two centimes.
--
I knock my pate and fancy wit will come
Knock as I please, there's no one at home
a pontiff paraphrased
| [reply] [d/l] |
Re: Trying to avoid line noise (brain cramp)
by Corion (Patriarch) on Mar 09, 2001 at 17:23 UTC
|
The main problem I see is, that you are using idiomatic Perl
to solve the problem, which makes the solution elegant
but requires Perl knowledge to understand the stuff.
I see only one good solution to your problem, which does not
change the code, because I see only minor things
to change there (in fact, I'd change the die part to something
like ... or die "couldn't open output file '$filename' : $!\n".
I would put a comment that describes what each of the idiomatic
constructs does, like this :
my ( @results, %fields );
while(<>){
# Strip all NULL characters out of the current line
s/\0//g;
if (/(?<!from )INSERT INTO Photo/) {
# Split that line into three parts and save a reference to i
+t
my @s = ( /([^(]+)(\([^)]+\))(.*)/ );
push @results, \@s;
}
}
my @final = sort{ $a->[1] cmp $b->[1] } @results;
open OUT, ">results.txt" or die $!;
foreach( @final ) {
# Print only one occurence of each statement
my $fieldNames = $_->[1];
if ( ! exists $fields{ $fieldNames } ) {
$fields{ $fieldNames } = 1;
print OUT @{ $_ }
}
}
close OUT;
This will of course need some learning of Perl on the
recipient side, but I think that those comment lines should
help the "casual reader" to understand the program flow
(if the "casual reader" knows english that is. | [reply] [d/l] |
Re: Trying to avoid line noise (brain cramp)
by clemburg (Curate) on Mar 09, 2001 at 18:38 UTC
|
Avoiding line noise in Perl is mainly about how to
handle regexes, IMHO.
A good tip here is to use variables in the regexes:
# original: /([^(]+)(\([^)]+\))(.*)/
my $not_paren = "[^(]+";
my $paren = "\\(";
my $rest = ".*";
# now reads:
$string =~ /($not_paren)($paren$not_paren$paren)($rest)/;
This will do a good job to make the code readable
to other programmers, in my experience.
Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com | [reply] [d/l] |
|
I like your code, but there's a small error:
# original: /([^(]+)(\([^)]+\))(.*)/
my $not_paren = "[^(]+";
my $leftparen = "\\(";
my $rightparen = "\\)";
my $rest = ".*";
# now reads:
$string =~ /($not_paren)($leftparen$not_paren$rightparen)($rest)/;
Cheers,
Ovid
Join the Perlmonks Setiathome Group or just click on the the link and check out our stats. | [reply] [d/l] |
|
Ah, yes, sorry. So much for not testing everything.
Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com
| [reply] |
|
Here's one more way to de-line-noise-ify the code:
my $leftparen = quotemeta "(";
my $rightparen = quotemeta ")";
This is more legible than "\\(" or "\Q(\E".
-- Frag. | [reply] [d/l] |
|
s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;
into this:
s{ < # opening angle bracket
(?: # Non-backreffing grouping paren
[^>'"] * # 0 or more things that are neither > nor
+' nor "
| # or else
".*?" # a section between double quotes (stingy
+match)
| # or else
'.*?' # a section between single quotes (stingy
+match)
) + # all occurring one or more times
> # closing angle bracket
}{}gsx; # replace with nothing, i.e. delete
It's still not quite so clear as prose, but it is very useful for describing the meaning of each part of the pattern.
Again, hope that helps,
Brother Ade | [reply] [d/l] [select] |
Re: Trying to avoid line noise (brain cramp)
by mirod (Canon) on Mar 09, 2001 at 17:58 UTC
|
Instead of one big regexp that splits the line you could
consume it chunk by chunk, which lets you break the
regexp into smaller pieces and give names to the fields.
In any case you will have to deal with the fact that (s
and )s have to be escaped so it will look ugly no matter what you try.
Is there a Perl 6 RFC that would allow us to change that character?
Update: I have worked on the display part too, and figured
that as naming things is good I should go all the way and
use a hash to store the 3 parts of the SQL instruction. Plus
I am not very familiar with hash slices so it was a good
opportunity to use one ;--)
#!/bin/perl -w
use strict;
my ( @results, %fields );
my $TABLE= 'Photo';
while(<DATA>){
s/\0//g;
if ( s/(?<!from )(INSERT INTO $TABLE\s*)//) {
my $table= $1; # easy to get
+it here
my $fields= $1 if( s/(\([^(]+\))//); # the complex
+one
my $values= $_; # just grab th
+e rest
push @results, { table => $table, # naming thing
+s is good!
fields => $fields,
values => $values
};
}
}
my @final = sort{ $a->{fields} cmp $b->{fields} } @results; # the so
+rt is easier to understand
open OUT, ">results.txt" or die $!;
foreach my $insert ( @final ) {
my $fieldNames = $insert->{fields};
if ( ! exists $fields{ $fieldNames } ) {
$fields{ $fieldNames } = 1;
print @$insert{qw(table fields values)}; # and n
+ow the hash slice
}
}
close OUT;
__DATA__
go
INSERT documents ( did, DocID, Template ) VALUES ( 3093, '000000000000
+000000003093', 'Photo' )
go
SELECT * FROM documents WHERE did=3093
go
INSERT INTO Photo (AccessLevelID, ANRNumber, ColorID, ContactSheetID)
+VALUES (1, '225058', 1, 4)
go
exec new_int_id 'Photo', 'PhotoID'
go
exec new_str_id 'documents', 'DocId'
go
INSERT INTO Photo (QualityID, Remark, UniqueID, VolNum) VALUES (1, 'St
+. Nicolaasfeest -v. Moorsel', '211357', '1')
go
| [reply] [d/l] |
Re: Trying to avoid line noise (brain cramp)
by Masem (Monsignor) on Mar 09, 2001 at 17:40 UTC
|
A couple more comments:
Comment, comment, comment. If you interdisperse enough plain english (or whatever spoken language) amongst your code, it will look less like line noise and more like reasonable coding. In particular for lines that may seem to be 'line noise' due to just how perl handles them, in this case the regex lines, and the sort call. It makes it easy for the non-perl programmers to understand the logic here.
Avoid using the 'shortcut' identifiers like $_ and $1, at least, assign other more descriptive variable names to these values, and if you have to specifically say what those are, comment them as well. It might add overhead, but if you need to design for maintainence and usability of the code, it's probably worth the tradeoff.
For functions like open, exists, etc, where the parans around the arguments aren't necessary, it might help those more used to C and other languages to use parens to group arguements appropriately, such that they look less like 'loose words' and more like function calls.
Dr. Michael K. Neylon - mneylon-pm@masemware.com
||
"You've left the lens cap of your mind on again, Pinky" - The Brain
| [reply] |
|
If you try to avoid $_, $' and the likes, you can't avoid the "English.pm" module.
So, after use English;, instead of $_ you can use $ARG, for $´ you have $PREMATCH, $. is $INPUT_LINE_NUMBER... you get the idea
See perlvar for details.
HTH
Brother Ade
| [reply] [d/l] |
Re: Trying to avoid line noise (brain cramp)
by japhy (Canon) on Mar 09, 2001 at 19:20 UTC
|
<plug type="shameless">You could suggest they check out the OGRE for an explanation of the regex, too. ;) </plug>
But honestly, you might want to avoid the regex altogether. My solution might be construed as worse line noise, though, but it depends on how you comment it.
use warnings;
use strict;
my (@results, @final, %seen);
while(<>){
tr/\0//d; # strip NULs
# if we're INSERTing into Photo (but not from)...
if (/(?<!from )INSERT INTO Photo/) {
# get: BEFORE (INSIDE) AFTER from string
my @s = split /[()]/;
push @results, [ $s[0], "($s[1])", $s[2] ];
}
}
# sort by the values in parens
@final = sort { $a->[1] cmp $b->[1] } @results;
open OUT, ">results.txt" or die $!;
for my $line (@final) {
my $fieldNames = $line->[1];
# if we've never seen these values, print
if (not $seen{$fieldNames}) {
print OUT @$line;
$seen{$fieldName}++;
}
}
close OUT;
Of course, you could split more trickily, but it looks more like line noise then.
# split BEFORE a ( or AFTER a )
my @s = split /(?=\()|(?<=\))/;
push @results, \@s;
### that even lets you get away with leaving out @s
push @results, [ split /(?=\()|(?<=\))/ ];
And you could condense the "seen" logic to:
# display the line if we've never seen the fields
print OUT @$line if not $seen{$line->[1]}++;
But that might be too idiomatic for lazy people to spend time understanding.
japhy --
Perl and Regex Hacker | [reply] [d/l] [select] |
Re: Trying to avoid line noise (brain cramp)
by ChOas (Curate) on Mar 09, 2001 at 17:23 UTC
|
Hi!
I am sorry, but..
I might have misunderstood the question...
would this work: ?
#!/usr/bin/perl -w
use strict;
my %DataBase;
while(<>)
{
$DataBase{$_}++ if /(?<!from )INSERT INTO Photo/;
};
open OUT, ">results.txt" or die $!;
print OUT for sort keys %DataBase;
close OUT;
GreetZ!,
print "profeth still\n" if /bird|devil/; | [reply] [d/l] |
Re: Trying to avoid line noise (brain cramp)
by satchboost (Scribe) on Mar 10, 2001 at 01:11 UTC
|
One issue I would be concerned with is why you have non-PERL coders looking at PERL code. As a test tools developer for a major corporation, we have allowed our users to write "privates" for their own testing needs. We also provided an English-like scripting language (gotta love eval!!) for them to automate most of their activities. Because of this, we have maybe 3 or 4 users who actually bother trying to write their own privates and the rest simply use the provided scripting language.
I guess what I'm saying is that if you treat PERL as a scripting language that anyone with half a brain should be able to read without a problem (like BASIC), then you're going to run into problems. If you treat it as a full-fledged programming language that happens to be interpreted and not compiled, then your users will treat it with the appropriate respect. A given user wouldn't try and read your C, FORTRAN, or COBOL code, would they?
| [reply] |
|
| [reply] |
Re: Trying to avoid line noise (brain cramp)
by markjugg (Curate) on Mar 12, 2001 at 07:09 UTC
|
Out of all the ideas posted above, the one that resonates most
with me so far is the idea of using a hash instead of an array
for part of it. I would start here:
my @s = ( /([^(]+)(\([^)]+\))(.*)/ );
Using a hash would clarify what's being matched here (I'm not sure
myself. ). Maybe it would be something like this (using a hash
slice).
my @s{qw/table fields values/}) = ( /([^(]+)(\([^)]+\))(.*)/ );
that means that later on in the code you would be comparing
{fields} instead of [1], which would be more intuitive. That
s usually a trigger point for me-- If I'm doing much with the
arrays by calling out specific positions like that, I usually ask
myself "would be clearer to use a hash here?".
Perhaps related, one of my favorite uses for hash slices at the
moment is to put the results of a DBI selectrow_array statement into
a hash in a single step like:
@row{qw/name company title/} = $DBH->selectrow_array("...");
-mark | [reply] [d/l] [select] |
|
|