Seeking Algorithm

by WhiteBird (Hermit)
on May 28, 2004 at 03:15 UTC ( [id://357123] )

WhiteBird has asked for the wisdom of the Perl Monks concerning the following question:

I have a medium-sized file of about 6,000 records. They are diagnosis codes for emergency room patient visits, but that's beside the point. Consider them strings of numbers arranged like this:

396.04,305.1,894.2,V321,908.3,,,,,
299.5,785,432.1,305.1,,,,,,
112.3,312,685,422.1,V566,433.2,987.5,,,, etc.

There can be as few as 2 numbers in each record or as many as 10. I need to filter out a certain subset of records. In each set (row) of numbers there is at least one number between 296.1 and 314.0. However, if the row contains "305.1", the record can be excluded if "305.1" is the only number between 296.1 and 314.0 in the row. In the sample rows above, row 1 would be discarded and rows 2 and 3 kept.

The person requesting this data suggested that I dump it into an Excel spreadsheet, sort the data, and manually remove the records that I don't need. That seems way too labor-intensive to me, and I would think that Perl would have some quick and easy way to sort this out. I can't quite get my mind around the best way to do it, however.

I was thinking that I'd read each row and use a regular expression to find rows with 305.1 and then check for the existence of another qualifying number. Based on that I could then either delete the rows I don't need or save the ones I want to a new file. I'm a little rusty with Perl right now, and I don't even know how to start. I thought if I organized it into a node and tossed it out here that some discussion might help get me going. I'd appreciate any suggestions. Thanks.

Replies are listed 'Best First'.
Re: Seeking Algorithm
by Zaxo (Archbishop) on May 28, 2004 at 03:27 UTC

    My first reaction was to split and grep, but since the numbers are so tightly defined, I think a regex should do,

    my @keepers;
    open my $fh, '<', '/path/to/file.dat' or die $!;
    while (<$fh>) {
        while ( /(\d+\.\d+)/g ) {
            push @keepers, $_ and last
                if $1 >= 296.1 && $1 <= 314.0;
        }
    }
    close $fh or die $!;
    The while (/(foo)/g) construction loops over all the matches, putting the matching text in $1.
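
    For instance, a tiny illustration of that looping (made-up data, just to show the behaviour):

        my $codes = '396.04,305.1,894.2';
        while ( $codes =~ /(\d+\.\d+)/g ) {
            print "matched $1\n";    # prints 396.04, 305.1, 894.2 in turn
        }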

    Update: Missed the exclusion for 305.1, grep it is:

    my @keepers;
    open my $fh, '<', '/path/to/file.dat' or die $!;
    while (<$fh>) {
        my @data = grep { /^\d+\.\d+$/ && $_ >= 296.1 && $_ <= 314.0 }
                   split ',';
        if (@data) {
            push @keepers, $_
                unless @data == 1 and $data[0] eq '305.1';
        }
    }
    close $fh or die $!;

    Update2: CountZero++ is correct, code amended.

    After Compline,
    Zaxo

      Are you sure your updated version works? Some of the data have a 'V' prefixed to the figures and you do not seem to filter this out as you did in your first example.

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Seeking Algorithm (jump early)
by tye (Sage) on May 28, 2004 at 04:13 UTC

    I'd probably do it like this to avoid splitting and comparing items after a match is already found for a line:

    while( <> ) {
        while( /(\d+(?:\.\d+)?)/g ) {
            if( $1 ne '305.1'  &&  296.1 <= $1  &&  $1 <= 314.0 ) {
                print;
                last;
            }
        }
    }

    (updated slightly)

    - tye        

Re: Seeking Algorithm
by atcroft (Abbot) on May 28, 2004 at 03:35 UTC

    Just looking briefly at it, the following comes to mind. I'm going to make a bit of an assumption: that you just want to keep a copy of the lines for processing later.

    # $filename is already defined
    my (@keptrecords);
    open(INFILE, $filename)
        or die("Can't open $filename for input: $!\n");
    while (my $line = <INFILE>) {
        my $tokeep         = 0;
        my $checkexclusion = 0;
        my @parts = split(/,/, $line);
        foreach (@parts) {
            $checkexclusion++ if ($_ eq '305.1');
            $tokeep++ if (( 296.1 <= $_ ) and ( $_ <= 314.0 ));
        }
        next if (( $checkexclusion ) and ($tokeep <= 1));
        push(@keptrecords, $line);
    }

    Just a thought, based on the idea you presented. Hope that helps....

Re: Seeking Algorithm
by dragonchild (Archbishop) on May 28, 2004 at 13:13 UTC
    Let me guess - you're doing a report on all patients that had a diagnosis of type yadayadayada and they want to exclude the diagnosis for blahblahblah. You were also told that this was a one-off report and they'd never want to do the same thing, but with the code for something else excluded.

    Do yourself a favor and put this into a database. I'm going to assume that each line is a patient. So, create a patient table and a diagnosis table. Then, you can do something like:

    SELECT patient.id
      FROM patient, diagnosis
     WHERE diagnosis.patient = patient.id
       AND diagnosis.code BETWEEN (296.1 AND 314.0)
       AND NOT EXISTS (
             SELECT patient
               FROM diagnosis d_int
              WHERE d_int.patient = patient.id
                AND d_int.code = 305.1
           )
    Or something like that. The benefit here is that you have the data in a database for the next stupid question you get asked.
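
    If it helps as a starting point, here is a rough sketch of loading the file into two such tables with DBI and DBD::SQLite (the table names, column names, and file name are assumptions, and the code is untested):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=er_visits.db', '', '',
                                { RaiseError => 1, AutoCommit => 0 } );

        # one row per visit, one row per diagnosis code
        $dbh->do('CREATE TABLE patient   (id INTEGER PRIMARY KEY)');
        $dbh->do('CREATE TABLE diagnosis (patient INTEGER, code TEXT)');

        my $ins_dia = $dbh->prepare(
            'INSERT INTO diagnosis (patient, code) VALUES (?, ?)');

        my $id = 0;
        open my $in, '<', 'visits.csv' or die "visits.csv: $!";
        while (<$in>) {
            chomp;
            $dbh->do( 'INSERT INTO patient (id) VALUES (?)', undef, ++$id );
            for my $code ( grep { length } split /,/ ) {
                $ins_dia->execute( $id, $code );
            }
        }
        $dbh->commit;

    From there a query like the one above can do the filtering, and the data is around for the next question as well.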

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      <grin>Well, you're right on with your description of the requested task. (Have you spent some time in healthcare?) I had started out with a database, but the potential of 10 diagnosis columns and thinking about splitting the data into tables and writing the query seemed a bit laborious. I've not used the NOT EXISTS syntax before--I'll have to look into that for future reference. Thanks.
        You don't have to work in healthcare to make this kind of observation. It's a very common 'goose-chase' scenario that people fall into when they would really be better off using a database.
Re: Seeking Algorithm
by tkil (Monk) on May 28, 2004 at 07:18 UTC

    Depending on how precise you want to be about equality, you might need to do a fancy floating-point compare:

    my $EPSILON = 0.01;
    sub approx_equal ( $ $ )
    {
        my ( $f1, $f2 ) = @_;
        return abs( $f1 - $f2 ) < $EPSILON;
    }

    The most straightforward filter-style code I could come up with:

    while ( <> )
    {
        my @in_range = grep { /^[\d.+-]+$/ &&
                              296.1 <= $_ && $_ <= 314.0 }
                       split /,\s*/;

        # skip records with only 305.1 (room temp kelvin?) in range
        next if @in_range == 1 &&
                approx_equal( $in_range[0], 305.1 );

        print;
    }

    If this is too slow, it might be faster to bail out as soon as we know we can:
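
    One possible shape for that early exit, reusing approx_equal from above (a rough, untested sketch of the idea, not a drop-in solution):

        LINE: while ( <> )
        {
            for my $code ( split /,\s*/ )
            {
                # the first in-range code that is not 305.1 settles it
                if ( $code =~ /^[\d.+-]+$/
                     && 296.1 <= $code && $code <= 314.0
                     && !approx_equal( $code, 305.1 ) )
                {
                    print;         # keep the record
                    next LINE;     # and skip the rest of its codes
                }
            }
        }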

      There's no need for an "approx_equal" to do a fancy floating-point compare. You get the numbers from a split - so you have strings. Just use string compare and avoid the uncertainty problems you have with floating-point compare.

      Abigail

        There's no need for an "approx_equal" to do a fancy floating-point compare. You get the numbers from a split - so you have strings. Just use string compare and avoid the uncertainty problems you have with floating-point compare.

        At one point in the evolution of my response, I was relying on strings to avoid this problem. Three reasons I switched:

        1. The original post didn't make it clear that all numbers would have exactly one decimal point. If the row contained 305.10, did that count? How about 305.099?
        2. I can never remember if scalars keep both their string and numeric natures at the same time. After using a numeric comparison against a given scalar, is it now just a number, or are both kept around? (I could find out by research or experimentation, as in the quick check after this list, but the fact that I had to think about it, even after 10+ years of using Perl, makes me think that I should avoid this subtlety.)
        3. Finally, it was a way to throw in an educational tidbit "for free". The original author didn't specify the example very precisely; by including this in my response, it would hopefully help them think about it more clearly. (And/or I was showing off. You decide.)
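
        On point 2: for what it's worth, a quick check with Devel::Peek (a core module) suggests that both natures are kept around once they exist. A toy demonstration:

            use Devel::Peek;

            my $code = "305.1";
            Dump $code;               # FLAGS show POK: a string only, so far

            my $in_range = ( $code >= 296.1 );   # now use it as a number
            Dump $code;               # FLAGS show NOK as well: the numeric
                                      # value is cached alongside the string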
Re: Seeking Algorithm
by WhiteBird (Hermit) on May 28, 2004 at 15:40 UTC
    Thanks all for your input and rapid response. This is just what I was looking for--even more than I anticipated. I find that Tye's solution returns what I need when run against my test data. More testing is in order and I'll probably work a little with the other suggestions just to broaden my horizons and explore TIMTOWTDI. Here's the code I'm working from now:
    use strict;

    my $outfile = "\\NewCodes.txt";
    my $infile  = "C:\\TestTBL.txt";
    # $filename is already defined

    open (LOGFILE, ">$outfile") or die "Can't create logfile: $!";
    open NAMES, $infile or die "Can't open input datafile: $!\n";

    while( <NAMES> ) {
        while( /(\d+(?:\.\d+)?)/g ) {
            if( $1 ne '305.1' && 290.0 <= $1 && $1 <= 314.0 ) {
                print LOGFILE;
                last;
            }
        }
    }
    __DATA__
    4106693","01/01/00","305.1","473.9","465.8",,,,,,,
    "SP","000154106774","01/01/00","511.0","305.1",,,,,,,,
    "SP","000154107231","01/01/00","305.1","490","786.50",,,,,,,
    "SP","000154107398","01/01/00","558.9","305.1",,,,,,,,
    "SP","000154108114","01/02/00","305.1","789.00","792.1",,,,,,,
    "MCD","000154108661","01/03/00","305.1","922.31","E960.0",,,,,,,
    "SP","000154110313","01/03/00","305.1","784.0","780.4",,,,,,,
    "SP","000154111573","01/03/00","493.91","305.1",,,,,,,,
    "MCD","000154111581","01/03/00","599.0","305.1","V14.0",,,,,,,
    "MCD","000154115056","01/04/00","789.00","305.1",,,,,,,,
    "SP","000154122133","01/07/00","786.50","305.1",,,,,,,,
    "MCD","000154125507","01/08/00","305.1","789.00","412",,,,,,,
    "SP","000154125736","01/08/00","305.1","E849.5","847.0","847.9","924.8","E812.0",,,,
    "SP","000154126007","01/08/00","490","305.1",,,,,,,,
    "MCD","000154126147","01/08/00","305.1","729.1","346.90",,,,,,,
    "MCD","000154126775","01/09/00","305.1","525.8","V45.89",,,,,,,
    "MCD","000154127127","01/09/00","305.1","648.93","346.90",,,,,,,
    "SP","000154128875","01/10/00","873.63","305.1",,,,,,,,
    "SP","000154129243","01/10/00","789.01","305.1",,,,,,,,
    "SP","000154130535","01/10/00","729.5","305.1",,,,,,,,
    "MCD","000154107771","01/02/00","301.9","780.39",,,,,,,,
    "MCD","000154115773","01/05/00","802.0","910.0","E960.0","E849.9","250.01","311","758.7",,,
    "MCD","000154118241","01/05/00","787.03","295.70","737.30",,,,,,,
    "MCD","000154125621","01/08/00","295.70","714.0",,,,,,,,
    "MCD","000154125957","01/08/00","812.20","E885","E849.9","309.81",,,,,,
    "MCD","000154127101","01/09/00","310.1","305.1","E869.4","E849.0",,,,,,
    "MCD","000154127160","01/09/00","295.90",,,,,,,,,
    "MCD","000154130519","01/10/00","723.4","847.0","784.0","E884.3","E849.9","907.2","344.1","E929.8","305.1","V44.3"
    "MCD","000154131701","01/11/00","300.4",,,,,,,,,
    "MCD","000154134085","01/11/00","298.9","305.1",,,,,,,,
    "MCD","000154141162","01/13/00","558.9","493.90","296.7",,,,,,,
    "MCD","000154141308","01/13/00","784.0","300.01",,,,,,,,
    "MCD","000154146431","01/16/00","312.9","536.9",,,,,,,,
    "MCD","000154152644","01/18/00","313.81",,,,,,,,,
    "MCD","000154154507","01/18/00","298.8","305.1",,,,,,,,
    "MCD","000154158766","01/20/00","842.00","E826.1","E849.9","314.00",,,,,,
    "MCD","000154159631","01/20/00","303.90","305.1",,,,,,,,
    "MCD","000154161007","01/20/00","313.81","311",,,,,,,,
    "MCD","000154161589","01/20/00","786.50","300.00",,,,,,,,
    "MCD","000154165410","01/22/00","305.00",,,,,,,,,
Re: Seeking Algorithm
by tkil (Monk) on May 28, 2004 at 16:19 UTC

    Potentially offensive to some, but in the spirit of TMTOWTDI:

    The person requesting this data suggested that I dump it into an Excel spreadsheet, sort the data, and manually remove the records that I don't need. That seems way too labor-intensive to me

    There are pretty easy ways to do this in Excel. Take a look at the worksheet function COUNTIF; combined with a bit of multi-stage arithmetic and then filtering, you should be able to narrow down your set of data pretty easily.

    Hm, after a bit more research, this might not be so easy after all, since COUNTIF only takes one criterion. This can be worked around in at least two ways: first, count entries above the range, add the number of entries below the range, then subtract that from the total number of entries. Another solution is to use SUMPRODUCT with boolean tests. Either way, it looks like it can be done with three additional columns (# in range, # equal to 305.1, include yes/no).

    Update

    Works out reasonably well. Doing a text import of the CSV gives me 13 columns (A-M). The first three are identifiers (initials, id, and date), then the ten values (columns D-M). Starting at row 2 (after adding a row of headers), I have:

    N2 = COUNTIF(D2:M2,">=296.1") - COUNTIF(D2:M2,">314.0")
    O2 = COUNTIF(D2:M2,305.1)
    P2 = NOT(AND(N2=1,O2=1))

    Workbook here (Excel 2000, 27KB).

    Just out of curiosity, how is one supposed to interpret the values with letters in front?
