http://qs321.pair.com?node_id=332004

blue_cowdawg has asked for the wisdom of the Perl Monks concerning the following question:

I have some ideas on how to do what I want, but I thought I would pose this question as I know that some of the monks may come up with better ideas than mine.

The problem

I have collected some data (blood glucose levels) over about a 4 week period of time. What I would like to do is average those readings by time of day. For instance, here is a subset of the data I have collected:
 

Blood Glucose by Date

Date and Time       Blood Glucose (mg/dL)
2004-02-01 07:01     82
2004-02-01 11:38    172
2004-02-01 22:48    154
2004-02-02 05:38    107
2004-02-02 13:20    117
2004-02-02 23:48    188
As you can see from this subset of the data, my sample times do not occur at exactly the same times every day; they depend on my schedule, when I get up, what dumb meetings interrupt my flow, etc.

What I want to do is average these readings by time of day, so that I calculate averages from 4:30AM (when I get up on days I go to the gym) all the way through until midnight. Some interpolation is going to be called for, since I do not test every hour and my samples are essentially 3 to 4 a day (more when I feel crappy).

Thoughts anyone?


Peter L. Berghold -- Unix Professional
Peter at Berghold dot Net
   Dog trainer, dog agility exhibitor, brewer of fine Belgian style ales. Happiness is a warm, tired, contented dog curled up at your side and a good Belgian ale in your chalice.

Re: Data averages by time of day
by Abigail-II (Bishop) on Feb 26, 2004 at 15:39 UTC
    It's not clear to me what you really want. Perhaps you want "averages" of certain pre-picked points in time. Say, the average of 7AM, 11AM, 3PM, 7PM and 12 midnight. Or maybe you want a way where you can pick a time, plug it into a formula, and the average rolls out of it.

    You might want to interpolate, but there are many ways of interpolating. A piece-wise linear interpolation of successive datapoints is the simplest, but it's also crude. And even then there are decisions to be made. Suppose you want the average level at 11AM, and you have datapoints at 9:15AM and 12:30PM on day 1, and at 8:30AM and 11:55AM on day 2. You can do interpolation between the 9:15AM and 11:55AM times - they are the times closest to 11:00AM, but they are from different days. You can also first interpolate between the two points of the first day, then interpolate between the two points of the second day, and then average.
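
    As a minimal sketch of that piecewise-linear approach, assuming one day's readings are kept as [minutes-since-midnight, value] pairs (the sub name lerp_at and the data layout are made up for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Linearly interpolate a reading at $target (minutes since midnight)
    # from the two surrounding samples of a single day.
    # @samples is a list of [minutes, value] pairs, sorted by time.
    sub lerp_at {
        my ($target, @samples) = @_;
        for my $i (0 .. $#samples - 1) {
            my ($t0, $v0) = @{ $samples[$i] };
            my ($t1, $v1) = @{ $samples[$i + 1] };
            if ($target >= $t0 && $target <= $t1) {
                my $frac = ($target - $t0) / ($t1 - $t0);
                return $v0 + $frac * ($v1 - $v0);
            }
        }
        return undef;    # target falls outside the sampled range
    }

    # Day 1 samples: 07:01 => 82, 11:38 => 172, 22:48 => 154
    my @day1 = ( [ 7*60 + 1, 82 ], [ 11*60 + 38, 172 ], [ 22*60 + 48, 154 ] );
    printf "Estimated 11:00 reading: %.1f\n", lerp_at(11*60, @day1);

    Averaging the per-day estimates at the same target time then gives the second of the two options above.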

    But is linear interpolation appropriate? Many biometric measurements are supposed to vary during the day, perhaps according to a sine wave. You can then do a better approximation of the value by taking this into account. You can even increase accuracy by doing the interpolation (be it linear or something else) over more points. But not blindly: it can also decrease the accuracy, especially if you use the "wrong" interpolation.

    Alternatively, you can make a "best fit" curve using all the measurements of a day, or even all the measurements. With a best fit curve, you typically first decide what form the curve should have (a line, a parabola, a sine wave, etc), and then from all the possible curves of that form, you find the one that minimizes some distance from your measured points. An often used distance is the sum of the squares of the distances of all the measured points to the curve. There are standard techniques for this.
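
    As a minimal plain-Perl sketch of that least-squares idea, fitting the simplest curve form (a straight line) via the closed-form normal equations; the sub name fit_line and the sample data are made up for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Least-squares fit of a line y = a + b*x to [x, y] pairs,
    # minimizing the sum of squared vertical distances to the line.
    sub fit_line {
        my @points = @_;
        my $n = @points;
        my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
        for my $p (@points) {
            my ($x, $y) = @$p;
            $sx  += $x;
            $sy  += $y;
            $sxx += $x * $x;
            $sxy += $x * $y;
        }
        my $b = ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx * $sx);
        my $a = ($sy - $b * $sx) / $n;
        return ($a, $b);    # intercept, slope
    }

    # x = minutes since midnight, y = glucose reading
    my @readings = ( [ 421, 82 ], [ 698, 172 ], [ 1368, 154 ] );
    my ($a, $b) = fit_line(@readings);
    printf "fit: y = %.2f + %.4f * x\n", $a, $b;

    For a sine-shaped or higher-order curve the same principle applies; only the formula for the coefficients changes.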

    4:30AM (when I get up days I go to the gym)
    Man, that can't be healthy. 4:30AM is a time to say goodnight on IRC, write a final post on Perlmonks, save your day's work in CVS, check to see if there's a tape in your backup unit, and hide in your bed from that big hot thing that's about to rise into the skies.

    Abigail

      I question the usefulness of calculating the average. The doctor already knows what a baseline glucose level is supposed to be. I suspect he would be more interested in viewing just a raw plot of the data over whatever period of time the tests were taken; it should be immediately visually apparent where the spikes are, and whether the glucose levels are generally below or above the norm... blue_cowdawg, has your physician given you any indication of what data would be of interest to him/her?
      daN.
Re: Data averages by time of day
by gmax (Abbot) on Feb 26, 2004 at 17:05 UTC

    This sort of problem can be handled with a DBMS engine.

    Something like this?

    #!/usr/bin/perl -w
    use strict;
    use DBI;

    my @data = (
        ['2004-02-01 07:01',  82],
        ['2004-02-01 11:38', 172],
        ['2004-02-01 22:48', 154],
        ['2004-02-02 05:38', 107],
        ['2004-02-02 13:20', 117],
        ['2004-02-02 23:48', 188],
    );

    my $dbh = DBI->connect("dbi:SQLite:averages","","")
        or die $DBI::errstr;

    unless ( -s "averages" ) {
        $dbh->do(qq{
            CREATE TABLE averages (
                d   datetime not null,
                val INTEGER)
        }) or die $DBI::errstr;
        for (@data) {
            $dbh->do( qq{INSERT INTO averages VALUES (?, ?)}, undef, @$_)
                or die $DBI::errstr;
        }
    }

    my $query = qq{
        SELECT substr(d,1,10) as day,
               CASE
                   WHEN substr(d,12,2) BETWEEN  4 AND 12 THEN "1 morning"
                   WHEN substr(d,12,2) BETWEEN 13 AND 18 THEN "2 afternoon"
                   WHEN substr(d,12,2) BETWEEN 19 AND 24 THEN "3 evening"
                   ELSE "4 night"
               END AS tm,
               AVG(val)
          FROM averages
         GROUP BY day, tm
         ORDER BY day, tm
    };

    my $averages = $dbh->selectall_arrayref($query)
        or die $DBI::errstr;

    printf "%-10s %-15s %5.2f\n", @$_ for @$averages;

    __END__
    output:
    2004-02-01 1 morning       127.00
    2004-02-01 3 evening       154.00
    2004-02-02 1 morning       107.00
    2004-02-02 2 afternoon     117.00
    2004-02-02 3 evening       188.00

    You can add/remove/modify the intervals by acting on the CASE statement inside the query.

    If you want an average by time only, without the days, then use this query:

    my $query = qq{
        SELECT CASE
                   WHEN substr(d,12,2) BETWEEN  4 AND 12 THEN "1 morning"
                   WHEN substr(d,12,2) BETWEEN 13 AND 18 THEN "2 afternoon"
                   WHEN substr(d,12,2) BETWEEN 19 AND 24 THEN "3 evening"
                   ELSE "4 night"
               END AS tm,
               AVG(val)
          FROM averages
         GROUP BY tm
         ORDER BY tm
    };

    my $averages = $dbh->selectall_arrayref($query)
        or die $DBI::errstr;

    printf "%-15s %5.2f\n", @$_ for @$averages;

    __END__
    output:
    1 morning       120.33
    2 afternoon     117.00
    3 evening       171.00
     _  _ _  _  
    (_|| | |(_|><
     _|   
    

          Something like this?

      Something along those lines was one of the ideas I was playing around with but with finer blocks of time.

      select hg from readings where time > "04:00" and time < "04:30";
      . . . and so on
      and then averaging the time slots. Where I have no values in a given time slot I could do linear interpolation. Abigail makes a very valid point that I am still trying to wrap my head around how to implement: the theory (fact?) that there may be a sinusoidal relationship to consider as well.
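
      As a rough variation on gmax's query above, grouping by half-hour time-of-day slot instead of by day part (it assumes the same "averages" table with columns d and val):

      my $query = qq{
          SELECT substr(d, 12, 2) || ':' ||
                 CASE WHEN CAST(substr(d, 15, 2) AS INTEGER) < 30
                      THEN '00' ELSE '30' END AS slot,
                 AVG(val)
            FROM averages
           GROUP BY slot
           ORDER BY slot
      };
      my $slots = $dbh->selectall_arrayref($query) or die $DBI::errstr;
      printf "%-6s %6.2f\n", @$_ for @$slots;

      Slots with no samples simply don't show up in the output, which is where the interpolation step would come in.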

      I am beginning to think that I have a good basis for a posting to CUFP when I get this working.


      Peter L. Berghold -- Unix Professional
      Peter at Berghold dot Net
         Dog trainer, dog agility exhibitor, brewer of fine Belgian style ales. Happiness is a warm, tired, contented dog curled up at your side and a good Belgian ale in your chalice.
Re: Data averages by time of day
by kappa (Chaplain) on Feb 26, 2004 at 19:18 UTC
    Interpolation probably isn't relevant in this case. The best-fit curve (or approximation) which Abigail mentions is. This is why: we want averages (read: good, useful values) at arbitrary Xs, while interpolation produces a function which will exactly evaluate to those Ys at those Xs we used as data for the interpolation. Nothing is said about Xs between the original datapoints; they can be anything. Think of a sine wave which perfectly interpolates the points (0, 0), (pi, 0), ..., (n*pi, 0).

    Approximation minimizes the distance between the curve and each of the datapoints (the data cloud) to produce exactly what you want: a function to calculate a useful value at any point of the X axis.

    Give Math::Approx a try. Like this:

    use Math::Approx;

    # hh:mm => minutes since midnight (helper assumed by this snippet;
    # it is not part of Math::Approx)
    sub mins_from_midnight {
        my ($h, $m) = split /:/, shift;
        return $h * 60 + $m;
    }

    my %data = (
        mins_from_midnight('07:01') =>  82,
        mins_from_midnight('23:48') => 188,
        # ...
    );
    my $approx = Math::Approx->new(undef, 3, %data);
    print $approx->approx(mins_from_midnight('15:40'));

    Update: and yes, the day you took a specific datapoint does not matter. Anyway, you want an average, and that means all the days are the same for your task. That's ok.

      I love it! :-)


      Peter L. Berghold -- Unix Professional
      Peter at Berghold dot Net
         Dog trainer, dog agility exhibitor, brewer of fine Belgian style ales. Happiness is a warm, tired, contented dog curled up at your side and a good Belgian ale in your chalice.
      To get at the formula that results from the approximation, try this: (The normal Math::Approx methods also work on Math::Approx::Symbolic objects.)
      use Math::Approx::Symbolic;

      my %data = (
          (7*60+1)   =>  82,
          (23*60+48) => 188,
      );
      my $approx = Math::Approx::Symbolic->new(undef, 3, %data);
      my $sym    = $approx->symbolic();
      print "$sym\n";

      my $prettier = $sym->to_latex();
      print "$prettier\n";

      # Computing y-values:
      print $sym->value(x => time_of_day()), "\n";

      # Doing that faster:
      use Math::Symbolic::Compiler;
      my ($closure) = $sym->to_sub();
      print $closure->(time_of_day()), "\n";

      # Way faster:
      use Math::Symbolic::Custom::CCompiler;
      my $inlined_c = $sym->to_compiled_c();
      print $inlined_c->(time_of_day()), "\n";
Re: Data averages by time of day
by TomDLux (Vicar) on Feb 26, 2004 at 15:57 UTC

    This is a standard problem in real-life time-series data collection. In the classroom, data are collected at T = 0, 1, 2, ...; in real life, the collection times have a certain amount of jitter, and some data values may be missing altogether. If you're serious about the correct solution, talk to a mathematician or experimental scientist, or check out web sites dealing with math & stats. ( I would start with Wolfram and work out from there. )

    If your interest is less rigorous, I see two alternatives:

    1. Collect data over a week or two, to get a representative sample of the time slots when you check your glucose. On any one day you only collect three data samples out of, say, seven typical times, so the other four slots are empty.
    2. Interpolate from the actual values to determine the theoretical values at the theoretical times.
    3. Ask your doctor / give your doctor the raw data.
    4. Some of the glucometers give the impression they don't require a lancet. Get one of those and an alarm clock, and take your glucose reading at the correct time, whether you're in a meeting, at lunch, or in the washroom.

    Yeah, I know I have trouble with small integers.

    This reminds me of problems discussed by ... I think it was Douglas R. Hofstadter: If "when I wake up" is either 4:30 or 6:00, and "lunchtime" is either 12:00 or 2:45, which set does 10:15 belong to, and what's the equivalent time in the other?

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

Re: Data averages by time of day
by Fletch (Bishop) on Feb 26, 2004 at 15:03 UTC

    Not that it's got anything to do with perl directly . . . :) Check out RRDtool which handles data just like this.

          Check out RRDtool which handles data just like this.

      Good thought, but not really what I'm after. I do plan on this, but it is just a subset of a bunch of reports that I am going to generate to give to a doctor.

      UPDATE:

      The other half of the problem, which is out of scope for this thread, is the fact that the Perl script I am writing also extracts that data from the Palm Pilot I use to record my readings.


      Peter L. Berghold -- Unix Professional
      Peter at Berghold dot Net
         Dog trainer, dog agility exhibitor, brewer of fine Belgian style ales. Happiness is a warm, tired, contented dog curled up at your side and a good Belgian ale in your chalice.
        So are you looking to group the day into "day parts" of the same size each day (say morning, lunch, afternoon and evening) and then group tests by which part they fall in? Or more like, given 3-4 tests as points throughout the day, put curved lines between the points so you can "guess" what the glucose reading was at some point between your tests? I would go with the day-part method, only because whatever you do to try and guess what the reading would have been at a given point, I suspect it will be wrong because of factors like when/what you ate.
        daN.
Re: Data averages by time of day
by RandomWalk (Beadle) on Feb 26, 2004 at 15:47 UTC
    Peter

    Do you have PDL installed? Convert your times to scalars (minutes, say). Then every day you have two piddles, one of times and one of readings. Use PDL's interpolation (linear interpolation is probably all you want) to put all days' readings onto a common grid.
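
    A minimal sketch along those lines, using interpol from PDL::Primitive for the linear interpolation (the grid spacing and variable names are just for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use PDL;

    # One day's samples: times in minutes since midnight, and readings.
    my $t = pdl(421, 698, 1368);     # 07:01, 11:38, 22:48
    my $v = pdl(82, 172, 154);

    # Common half-hourly grid from 04:30 to 23:30, in minutes.
    my $grid = sequence(39) * 30 + 270;

    # Linear interpolation of the readings onto the grid.
    my $on_grid = interpol($grid, $t, $v);
    print $on_grid, "\n";

    Doing this for each day and then averaging across days, grid point by grid point, gives the time-of-day averages.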

Re: Data averages by time of day
by dragonchild (Archbishop) on Feb 26, 2004 at 15:35 UTC
    It looks like you eventually want to get a smooth curve for the hours of the day, given the data from a group of days. I'd look at some basic statistical analysis. (No, I have no suggestion which ones - sorry.)

    ------
    We are the carpenters and bricklayers of the Information Age.

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

          It looks like you eventually want to get a smooth curve for the hours of the day, given the data from a group of days.

      Exactly. And statistics was my mathematical weak area when I was in school. I was great with math other than that, but with statistics I seemed to have trouble figuring out which algorithm to apply to which problem.


      Peter L. Berghold -- Unix Professional
      Peter at Berghold dot Net
         Dog trainer, dog agility exhibitor, brewer of fine Belgian style ales. Happiness is a warm, tired, contented dog curled up at your side and a good Belgian ale in your chalice.
Re: Data averages by time of day
by NetWallah (Canon) on Feb 26, 2004 at 19:16 UTC
    I'm surprised no one has mentioned using MRTG to graph this.

    The Multi Router Traffic Grapher (MRTG) is a tool to monitor the traffic load on network-links. MRTG generates HTML pages containing graphical images which provide a LIVE visual representation of this traffic.

    Although it talks about "network traffic", it has been used to monitor lots of other things, like weather, surf, temperature...

    MRTG optionally can work with rrdtool or use its own file-based storage.

    While it does not directly give averages, it provides a great graphic display, and is very useful for observing data over time.

    Yes - it is written in perl. Yes, it is free. Yes, it is good!

    "Experience is a wonderful thing. It enables you to recognize a mistake when you make it again."
Re: Data averages by time of day
by signal9 (Pilgrim) on Feb 26, 2004 at 16:19 UTC

    It looks like the problem is not one of setting stats against clock hours, but against hours within "your day", so the daily interval will be of varying length and will periodically shift against the backdrop of the clock. It's a stretch, but perhaps one of the AI::Fuzzy* modules could help. If breakfast could fall anywhere from 4:30am to 7:00am, then rather than encompass that span of time in a fixed segment, create a 'breakfast' time segment with fuzzy edges. That way, rather than asking of your data, "What is my glucose level at 6:00am?", you can ask, "What is my glucose level in the morning?"

    Just a thought, and not thought out thoroughly.
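
    As a plain-Perl illustration of that fuzzy-segment idea (it does not use the AI::Fuzzy API; the segment boundaries and sub name are invented), a triangular membership function for a 'breakfast' segment could look like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Triangular fuzzy membership for a "breakfast" segment with soft edges:
    # 0 before 04:00, rising to 1 at 06:00, back to 0 by 08:00.
    # Times are minutes since midnight; the boundaries are invented.
    sub breakfast_membership {
        my $t = shift;
        my ($lo, $peak, $hi) = (4 * 60, 6 * 60, 8 * 60);
        return 0 if $t <= $lo or $t >= $hi;
        return ($t - $lo) / ($peak - $lo) if $t < $peak;
        return ($hi - $t) / ($hi - $peak);
    }

    # Weighted "morning" average: each reading counts according to
    # how strongly its time belongs to the breakfast segment.
    my @readings = ( [ 5*60 + 38, 107 ], [ 7*60 + 1, 82 ], [ 11*60 + 38, 172 ] );
    my ($sum, $weight) = (0, 0);
    for my $r (@readings) {
        my $w = breakfast_membership($r->[0]);
        $sum    += $w * $r->[1];
        $weight += $w;
    }
    printf "Breakfast-time average: %.1f\n", $sum / $weight if $weight;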

Re: Data averages by time of day
by diskcrash (Hermit) on Feb 26, 2004 at 23:43 UTC
    Dear Blue,

    There are at least three separate issues to consider:

    1. That blood glucose level is a function of food intake and exercise and that you probably can't easily take readings that reflect all of the effects, in a timely way. So the flow of data is probably undersampled.

    2. That an average value is only one of several relatively easy metrics of interest. You can compute a 3, 4 or 5 point rolling average (when a new data point comes in, drop the earliest one from the average and move the data window along). You might also track the minimum, maximum or the median (the "middle point") during a 24 hour day; see the sketch after this list.

    3. Look at the mission. You track glucose to determine if your treatment is right. Is the right amount of the right medication, combined with the right diet and exercise controlling your blood glucose level? Tracking daily maxima and morning fasting levels might tell you a lot about how well you are controlled. Imagine also a "glucose control index" that you might invent. For example, (morning fasting level / maximum during any point in the day) and plot or track this ratio to see how your body is responding to diet and meds. Talk to your doc about what they feel are the most important attributes about glucose tracking. Some meds are really "peaky" where others provide more gradual control. These are the kind of features that can tell you about the underlying mechanism.
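
    A minimal sketch of the rolling average and the daily min/max/median mentioned in point 2 (the window size and sample values are just for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::Util qw(min max sum);

    # One day's readings in time order (values invented for illustration).
    my @readings = (107, 117, 188, 154, 82, 172);

    # 5-point rolling average: drop the oldest value as each new one arrives.
    my $window = 5;
    for my $i ($window - 1 .. $#readings) {
        my @w = @readings[ $i - $window + 1 .. $i ];
        printf "rolling average ending at sample %d: %.1f\n", $i, sum(@w) / $window;
    }

    # Daily minimum, maximum and median.
    my @sorted = sort { $a <=> $b } @readings;
    my $median = @sorted % 2
               ? $sorted[ $#sorted / 2 ]
               : ( $sorted[ @sorted / 2 - 1 ] + $sorted[ @sorted / 2 ] ) / 2;
    printf "min %d, max %d, median %.1f\n", min(@readings), max(@readings), $median;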

    Best of luck to you and keep up the fight.

    Diskcrash

Re: Data averages by time of day
by johndageek (Hermit) on Feb 26, 2004 at 20:02 UTC
    Non-perlish question:How long have you played the glucose monitoring game?

    I am asking because most glucose monitors allow a doctor's office to grab the information right from the monitor and print out a nice set of graphs for the M.D.

    Assuming the doc has done an A1C, and now wants more detailed information, he may not need a perl solution.

    Bet you can't guess why I am posting this. Yup, I played with the numbers and made a cool presentation and so on, then the doc plugged the monitor in, printed his graphs and tada! My app was obsolete. Still a fun problem though.

    The information that might add a lot to the readings is time since medication and time since eating and time since exercising.

    Good luck!

    Enjoy!
    Dageek

          Non-perlish question:How long have you played the glucose monitoring game?

      For about 15 years now. My experiences have been very mixed on this score vis-a-vis the doctor. None of the doctors I've visited had the technology to plug my meter in (One Touch Profile) and download the results.

      My current doctor (soon to be my ex-doctor) didn't even look at the reports I generated for him. I am in the throes of going to a new one, but I want to go in "armed and dangerous."


      Peter L. Berghold -- Unix Professional
      Peter at Berghold dot Net
         Dog trainer, dog agility exhibitor, brewer of fine Belgian style ales. Happiness is a warm, tired, contented dog curled up at your side and a good Belgian ale in your chalice.
        Hi,
        I've been a diabetic for 20 years but only decided to start serious monitoring 3 years ago. I've been testing my blood 20-25 times a day for 3 years. Over that time I've switched from meter to meter and I'm now using the One Touch Ultra. It has some statistical analysis on it, but it's still poor (in my opinion). A few years ago I started with an Excel spreadsheet and it has evolved to a point where I find it to be very effective. It tracks a 24 hour period and has a graph. The user enters the time of the test and the reading. The sheet will calculate the velocity, acceleration and the jerk (the rate of change of acceleration) and has things like the day's average, mean and standard deviation. I've found that my doctor isn't interested in these numbers, but he does like the 7, 14, 30, 60 and 90 day averages that my meter produces. Anyway, if anyone is interested in my spreadsheet, I'll send it...
        Cheers,
        Trevor Keppel-Jones.
        tkjones@jetnet.ca
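
        As a rough sketch of those successive differences (velocity, acceleration, jerk), not Trevor's spreadsheet logic, assuming readings come as [minutes-since-midnight, value] pairs:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Successive finite differences of timestamped readings:
        # the first difference approximates velocity, the second acceleration,
        # the third jerk. Input: [minutes, value] pairs in time order.
        sub differences {
            my @series = @_;
            my @out;
            for my $i (1 .. $#series) {
                my $dt = $series[$i][0] - $series[$i - 1][0];
                my $dv = $series[$i][1] - $series[$i - 1][1];
                # place the rate at the midpoint of the interval
                push @out, [ ($series[$i][0] + $series[$i - 1][0]) / 2, $dv / $dt ];
            }
            return @out;
        }

        my @readings     = ( [ 338, 107 ], [ 800, 117 ], [ 1428, 188 ] );
        my @velocity     = differences(@readings);
        my @acceleration = differences(@velocity);
        printf "velocity at t=%.0f: %+.3f mg/dL per minute\n", @$_ for @velocity;
        printf "acceleration at t=%.0f: %+.5f\n", @$_ for @acceleration;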
Re: Data averages by time of day
by Hutch (Acolyte) on Feb 27, 2004 at 03:18 UTC
    Hey Blue, I was reading the posts on your problem and I would probably tackle it the same way. As an academic exercise, regardless of whether the doctor finds it useful or not, the first thing you'd need to do is set ranges for the time schedules. Larger ranges (say 4am to 5am) would be less specific, but tighter ranges (4:00am-4:05am) would probably not gather as many time points as you would want to get an average.

    If you want an easy way to do this, I would probably set up a small Perl script to write an interface file from form data. I'd use a dropdown box to let them pick the time ranges, and have each range assigned a numerical value that corresponds to the line in the array where the data is stored (i.e., line 1 is 4-4:30, line 2 is 4:30-5, etc.). The Perl script would need to parse the data from the form, read the current data from the file into an array, dump the new data from your form into the array, and rewrite the file. You would then have a file with all of the data, categorized by timepoint. The last step is to create a script that reads the array and averages the values stored in it (a rough sketch follows). Hope that helps. - Hutch
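
    A minimal sketch of that last averaging step, assuming a hypothetical data file with one "slot_index,reading" pair per line (the file name and format are invented):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Average readings per time slot, where slot 1 is 4:00-4:30,
    # slot 2 is 4:30-5:00, and so on. The file name is hypothetical.
    my %by_slot;
    open my $fh, '<', 'glucose_slots.txt' or die "glucose_slots.txt: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($slot, $reading) = split /,/, $line;
        push @{ $by_slot{$slot} }, $reading;
    }
    close $fh;

    for my $slot (sort { $a <=> $b } keys %by_slot) {
        my @vals = @{ $by_slot{$slot} };
        my $avg  = 0;
        $avg += $_ for @vals;
        $avg /= @vals;
        printf "slot %2d: average %.1f over %d readings\n", $slot, $avg, scalar @vals;
    }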