PerlMonks  

Parsing data from a report not meant for machine readability.

by Trizor (Pilgrim)
on May 09, 2007 at 22:35 UTC ( [id://614509]=perlquestion )

Trizor has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks

Hurricane season is upon us, almost a month early, and I got the idea to write a hurricane tracking charter in Perl with GD or something similar.

The problem is that NCEP Forecast Advisories, while somewhat regular, aren't available in a more machine-friendly format.

I haven't started writing a prototype parser for the advisories yet, but right now I'm looking at either one very big regex to pull all the data out of a report at once, or breaking the report into sections and pulling the data out with a series of slightly smaller regexen. If any monks know of a better way, I'd love to learn it, as this regex just feels like it's going to be a big headache.

The NHC is kind enough to provide a how-to-read guide detailing the fields, but it has several areas of ambiguity meant for a human to resolve. Here's to hoping Perl is human after all.

ZCZC MIATCMAT1 ALL
TTAA00 KNHC DDHHMM
SUBTROPICAL STORM ANDREA FORECAST/ADVISORY NUMBER 2
NWS TPC/NATIONAL HURRICANE CENTER MIAMI FL AL012007
2100 UTC WED MAY 09 2007

A TROPICAL STORM WATCH REMAINS IN EFFECT ALONG THE SOUTHEAST COAST OF THE UNITED STATES FROM ALTAMAHA SOUND GEORGIA SOUTHWARD TO FLAGLER BEACH FLORIDA.

A TROPICAL STORM WATCH MEANS THAT TROPICAL STORM CONDITIONS ARE POSSIBLE WITHIN THE WATCH AREA...GENERALLY WITHIN THE NEXT 36 HOURS.

SUBTROPICAL STORM CENTER LOCATED NEAR 30.8N 80.1W AT 09/2100Z
POSITION ACCURATE WITHIN 30 NM

PRESENT MOVEMENT TOWARD THE WEST OR 265 DEGREES AT 4 KT

ESTIMATED MINIMUM CENTRAL PRESSURE 1003 MB
MAX SUSTAINED WINDS 40 KT WITH GUSTS TO 50 KT.
34 KT.......100NE 100SE 0SW 0NW.
12 FT SEAS..120NE 90SE 0SW 120NW.
WINDS AND SEAS VARY GREATLY IN EACH QUADRANT. RADII IN NAUTICAL MILES ARE THE LARGEST RADII EXPECTED ANYWHERE IN THAT QUADRANT.

REPEAT...CENTER LOCATED NEAR 30.8N 80.1W AT 09/2100Z
AT 09/1800Z CENTER WAS LOCATED NEAR 30.9N 80.0W

FORECAST VALID 10/0600Z 30.6N 80.6W
MAX WIND 35 KT...GUSTS 45 KT.
34 KT...100NE 100SE 0SW 60NW.

FORECAST VALID 10/1800Z 30.2N 80.8W
MAX WIND 35 KT...GUSTS 45 KT.
34 KT...100NE 100SE 0SW 60NW.

FORECAST VALID 11/0600Z 29.8N 80.9W
MAX WIND 30 KT...GUSTS 40 KT.

FORECAST VALID 11/1800Z 29.5N 80.9W
MAX WIND 30 KT...GUSTS 40 KT.

FORECAST VALID 12/1800Z 29.5N 80.9W...DISSIPATING
MAX WIND 25 KT...GUSTS 35 KT.

EXTENDED OUTLOOK. NOTE...ERRORS FOR TRACK HAVE AVERAGED NEAR 250 NM ON DAY 4 AND 325 NM ON DAY 5...AND FOR INTENSITY NEAR 20 KT EACH DAY

OUTLOOK VALID 13/1800Z...DISSIPATED

REQUEST FOR 3 HOURLY SHIP REPORTS WITHIN 300 MILES OF 30.8N 80.1W

NEXT ADVISORY AT 10/0300Z

$$
FORECASTER KNABB

NNNN

Replies are listed 'Best First'.
Re: Parsing data from a report not meant for machine readability.
by GrandFather (Saint) on May 09, 2007 at 23:13 UTC

    I'd be inclined to break the parsing task up into chunks according to the sections described in the link. Something along the lines of the following:

    use warnings;
    use strict;

    # The sample advisory text goes between the heredoc markers (elided here).
    # The single-quoted marker stops Perl interpolating the "$$" in advisories.
    my $data = <<'FORCAST';
    FORCAST

    my $forcast = bless {lines => [split /\n/, $data]};
    $forcast->ExtractData ();
    print $forcast->{date};

    sub ExtractData {
        my $self = shift;

        chomp @{$self->{lines}};
        $self->WMOHeader ();
        #...
    }

    sub WMOHeader {
        my $self = shift;
        my @header = splice @{$self->{lines}}, 0, 6;

        # Flush trailing empty lines
        shift @{$self->{lines}}
            while @{$self->{lines}} and $self->{lines}[0] =~ /^\s*$/;

        $self->{WMODoc}   = $header[1];
        $self->{NWSDoc}   = $header[2];
        $self->{advisory} = $header[3];
        $self->{id}       = $header[4];
        $self->{date}     = $header[5];
    }

    sub News {
        my $self = shift;
        #...
    }

    #...

    Prints:

    1500Z TUE SEP 16 2003

    You'd probably want to be a little fussier about matching fields and sub-fields than the sample WMOHeader indicates, but the general structure makes it fairly easy to maintain each section in isolation and, with a suitable test suite, to test the parsing of each section in isolation.
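A minimal sketch of what one "fussier" section parser might look like, so that it can be tested on its own. The sub name `parse_forecast` and the hash keys are hypothetical, and the pattern is based only on the forecast entries in the advisory quoted in the question; other advisories may need a looser match.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one "FORECAST VALID" entry into a hash ref.
# Returns nothing (undef in scalar context) when the text does not look
# like a forecast entry, e.g. "OUTLOOK VALID 13/1800Z...DISSIPATED".
sub parse_forecast {
    my ($text) = @_;

    return unless $text =~ m{
        FORECAST \s+ VALID \s+
        (\d{2}) / (\d{4}) Z \s+      # day of month and HHMM time
        (\d+\.\d) ([NS]) \s+         # latitude
        (\d+\.\d) ([EW])             # longitude
        .*?                          # optional remark such as ...DISSIPATING
        MAX \s+ WIND \s+ (\d+) \s+ KT
        \.\.\. GUSTS \s+ (\d+) \s+ KT
    }xs;

    return {
        day         => $1,
        time        => $2,
        lat         => $4 eq 'N' ? $3 + 0 : -$3,   # south is negative
        lon         => $6 eq 'E' ? $5 + 0 : -$5,   # west is negative
        max_wind_kt => $7,
        gust_kt     => $8,
    };
}

my $entry = parse_forecast(
    "FORECAST VALID 12/1800Z 29.5N 80.9W...DISSIPATING\n"
  . "MAX WIND 25 KT...GUSTS 35 KT.\n"
);
printf "%s/%sZ at %.1f, %.1f: %d kt\n",
    @{$entry}{qw(day time lat lon max_wind_kt)};
```

Because the sub takes a plain string and returns a plain hash ref, each section can be exercised with a handful of sample entries without loading a whole advisory.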


    DWIM is Perl's answer to Gödel
Re: Parsing data from a report not meant for machine readability.
by chrism01 (Friar) on May 09, 2007 at 23:12 UTC
    I'd want to look at a few advisories to see which bits are constant and which bits are variable; then I'd hopefully be able to identify each sub-section by its first line.
    It also appears that each sub-section is separated by blank lines, so that's also a hint. I did something similar once or twice in the past and used a separate subroutine for each sub-section.
    If you ask them nicely, they may even have a format definition doc. It certainly looks like a format that's designed to be readable by humans and programs.
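As a rough sketch of the blank-line idea, assuming the sub-sections really are separated by blank lines: split the report into chunks and label each chunk by its first line. The table `@section_types`, the sub `classify_sections`, and the section names are all hypothetical, and the patterns are guesses based on the single advisory in the question.

```perl
use strict;
use warnings;

# Hypothetical table: a pattern for a chunk's first line, paired with
# the name given to that kind of section. Order matters: first match wins.
my @section_types = (
    [ qr/^ZCZC\b/             => 'header'   ],
    [ qr/^REPEAT\b/           => 'repeat'   ],
    [ qr/CENTER LOCATED NEAR/ => 'position' ],
    [ qr/^FORECAST VALID\b/   => 'forecast' ],
    [ qr/^OUTLOOK VALID\b/    => 'outlook'  ],
);

# Split a report on blank lines and classify each chunk by its first line.
# Returns a list of { type => ..., text => ... } hash refs.
sub classify_sections {
    my ($report) = @_;
    my @sections;

    for my $chunk (split /\n\s*\n/, $report) {
        next unless $chunk =~ /\S/;                  # skip empty chunks
        my ($first_line) = $chunk =~ /^\s*(.*)$/m;   # first non-blank line

        my $type = 'unknown';
        for my $candidate (@section_types) {
            my ($pattern, $name) = @$candidate;
            if ($first_line =~ $pattern) {
                $type = $name;
                last;
            }
        }
        push @sections, { type => $type, text => $chunk };
    }
    return @sections;
}

my $report = "ZCZC MIATCMAT1 ALL\nTTAA00 KNHC DDHHMM\n\n"
           . "FORECAST VALID 10/0600Z 30.6N 80.6W\nMAX WIND 35 KT...GUSTS 45 KT.\n\n"
           . "NEXT ADVISORY AT 10/0300Z\n";
print "$_->{type}\n" for classify_sections($report);
```

Each labelled chunk can then be handed to its own subroutine, and anything that comes back as 'unknown' is a cue to look at more advisories.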
    Cheers
    Chris
Re: Parsing data from a report not meant for machine readability.
by monarch (Priest) on May 10, 2007 at 00:00 UTC
    On another tangent, perhaps you could consider an automated tool that presents a marked-up version of the advisory for a human to peruse and approve or reject before automatic updating.

    Essentially, upon reception of an advisory, a pre-tool attempts to parse as much as possible, and then marks up the original report in varying colours etc to show exactly what the pre-tool understood. If the human approves this, then the data extracted from the pre-tool goes straight into the post-tool (which does the graphing).

    It's a compromise between entirely-human oriented data fetching, and entirely-automated processing. At least in the short term it would give you confidence in the processing quality of your new tool.
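One way the pre-tool's mark-up step could be sketched, using bracketed labels in place of colours so the result stays plain text. The field list `@fields`, the sub `mark_up`, and the label format are assumptions for illustration; a real tool would cover many more fields.

```perl
use strict;
use warnings;

# Hypothetical list of fields the pre-tool understands.
my @fields = (
    [ position => qr/\d+\.\d[NS]\s+\d+\.\d[EW]/            ],
    [ valid_at => qr/\d{2}\/\d{4}Z/                        ],
    [ wind_kt  => qr/MAX (?:SUSTAINED )?WINDS?\s+\d+ KT/   ],
);

# Wrap every recognised field in "[name: text]" so a reviewer can see
# what was understood. Returns the marked-up text and the match count.
sub mark_up {
    my ($text) = @_;
    my $recognized = 0;

    # Build one alternation of named groups so that text which has
    # already been marked up is never scanned a second time.
    my $any = join '|', map { sprintf '(?<%s>%s)', @$_ } @fields;

    $text =~ s{$any}{
        my ($name) = keys %+;      # the one named group that matched
        $recognized++;
        "[$name: $+{$name}]"
    }ge;

    return ($text, $recognized);
}

my ($marked, $count) = mark_up(
    'FORECAST VALID 10/0600Z 30.6N 80.6W MAX WIND 35 KT...GUSTS 45 KT.'
);
print "$marked\n($count fields recognised)\n";
```

Anything left outside the brackets is text the pre-tool did not understand, which is what the reviewer would check before approving.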

      I am not a Perl programmer, but rather a PHP one. I just finished the same task. If you've already figured this out, congrats! If not, here are some hints.

      The first thing to keep in mind is what data you are hoping to extract. The watches and warnings issued for storms are not standardized in this or any of the other standard packages. There is a new form the NHC publishes when watches/warnings are issued. I do not know the header, but they can be found on the NHC's website.

      The other thing is the wind fields and forecasts. Not all wind fields will be present (if a storm has 40 kt winds, they will not issue wind fields for 50 kt and 64 kt). The same goes for the forecast wind fields: not all forecasts will have all the usual wind fields, i.e. even if a storm is expected to have 100 kt winds in 3 days, the 3 day forecast will only list 50 kt wind radii. Make sense?

      Regarding the forecasts, there will not always be all five time periods, for obvious reasons: the storm may make landfall or be expected to dissipate. You should check your forecast line to make sure it contains valid data. If not, just break out of that portion of your script. Other items are optional: the eye diameter, for example. For a complete advisory, I used Katrina (2005), advisory 13. That forecast/advisory lists all the fields with valid data.

      Also, be careful about corrected advisories. You may get an advisory issued just five minutes ago, and then five minutes after you get the info the NHC will correct something. It could be winds, lat, lon, anything vital. If you're running a cron job, keep running it until at least 30 minutes after the scheduled advisory release (0300Z, 0900Z, 1500Z and 2100Z; Z is the same as UTC).

      Other than that, most everything is pretty simple. When I get my site up, I do plan on posting this info as an XML file for other programs/sites to use. Let me know if you're interested. I don't/won't charge for it, but an acknowledgement would be nice!
        I guess I could have left my email (forgot I don't post in these forums!!!). You can write me at tim.trice@gmail.com. The website I'm building is personalhurricanecenter.com.
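Since the 34, 50 and 64 kt wind fields are each optional, one approach is to collect whichever radii lines are present instead of expecting all three. This is a sketch in Perl to match the rest of the thread; the sub name `parse_wind_radii` and the returned structure are hypothetical, and the pattern is modelled on the "34 KT.......100NE 100SE 0SW 0NW." line in the advisory quoted in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: collect whatever wind-radii lines are present.
# Returns a hash ref keyed by wind threshold in knots, for example
# { 34 => { NE => 100, SE => 100, SW => 0, NW => 0 } }.
# Thresholds the advisory does not mention are simply absent.
sub parse_wind_radii {
    my ($text) = @_;
    my %radii;

    while ($text =~ /\b(34|50|64) KT\.+\s*(\d+)NE\s+(\d+)SE\s+(\d+)SW\s+(\d+)NW/g) {
        $radii{$1} = { NE => $2, SE => $3, SW => $4, NW => $5 };
    }
    return \%radii;
}

my $radii = parse_wind_radii(
    "MAX SUSTAINED WINDS 40 KT WITH GUSTS TO 50 KT.\n"
  . "34 KT.......100NE 100SE 0SW 0NW.\n"
  . "12 FT SEAS..120NE 90SE 0SW 120NW.\n"
);
for my $kt (sort { $a <=> $b } keys %$radii) {
    print "$kt kt radii (nm): NE $radii->{$kt}{NE}, SE $radii->{$kt}{SE}, ",
          "SW $radii->{$kt}{SW}, NW $radii->{$kt}{NW}\n";
}
```

Callers then test `exists $radii->{50}` before using a field, which keeps a missing threshold from being mistaken for a parse failure.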
