Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Geo::ShapeFile memory problem

by rhzhang (Initiate)
on Apr 16, 2017 at 21:09 UTC ( [id://1188065]=perlquestion: print w/replies, xml ) Need Help??

rhzhang has asked for the wisdom of the Perl Monks concerning the following question:

Perl newbie here trying to make maps. I've written a script to extract the US census block information from the TIGER Line Shapefiles (ESRI format) into a plain tab-delimited spreadsheet. I'm on macOS Sierra. Specifically, I used Jason Kohles' Geo::ShapeFile module to read the shapefile, and copy each Block ID and its associated polygon (coordinates) into a two-column, tab-delimited text file.

Here is the code:

#!/usr/bin/perl use strict; use warnings; use Math::Round; use Geo::ShapeFile; use Geo::ShapeFile::Shape; use Geo::ShapeFile::Point; # Declare variables my $blockid; my $bcount; my $totalblocks; my $progress; my $shapefile; # used for loading the TIGER shapefile my %attr; # used to read the dbf file my $polygon; # used to read the shp file my @points; # used to read the shape into an array of points my $coor; # used to read the coordinates of each point my $x; my $y; my $pcount; # counter for the coordinates being written into output my $totalpoints; # total number of coordinates in each polygon # Variables for time keeping my $start = time; my $duration; # Prepare the TIGER spreadsheet. open (TIGER, ">tiger.txt") or die "Failed to open"; print TIGER "BLOCKID\tPolygon\n"; # make a header # Load the TIGER shapefile $shapefile = 0; $shapefile = Geo::ShapeFile->new ("tabblock2010_42_pophu"); # use the +TIGER census block special release (population & housing units only) $totalblocks = $shapefile->shapes(); for (1 .. $totalblocks) { %attr = $shapefile->get_dbf_record($_); $blockid = $attr{BLOCKID10}; $bcount++; print TIGER "$blockid\t\"geometry\":{\"type\":\"Polygon\",\"coordi +nates\":[["; # The polygon column is in the GEOJSON format $polygon = $shapefile->get_shp_record($_); @points = $polygon->points(); $pcount = 0; $totalpoints = scalar (@points); foreach $coor (@points) { $x = $coor->get_x(); $y = $coor->get_y(); print TIGER "[$x,$y]"; $pcount++; if ($pcount != $totalpoints) { # All but the last coordinate s +hould be followed by a comma print TIGER ","; } } print TIGER "]]}\n"; # finish the entry print "Done $bcount out of $totalblocks.\n"; %attr = (); $blockid = 0; $polygon = 0; @points = (); $totalpoints = 0; } close (TIGER); $shapefile = 0; select()->autoflush(1); # Show stopwatch $duration = time - $start; $duration = round ($duration / 60); print "\n\nThis script took $duration minutes\n\n"; exit;

This has been working well for smaller states like Connecticut. But when I run this for a large shapefile, the program accumulates a huge amount of memory, and cannot go to the end of the script. I have a piece at the end of the script that prints a stopwatch (execution time). For example, Pennsylvania has 421,545 census blocks, the script will take about 7 minutes to run, and it will display:

... Done 421543 out of 421545. Done 421544 out of 421545. Done 421545 out of 421545.

And it will get stuck there, never finishing and never displaying the stopwatch. And in my Activity Monitor will say perl is taking 7 GB of memory. But my output file is perfectly fine. Geo::ShapeFile doesn't have a close file command, so I suspect there's some problem with the ShapeFile remaining open and taking up all of the memory.

Replies are listed 'Best First'.
Re: Geo::ShapeFile memory problem
by huck (Prior) on Apr 17, 2017 at 00:22 UTC

    in Re: Geo::ShapeFile memory problem he was on the right track but maybe not as strong as needed. try this

    $shapefile = Geo::ShapeFile->new ("tabblock2010_42_pophu"); $shapefile->caching( shp => 0); $shapefile->caching( dbf => 0); $shapefile->caching( shx => 0); $shapefile->caching( shapes_in_area => 0);
    The dbf data is probably the smallest of them all.

    Ive been playing, that worked for me. Mem usage is nice and flat now

      This works. Thanks!
Re: Geo::ShapeFile memory problem
by swl (Parson) on Apr 17, 2017 at 05:00 UTC

    I've just uploaded a new dev version of Geo::ShapeFile to CPAN which allows users to turn caching off.

    https://metacpan.org/release/SLAFFAN/Geo-ShapeFile-2.63_001

    Once that's installed, change the object creation line to be:

    $shapefile = Geo::ShapeFile->new ("tabblock2010_42_pophu", {no_cache => 1});

    It's an all-or-nothing approach, but should be sufficient for this use case. I might in future try to add a maximum cache size option.

    Other ideas and pull requests welcome.

    Shawn

      Being able to read right from one of these zip files without needing to unzip it to disk first would be a treat, but i realize it maybe a little too much to ask for.

      In that kind of mode requiring me to read it sequentially would be no problem, i'd expect that kind of processing to be kinda common ie

      my $dir0='h:/active/tiger_data'; my $sfips='42'; # pa # make $base->{$blockid10}{color} via # zipbyline reads a zip member sequentially # as in http://search.cpan.org/~phred/Archive-Zip-1.59/lib/Archive/Z +ip.pm#Low-level_member_data_reading # my $sn=$fips2state->{$sfips}.'2010.sf1'; # my $zf=$dir0.'/sf1/'.$sn.'.zip'; # my $mf=$fips2state->{$fips}.'geo2010.sf1'; ; # my $member=zipbyline_start($zf,$mf); # while (my $line=zipbyline_read($member)){ # ... pull out datums AREALAND AREAWATR POP100, create density # } # line # zipbyline_close($member); # sort by density, total POP100, # break into deciles, # assign a decile color to each $base->{$blockid10}{color} my $dir=$dir0.'/shapes'; my $state='tabblock2010_'.$sfips.'_pophu'; my $shapefn=$dir.'/'.$state.'/'.$state; my $imgfn=$dir0.'/gifs/'.$state.'.gif'; # this points to the unzipped dir now, # be nice to just point to $dir.'/'.$state.'.zip' instead my $sf = Geo::ShapeFile->new ($shapefn); $sf->caching(shp => 0); $sf->caching(dbf => 0); $sf->caching(shx => 0); $sf->caching(shapes_in_area => 0); my $x_min=$sf->x_min(); my $x_max=$sf->x_max(); my $y_max=90-$sf->y_min(); # need to invert 90 is top 0 is bot my $y_min=90-$sf->y_max(); my $totalblocks = $sf->shapes(); # $totalblocks=5000; my $xsize=$x_max-$x_min; my $ysize=$y_max-$y_min; my $imgy=5000; my $yscale=$imgy/$ysize; my $pfx=-0.00923452628555483*($sf->y_min)+ 1.15467278754118; # proje +ction factor my $imgx=$yscale*$xsize*$pfx; my $xscale=$imgx/($xsize); sub xproj { return (($_[0])-($x_min))*$xscale; } sub yproj { return ((90-$_[0])-$y_min)*$yscale; } # create a new image my $im = new GD::Image($imgx+1,$imgy+1); for my $si (1 .. $totalblocks) { my %attr = $sf->get_dbf_record($si); my $blockid10 = $attr{BLOCKID10}; my $color=$base->{$blockid10}{color}; unless ($color) {$color=$yellow;} my $polygon = $sf->get_shp_record($si); for my $pi (1 .. $polygon->num_parts) { my $part = $polygon->get_part($pi); my $poly = new GD::Polygon; for my $hash (@$part){ $poly->addPt(xproj($hash->{X}),yproj($hash->{Y})); } my $first=$part->[0]; my $last =$part->[-1]; if ($first->{X} ne $last->{X} || $first->{Y} ne $last->{Y} ) { + $poly->addPt(xproj($first->{X}),yproj($first->{Y})); } $im->filledPolygon($poly,$color); } # pi } # si outlines (); open (my $img,'>',$imgfn); binmode $img;print $img $im->gif;close $i +mg; exit; sub outlines { my $state='tl_2010_'.$sfips.'_county10'; my $shapefn=$dir.'/'.$state.'/'.$state; # this too points at an unzipped dir, # be nice to point at $dir.'/'.$state.'.zip' instead my $sf = Geo::ShapeFile->new ($shapefn); $sf->caching(shp => 0); $sf->caching(dbf => 0); $sf->caching(shx => 0); $sf->caching(shapes_in_area => 0); my $totalblocks = $sf->shapes(); for my $si (1 .. $totalblocks) { my $polygon = $sf->get_shp_record($si); for my $pi (1 .. $polygon->num_parts) { my $part = $polygon->get_part($pi); my $poly = new GD::Polygon; for my $hash (@$part){ $poly->addPt(xproj($hash->{X}),yproj($hash->{Y})); } my $first=$part->[0]; my $last =$part->[-1]; if ($first->{X} ne $last->{X} || $first->{Y} ne $last->{Y} ) { + $poly->addPt(xproj($first->{X}),yproj($first->{Y})); } $im->openPolygon($poly,$black); } # pi } # si }
      PA output at this sendspace link while it lasts
      MO output at this sendspace link while it lasts

        Thanks Huck,

        Direct extraction from a zip file is potentially useful, but in my experience not a common use case. (Although perhaps that's because there are no tools to do so...).

        Maybe there is a module that supports reading from archives as a file handle? Archive::Zip has readFromFileHandle() but that would need special handling in Geo::Shapefile each time data are accessed from file.

        Also, one thing to watch for in any plotting code is holes in the polygons. I don't think the Tiger data have holes, but in the shapefile spec they are implied by vertex order instead of being explicitly flagged. https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf

        Shawn.

      Great. Thank you.
Re: Geo::ShapeFile memory problem
by 1nickt (Canon) on Apr 16, 2017 at 21:22 UTC

    I cannot discern what is eating your memory, but one thing you should do is move your variables declarations into the smallest possible scope. This will remove a lot of cruft from your code which may make it easier to spot the issue.

    Eg:

    my $shapefile = Geo::ShapeFile->new ("tabblock2010_42_pophu"); my $totalblocks = $shapefile->shapes(); for (1 .. $totalblocks) { my %attr = $shapefile->get_dbf_record($_); my $blockid = $attr{BLOCKID10}; ...
    Removing the declarations in the main package, and the statements to clear the variables at the bottom of your loop.

    Hope this helps !


    The way forward always starts with a minimal test.
Re: Geo::ShapeFile memory problem
by Anonymous Monk on Apr 16, 2017 at 21:57 UTC

    Hi,

    I dont see a mention of Scalar::Util::weaken in either Geo::ShapeFile or Tree::R , so thats an obvious bug, as it seems each leaf has a ->{root}, and thats a cycle

    But I didn't check by running code

Re: Geo::ShapeFile memory problem
by Anonymous Monk on Apr 16, 2017 at 22:35 UTC
    I don't know what the previous anonymonk talks about, but Geo::ShapeFile caches data, which probably eats a lot of memory with your 421545 records. Try $shapefile->caching('dbf', 0) (appears to be undocumented). Also, what 1nickt said.

      I've come to the same conclusion, and have opened an issue on github.

      https://github.com/shawnlaffan/Geo-ShapeFile/issues/19

      Test::LeakTrace also shows some leaked scalars (one per shape) that needs further investigation.

      Shawn.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1188065]
Approved by Discipulus
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (5)
As of 2024-04-19 20:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found