http://qs321.pair.com?node_id=289250

waxmop has asked for the wisdom of the Perl Monks concerning the following question:

I need to parse this page and create a hash using the codes as keys and the descriptions as values.

This is a section showing what the page looks like:

Total index B50001 Crude processing (capacity) B5610C Primary & semifinished processing (capacity) B562A3C Finished processing (capacity) B5640C Manufacturing ("SIC") B00004 Manufacturing (NAICS) GMF Durable manufacturing (NAICS) GMFD Wood product G321 + 321 Nonmetallic mineral product G327 + 327 Primary metal G331 + 331 Iron and steel products G3311A2 + 3311,2 Fabricated metal product G332 + 332 Machinery G333 + 333

I want to build a hash that would work like this:

my $code = "GMF"; print "$code: $description_hash{$code}.\n";
That should print:
GMF: Manufacturing (NAICS).

All preceding and trailing whitespace needs to be removed from the description.

I've never been expert with regular expressions, so I'd love to see how the really smart people that hang out on this site would build that hash. Thanks in advance!

Replies are listed 'Best First'.
Re: Help me write a good reg-exp for this text
by hardburn (Abbot) on Sep 05, 2003 at 15:59 UTC

    Is the data of fixed width? If so, you're better off with unpack:

    # @LIST_OF_ENTRIES contains your data, one line per element my %description_hash; foreach my $entry (@LIST_OF_ENTRIES) { # Replace 20 with the number of characters in the value portion my ($value, $name) = unpack('A20 A*', $entry); $value =~ s/\A\s*//; $value =~ s/\s*\z//; $description_hash($name} = $value; }

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    Note: All code is untested, unless otherwise stated

      Hi - Thanks for the help. I don't understand the \A in s/\A\s*// however. Can you explain it to me?
        $ perldoc perlre
        The \A anchors the search at the beginning of the string. A ^ anchors a search at the beginning of the LINE in a string. Documentation is a wonderful thing.

        --
        [ e d @ h a l l e y . c c ]

Re: Help me write a good reg-exp for this text
by benn (Vicar) on Sep 05, 2003 at 16:07 UTC
    As this looks like fixed-length formatting, I wouldn't go with a regex at all, but with...
    my ($desc,$code) = unpack("A60 A*",$line);#or however long 1st field i +s. $description_hash{$code}=$desc;
    Cheers, Ben. Update: as would hardburn :)
Re: Help me write a good reg-exp for this text
by BrowserUk (Patriarch) on Sep 05, 2003 at 16:35 UTC

    This does the trick.

    #! perl -slw use strict; my %desc; m[^\s*(.*?)\s+([A-Z0-9]+)\s+(?:[0-9,]+\s*)?$] and $desc{ $2 } = $1 while <DATA>; my $code = 'GMF'; print "$code : $desc{ $code }"; __DATA__ Total index B50001 Crude processing (capacity) B5610C Primary & semifinished processing (capacity) B562A3C Finished processing (capacity) B5640C Manufacturing ("SIC") B00004 Manufacturing (NAICS) GMF Durable manufacturing (NAICS) GMFD Wood product G321 + 321 Nonmetallic mineral product G327 + 327 Primary metal G331 + 331 Iron and steel products G3311A2 + 3311,2 Fabricated metal product G332 + 332 Machinery G333 + 333

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: Help me write a good reg-exp for this text
by broquaint (Abbot) on Sep 05, 2003 at 16:10 UTC
    If you don't mind loosing the numbers at the end of some of the fields (or at least storing them elsewhere) you could just use a split
    use strict; my $str = <<TXT; Total index B50001 Crude processing (capacity) B5610C Primary & semifinished processing (capacity) B562A3C Finished processing (capacity) B5640C Manufacturing ("SIC") B00004 Manufacturing (NAICS) GMF Durable manufacturing (NAICS) GMFD Wood product G321 + 321 Nonmetallic mineral product G327 + 327 Primary metal G331 + 331 Iron and steel products G3311A2 + 3311,2 Fabricated metal product G332 + 332 Machinery G333 + 333 TXT my(%hash, %numbers); for(split "\n" => $str) { my @fields = reverse split; $numbers{$fields[1]} = shift @fields if $fields[0] =~ /\d(?:,\d+)?/; $hash{$fields[0]} = join ' ' => reverse @fields[1 .. $#fields]; } my $code = 'GMF'; print "$code: $hash{$code}.\n"; __output__ GMF: Manufacturing (NAICS).
    So that should give you the hash you want.
    HTH

    _________
    broquaint

      That doesn't work. Many of the text descriptions have spaces in them, plus there's spaces at the beginning of most of the lines.

      As this is fixed length, use unpack. Now, you're going to have to also use some logic if you care about the indenting stuff to make sure that you don't keep that whitespace at the beginning of your description. If you don't, it's easy enough to strip off the indenting whitespace. To get what you exactly wanted, do something like:

      # Change these to the actual column widths. Use a star at the end to g +et the rest. my @column_widths = (###, ###, '*'); my $unpack_spec = join ' ', map { "A$_" } @column_widths; my %codes; while (<IN_FILE>) { chomp; my ($desc, $code, $other_thingy) = unpack $unpack_spec, $_; # If you want to remove the pre-pended whitespace on the descripti +on ... $desc =~ s/^\s+//; $codes{$code} = { Description => $desc, Other_Thing => $other_thingy, }; } my $choice = 'GMT'; print "$choice: $codes{$choice}{Description}\n";

      ------
      We are the carpenters and bricklayers of the Information Age.

      The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Help me write a good reg-exp for this text
by shenme (Priest) on Sep 05, 2003 at 16:44 UTC
    The example key you give ('GMF') to show what you want to key on is perhaps too simple? What do you want to use for one of your more complicated lines like: (some spaces removed)
        Iron and steel products       G3311A2     3311,2

    What would you be wanting to use as the key? I might imagine at least three possibilities:

    • everything in those fixed columns, thus 'G3311A2     3311,2' including the spaces between the strings,
    • just one part of all that, such as just 'G3311A2',
    • both parts individually as alternate keys 'G3311A2' and '3311,2'

    dragonchild uses $other_thingy to capture the '3311,2' separately. But what really should be done with that part?

      You're right; I was ambiguous. For this line:

      Iron and steel products G3311A2 3311,2
      The key should be 'G3311A2' and the value should be 'Iron and steel products'. The '3311,2' information is not needed by me.
        So if the data format really _is_ fixed-width columns then something like dragonchild's code would work, using
        my @column_widths = (57, 17, '*');
        for the widths (check against the real column widths).   Although to remove the leading _and_ trailing spaces from each piece I'd do something like:
        my ($desc, $code, $other_thingy) = unpack $unpack_spec, $_; foreach my $piece ($desc, $code, $other_thingy) { $piece =~ s/^\s+//; $piece =~ s/\s+$//; }
        (I think that's right, hmmm, testing with dragonchild's modified code ....)
        # Change these to the actual column widths. Use a star at the end to g +et the rest. my @column_widths = ( 57, 17, '*'); my $unpack_spec = join ' ', map { "A$_" } @column_widths; my %codes; while (<DATA>) { chomp; my ($desc, $code, $other_thingy) = unpack $unpack_spec, $_; foreach my $piece ($desc, $code, $other_thingy) { $piece =~ s/^\s+//; $piece =~ s/\s+$//; } $codes{$code} = { Description => $desc, Other_Thing => $other_thingy, }; } my $choice = 'GMF'; print "$choice: $codes{$choice}{Description}\n"; $choice = 'G3311A2'; print "$choice: $codes{$choice}{Description}\n"; __DATA__ Total index B50001 Crude processing (capacity) B5610C Primary & semifinished processing (capacity) B562A3C Finished processing (capacity) B5640C Manufacturing ("SIC") B00004 Manufacturing (NAICS) GMF Durable manufacturing (NAICS) GMFD Wood product G321 + 321 Nonmetallic mineral product G327 + 327 Primary metal G331 + 331 Iron and steel products G3311A2 + 3311,2 Fabricated metal product G332 + 332 Machinery G333 + 333 _ _ OUTPUT _ _ GMF: Manufacturing (NAICS) G3311A2: Iron and steel products
Re: Help me write a good reg-exp for this text
by graff (Chancellor) on Sep 06, 2003 at 02:45 UTC
    Here's one more (regex only) approach, which no one seems to have tried yet. I'm basing this on the assumption that the second column (which contains your intended hash key) is always separated from the first column by at least three spaces, whereas word separations within the first column are always single spaces:
    #!/usr/bin/perl -w use strict; my %hash; while (<DATA>) { s/^\s+//; # remove leading whitespace if ( /(.*?)\s{3,}(\S+)/ ) { my ($val,$key) = ($1,$2); $hash{$key} = $val; } } print map { "$_ : $hash{$_}\n" } sort keys %hash; __DATA__ Total index B50001 Crude processing (capacity) B5610C Primary & semifinished processing (capacity) B562A3C Finished processing (capacity) B5640C Manufacturing ("SIC") B00004 Manufacturing (NAICS) GMF Durable manufacturing (NAICS) GMFD Wood product G321 + 321 Nonmetallic mineral product G327 + 327 Primary metal G331 + 331 Iron and steel products G3311A2 + 3311,2 Fabricated metal product G332 + 332 Machinery G333 + 333 __OUTPUT__ B00004 : Manufacturing ("SIC") B50001 : Total index B5610C : Crude processing (capacity) B562A3C : Primary & semifinished processing (capacity) B5640C : Finished processing (capacity) G321 : Wood product G327 : Nonmetallic mineral product G331 : Primary metal G3311A2 : Iron and steel products G332 : Fabricated metal product G333 : Machinery GMF : Manufacturing (NAICS) GMFD : Durable manufacturing (NAICS)
    I have to confess, I've been pretty slow to get comfortable with unpack(), myself. It is certainly one of the more difficult functions to grasp (and its description in perfunc is still a bit hard to follow).