Help me write a good reg-exp for this text

waxmop has asked for the wisdom of the Perl Monks concerning the following question:

I need to parse this page and create a hash using the codes as keys and the descriptions as values.

This is a section showing what the page looks like:

Total index                                              B50001
 Crude processing (capacity)                             B5610C
 Primary & semifinished processing (capacity)            B562A3C
 Finished processing (capacity)                          B5640C
 Manufacturing ("SIC")                                   B00004
  Manufacturing (NAICS)                                  GMF
   Durable manufacturing (NAICS)                         GMFD
    Wood product                                         G321         
+    321
    Nonmetallic mineral product                          G327         
+    327
    Primary metal                                        G331         
+    331
     Iron and steel products                             G3311A2      
+    3311,2
    Fabricated metal product                             G332         
+    332
    Machinery                                            G333         
+    333
[download]

I want to build a hash that would work like this:

my $code = "GMF";

print "$code: $description_hash{$code}.\n";
[download]

That should print:

GMF: Manufacturing (NAICS).
[download]

All preceding and trailing whitespace needs to be removed from the description.

I've never been expert with regular expressions, so I'd love to see how the really smart people that hang out on this site would build that hash. Thanks in advance!

Comment on Help me write a good reg-exp for this text Select or Download Code

Replies are listed 'Best First'.
Re: Help me write a good reg-exp for this text by hardburn (Abbot) on Sep 05, 2003 at 15:59 UTC
Is the data of fixed width? If so, you're better off with `unpack`: `# @LIST_OF_ENTRIES contains your data, one line per element my %description_hash; foreach my $entry (@LIST_OF_ENTRIES) { # Replace 20 with the number of characters in the value portion my ($value, $name) = unpack('A20 A', $entry); $value =~ s/\A\s//; $value =~ s/\s\z//; $description_hash($name} = $value; }` [download] ---- I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.* -- Schemer Note: All code is untested, unless otherwise stated	[reply] [d/l] [select]
Care to explain s/\A\s*//? by waxmop (Beadle) on Sep 05, 2003 at 16:59 UTC
Hi - Thanks for the help. I don't understand the \A in s/\A\s*// however. Can you explain it to me?	[reply]
Re: Care to explain s/\A\s*//? by halley (Prior) on Sep 05, 2003 at 17:22 UTC
`$ perldoc perlre` [download] The \A anchors the search at the beginning of the string. A ^ anchors a search at the beginning of the LINE in a string. Documentation is a wonderful thing. -- `[ e d @ h a l l e y . c c ]`	[reply] [d/l]
Re: Re: Care to explain s/\A\s*//? by bart (Canon) on Sep 05, 2003 at 17:46 UTC
Re: Help me write a good reg-exp for this text by benn (Vicar) on Sep 05, 2003 at 16:07 UTC
As this looks like fixed-length formatting, I wouldn't go with a regex at all, but with... `my ($desc,$code) = unpack("A60 A*",$line);#or however long 1st field i +s. $description_hash{$code}=$desc;` [download] Cheers, Ben. Update: as would hardburn :)	[reply] [d/l]
Re: Help me write a good reg-exp for this text by BrowserUk (Patriarch) on Sep 05, 2003 at 16:35 UTC
This does the trick. #! perl -slw use strict; my %desc; m[^\s(.?)\s+([A-Z0-9]+)\s+(?:[0-9,]+\s*)?$] and $desc{ $2 } = $1 while <DATA>; my $code = 'GMF'; print "$code : $desc{ $code }"; __DATA__ Total index B50001 Crude processing (capacity) B5610C Primary & semifinished processing (capacity) B562A3C Finished processing (capacity) B5640C Manufacturing ("SIC") B00004 Manufacturing (NAICS) GMF Durable manufacturing (NAICS) GMFD Wood product G321 + 321 Nonmetallic mineral product G327 + 327 Primary metal G331 + 331 Iron and steel products G3311A2 + 3311,2 Fabricated metal product G332 + 332 Machinery G333 + 333 [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller If I understand your problem, I can solve it! Of course, the same can be said for you.	[reply] [d/l]
Re: Help me write a good reg-exp for this text by broquaint (Abbot) on Sep 05, 2003 at 16:10 UTC
If you don't mind loosing the numbers at the end of some of the fields (or at least storing them elsewhere) you could just use a `split` use strict; my $str = <<TXT; Total index B50001 Crude processing (capacity) B5610C Primary & semifinished processing (capacity) B562A3C Finished processing (capacity) B5640C Manufacturing ("SIC") B00004 Manufacturing (NAICS) GMF Durable manufacturing (NAICS) GMFD Wood product G321 + 321 Nonmetallic mineral product G327 + 327 Primary metal G331 + 331 Iron and steel products G3311A2 + 3311,2 Fabricated metal product G332 + 332 Machinery G333 + 333 TXT my(%hash, %numbers); for(split "\n" => $str) { my @fields = reverse split; $numbers{$fields[1]} = shift @fields if $fields[0] =~ /\d(?:,\d+)?/; $hash{$fields[0]} = join ' ' => reverse @fields[1 .. $#fields]; } my $code = 'GMF'; print "$code: $hash{$code}.\n"; __output__ GMF: Manufacturing (NAICS). [download] So that should give you the hash you want. HTH `_________ broquaint`	[reply] [d/l]
Re2: Help me write a good reg-exp for this text by dragonchild (Archbishop) on Sep 05, 2003 at 16:19 UTC
That doesn't work. Many of the text descriptions have spaces in them, plus there's spaces at the beginning of most of the lines. As this is fixed length, use unpack. Now, you're going to have to also use some logic if you care about the indenting stuff to make sure that you don't keep that whitespace at the beginning of your description. If you don't, it's easy enough to strip off the indenting whitespace. To get what you exactly wanted, do something like: # Change these to the actual column widths. Use a star at the end to g +et the rest. my @column_widths = (###, ###, ''); my $unpack_spec = join ' ', map { "A$_" } @column_widths; my %codes; while (<IN_FILE>) { chomp; my ($desc, $code, $other_thingy) = unpack $unpack_spec, $_; # If you want to remove the pre-pended whitespace on the descripti +on ... $desc =~ s/^\s+//; $codes{$code} = { Description => $desc, Other_Thing => $other_thingy, }; } my $choice = 'GMT'; print "$choice: $codes{$choice}{Description}\n"; [download] ------ We are the carpenters and bricklayers of the Information Age.* The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6 Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.	[reply] [d/l]
Re: Help me write a good reg-exp for this text by shenme (Priest) on Sep 05, 2003 at 16:44 UTC
The example key you give ('GMF') to show what you want to key on is perhaps too simple? What do you want to use for one of your more complicated lines like: (some spaces removed) `Iron and steel products G3311A2 3311,2` What would you be wanting to use as the key? I might imagine at least three possibilities: everything in those fixed columns, thus 'G3311A2 3311,2' including the spaces between the strings, just one part of all that, such as just 'G3311A2', both parts individually as alternate keys 'G3311A2' and '3311,2' dragonchild uses `$other_thingy` to capture the '3311,2' separately. But what really should be done with that part?	[reply] [d/l]
Re: Re: Help me write a good reg-exp for this text by waxmop (Beadle) on Sep 05, 2003 at 17:06 UTC
You're right; I was ambiguous. For this line: `Iron and steel products G3311A2 3311,2` [download] The key should be 'G3311A2' and the value should be 'Iron and steel products'. The '3311,2' information is not needed by me.	[reply] [d/l]
Re: Re: Re: Help me write a good reg-exp for this text by shenme (Priest) on Sep 05, 2003 at 17:32 UTC
So if the data format really _is_ fixed-width columns then something like dragonchild's code would work, using `my @column_widths = (57, 17, '');` [download] for the widths (check against the real column widths). Although to remove the leading _and_ trailing spaces from each piece I'd do something like: `my ($desc, $code, $other_thingy) = unpack $unpack_spec, $_; foreach my $piece ($desc, $code, $other_thingy) { $piece =~ s/^\s+//; $piece =~ s/\s+$//; }` [download] (I think that's right, hmmm, testing with dragonchild's modified code ....) # Change these to the actual column widths. Use a star at the end to g +et the rest. my @column_widths = ( 57, 17, ''); my $unpack_spec = join ' ', map { "A$_" } @column_widths; my %codes; while (<DATA>) { chomp; my ($desc, $code, $other_thingy) = unpack $unpack_spec, $_; foreach my $piece ($desc, $code, $other_thingy) { $piece =~ s/^\s+//; $piece =~ s/\s+$//; } $codes{$code} = { Description => $desc, Other_Thing => $other_thingy, }; } my $choice = 'GMF'; print "$choice: $codes{$choice}{Description}\n"; $choice = 'G3311A2'; print "$choice: $codes{$choice}{Description}\n"; __DATA__ Total index B50001 Crude processing (capacity) B5610C Primary & semifinished processing (capacity) B562A3C Finished processing (capacity) B5640C Manufacturing ("SIC") B00004 Manufacturing (NAICS) GMF Durable manufacturing (NAICS) GMFD Wood product G321 + 321 Nonmetallic mineral product G327 + 327 Primary metal G331 + 331 Iron and steel products G3311A2 + 3311,2 Fabricated metal product G332 + 332 Machinery G333 + 333 _ _ OUTPUT _ _ GMF: Manufacturing (NAICS) G3311A2: Iron and steel products [download]	[reply] [d/l] [select]
Re: Help me write a good reg-exp for this text by graff (Chancellor) on Sep 06, 2003 at 02:45 UTC
Here's one more (regex only) approach, which no one seems to have tried yet. I'm basing this on the assumption that the second column (which contains your intended hash key) is always separated from the first column by at least three spaces, whereas word separations within the first column are always single spaces: #!/usr/bin/perl -w use strict; my %hash; while (<DATA>) { s/^\s+//; # remove leading whitespace if ( /(.*?)\s{3,}(\S+)/ ) { my ($val,$key) = ($1,$2); $hash{$key} = $val; } } print map { "$_ : $hash{$_}\n" } sort keys %hash; __DATA__ Total index B50001 Crude processing (capacity) B5610C Primary & semifinished processing (capacity) B562A3C Finished processing (capacity) B5640C Manufacturing ("SIC") B00004 Manufacturing (NAICS) GMF Durable manufacturing (NAICS) GMFD Wood product G321 + 321 Nonmetallic mineral product G327 + 327 Primary metal G331 + 331 Iron and steel products G3311A2 + 3311,2 Fabricated metal product G332 + 332 Machinery G333 + 333 __OUTPUT__ B00004 : Manufacturing ("SIC") B50001 : Total index B5610C : Crude processing (capacity) B562A3C : Primary & semifinished processing (capacity) B5640C : Finished processing (capacity) G321 : Wood product G327 : Nonmetallic mineral product G331 : Primary metal G3311A2 : Iron and steel products G332 : Fabricated metal product G333 : Machinery GMF : Manufacturing (NAICS) GMFD : Durable manufacturing (NAICS) [download] I have to confess, I've been pretty slow to get comfortable with unpack(), myself. It is certainly one of the more difficult functions to grasp (and its description in perfunc is still a bit hard to follow).	[reply] [d/l]

Back to Seekers of Perl Wisdom