phone number parsing refuses to work

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am getting so frustrated. This is a repost of two previous posts. I've tried every example given, with no success: they all pick up numbers that aren't phone numbers and miss most of the ones that are. I tried ALL of them, even mixed and matched with tr/0-9//cd; which I think is a bad idea, because (from what I THINK it does) it strips everything but the digits and runs them together into one huge number. I can't do that because a line can contain more numbers than just the phone number.

Here is some sample data (I have different log files, but here are two so you can see):

FILE1
Residential  MLS #: 2094044  Status: Active-NORMLS  LP: $125,204
SP: $ 
9962 BEVERLY LANE STREETSBORO  OH  44241-   Unit/Lot #:  Area: 1909
 Unit Floor #:  Map Coordinate: P13A2  
Subdivision/Complex: VANTAGE POINT
Photos: Media:  6  Acres: 
1/2 Yr. Tax : 732  County: Portage
Owner/Agent: No  
Parcel ID# (PIN): TBA Year Built: 2003  Lot Dimensions: 18X52  
School District: 6709/Streetsboro City  List Type: ERS  Irregular: N  
High School:  MLS Cross Ref #:      
Sub Property Type: One Family  List Date: 6/20/2003
 MT: 253  
Directions: CORNER FROST RD & ST RT 43 
# Rooms: 4  # Bedrooms: 2  Total Baths: 1.1  Finished SqFt: 1080  
LO #/Name: 2380 / Realty One  (440) 248-2700  Office Web Site: www.realtyone.com
LA #/Name: 417391 / Mark J. Abbott  (440) 975-0537  LA Email: m.abbott@realtyone.com
LA 2 #/Name: /   LA 2 Email:  
SAC: 0  BAC: 2.5  OAC: None  LockBox Desc:  
Compensation Explain:  Fixer Upper: N 
Remarks: WILLIAM THOMAS HOMES VANTAGE PT CLUSTER TOWNHOMES! TWO BEDROOMS,ONE & HALF BATHS,FULL BASEMENT! FIREPLACE! KITCHEN & LAUNDRY APPLIANCES! WOOD RAILINGS! COMMON AREA MAINTENANCE! 56 HILLSIDE & PATIO UNITS, TAXES ESTIMATED, EXTRA WINDOWS! 90% EFFIC FURNACE! PRIVACY FENCE! PATIO.FURNISHED MODEL 9941 BEVERLY
Broker Remarks: COMMISSION PAID ON BASE OF $114,900. CALL LISTING AGENT FOR INFORMATION ON TITLE WORK.

--------------------------------------------------------------------------------
 

Residential  MLS #: 2130518  Status: Active-NORMLS  LP: $125,500
SP: $ 
1244 Meadow Run Copley  OH  44321-   Unit/Lot #: 20  Area: 1820    
 Unit Floor #:  Map Coordinate: S27B3  
Subdivision/Complex: Meadows of Copley
Photos: Media:  1  Acres: 
1/2 Yr. Tax : 9999  County: Summit
Owner/Agent:  
Parcel ID# (PIN): 0 Year Built: 2004  Lot Dimensions:  
School District: 7703/Copley-Fairlawn City  List Type: ERS  Irregular: N
High School: Copley  MLS Cross Ref #:      
Sub Property Type: Condominium  List Date: 2/17/2004
 MT: 11  
Directions: Ridgewood Road to Jacoby Rd. to Copley Rd. east to The Meadows
# Rooms: 5  # Bedrooms: 2  Total Baths: 2.1  Finished SqFt:  
LO #/Name: 2817 / Smythe, Cramer Co.  (330) 836-9300  Office Web Site: www.smythecramer.com
LA #/Name: 302709 / Sheila Eaton  (330) 864-5741  LA Email: sheilaeaton45@aol.com
LA 2 #/Name: /   LA 2 Email:  
SAC: 0  BAC: 2.5  OAC: None  LockBox Desc:  
Compensation Explain:  Fixer Upper: N 
Remarks: Beautiful new constructionin The Meadows of Copley*1st class amenities*448 sq ft finished lower level family rm*Vaulted ceilings*Fully applianced*Spacious master suite*Bright, open and airy*10x10 patio.
Broker Remarks:  

--------------------------------------------------------------------------------
 




FILE2
 Donna I. Stoner, ABR GRI
Bolton-Johnston Associates of Grosse Pointe
Phone 1: (313)884-6400, Email: donnastoner@realtor.com
Buyers, Relocation, Residential, Sellers, Waterfront Property
  Add to Scratch Pad  
 Contact me now  
 Go to my site  

 
 

 DONNA L. GORMLEY
Johnstone & Johnstone
Office: (313) 884-0600, Mobile: (313) 590-9253, Email: johnstone@realestateone.com
buyer's agent, Listing agent, residential properties
  Add to Scratch Pad  
 Contact me now  
 Go to my site

As you can see, some lines MAY have more than one set of numbers, so I need it to be picky and only select things that are phone numbers. Someone suggested http but there is no documentation. It shows how to validate one variable, which I can't get to work, much less how to boil an entire text file down to numbers it'll validate.

This is frustrating me so much because I checked everything I could on Phone Numbers in the super search and nothing helped; they all died in one way or another. Can someone give me a different perspective or show how to use that module? My last attempt was:

#!/usr/bin/perl

use strict;


# change the below line to the file you are reading FROM (your junk file)
my $read_from = "test2.txt";

# Change the below line to where you want your neat phone numbers to be printed
my $save_to   = "saved.txt";


my %seen;
open(FILE, '<', "$read_from") or die "Unable to open file.txt for reading, $!";

while (<FILE>) {
    #s /[\n|\r]//g;
    tr/0-9//cd;

    #print "Testing with $_, result is ";

    m/(1[-| ]?)?\(?(\d{3})\)?[-| ]?(\d{3})[-| ]?(\d{4})/;
    #m|(1-)?\(?(\d{3})\)?-?(\d{3})-(\d{4})|;

    my $areacode = $2;
    my $exchange = $3;
    my $line = $4;

    print "($areacode) $exchange-$line\n";
$seen{"$areacode-$exchange-$line"}++;
}

close(FILE);


open(SAVED, '>', "$save_to") or die "Unable to open $!";
print SAVED "$_\n" for (sort keys %seen);
close(SAVED);

Edited by Chady -- formatting and readmore tags.
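
For reference, a minimal sketch of what tr/0-9//cd does to one line, and why the pattern in the script then reports a number that was never in the file. The sub name digits_only and the sample line are made up for illustration; the pattern is the one from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: run the same tr/// as the
# script above on a copy of the line and return what is left.
sub digits_only {
    my ($line) = @_;
    my $copy = $line;
    $copy =~ tr/0-9//cd;    # c = complement the list, d = delete: every non-digit goes
    return $copy;
}

# Made-up line in the shape of the sample data.
my $line   = 'LO #/Name: 2380 / Some Realty  (555) 555-0100  Office Web Site:';
my $digits = digits_only($line);
print "$digits\n";          # prints 23805555550100

# The pattern from the script now starts matching at the first digit it sees.
if ( $digits =~ /(1[-| ]?)?\(?(\d{3})\)?[-| ]?(\d{3})[-| ]?(\d{4})/ ) {
    print "($2) $3-$4\n";   # prints (238) 055-5555
}
```

The office number 2380 has been glued onto the front of the phone number, so the first ten digits are no longer the phone number. Matching against the untouched line avoids this.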


Replies are listed 'Best First'.
Re: phone number parsing refuses to work by Happy-the-monk (Canon) on Mar 13, 2004 at 23:11 UTC
If I observed correctly, you are looking for numbers in these formats:

(123) 234-3456
(123)234-3456

m/
  (          # start capture
    \(       # open parenthesis
    \d{3}    # 3 digits
    \)       # close parenthesis
    \s?      # 0 or 1 whitespace
    \d{3}    # 3 digits
    \-       # 1 dash
    \d{4}    # 4 digits
  )          # end capture
/xg;

Sören
Re: phone number parsing refuses to work by BrowserUk (Patriarch) on Mar 14, 2004 at 00:32 UTC
Pasting your sample data into the following one-liner (wrapped for posting only) produced the output below. (Season to taste :)

perl -0777pe "
    s[[^0-9() -]+][\n]g;
    s[\s{3,}][\n]g;
    s[---][]g;
    for my$n(1..4){ s[\n.{1,7}\n][\n]msg;}
    print" -
[PASTE SNIPPED]
^Z
(440) 248-2700
(440) 975-0537
(330) 836-9300
(330) 864-5741
(313)884-6400
(313) 884-0600
(313) 590-9253
(440) 248-2700
(440) 975-0537
(330) 836-9300
(330) 864-5741
(313)884-6400
(313) 884-0600
(313) 590-9253

If your data file is very large you would need to drop the slurp (-0777), which would mean that, as-is, the filtering wouldn't be as effective. But the principle of first throwing away as much as possible (safely--replacing with spaces or newlines so that you don't run good data together) is a useful first pass at extracting small pieces from large volumes.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
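
A rough line-at-a-time sketch of the same throw-away-first principle, for files too large to slurp. This is not BrowserUk's one-liner: the sub name candidates, the sample line and the exact length cut-off are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the fragments of one line that are long enough
# to be a phone number and contain only digits, parens, spaces and dashes.
sub candidates {
    my ($line) = @_;
    my $copy = $line;
    # Replace (don't delete) anything that can't be part of a number,
    # so neighbouring fragments can't run together.
    $copy =~ s/[^0-9() -]+/\n/g;
    my @out;
    for my $piece ( split /\n|\s{3,}/, $copy ) {
        $piece =~ s/^\s+//;
        $piece =~ s/\s+$//;
        # Same idea as dropping the .{1,7} fragments in the one-liner.
        push @out, $piece if length($piece) > 7;
    }
    return @out;
}

# Made-up line in the shape of the sample data.
my $line = 'LO #/Name: 2380 / Some Realty  (555) 555-0100  Office Web Site: www.example.com';
print "$_\n" for candidates($line);   # prints (555) 555-0100
```

Whatever survives is only a candidate; a stricter phone-number pattern can then be applied to each fragment.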
Re: phone number parsing refuses to work by etcshadow (Priest) on Mar 14, 2004 at 01:57 UTC
Here ya go:

push(@list, "($1) $2-$3") while /\(?(\d{3})?\)?\s*[-.]*\s*(\d{3})\s*[-.]*\s*(\d{4})/g;

I pasted your big block of text into it and got this out:

() 209-4044
(440) 248-2700
(440) 975-0537
() 213-0518
(330) 836-9300
(330) 864-5741
(313) 884-6400
(313) 884-0600
(313) 590-9253

Good enough?

Update: I should explain a little... The regexp breaks down into, basically: optional parens around optional 3 digits, minimal separating junk, 3 digits, minimal separating junk, 4 digits. It may look ugly, but it's actually quite straightforward, although you could obviously tinker with it a little if you wanted. Oh... and the `while ... /g` means to count it each time it appears on a line (so if multiple phone #'s are on one line, it'll count each one).

For example, I'll turn the "separating garbage" chunk into just `[-.\s]*`, which is more permissive as well as shorter to write out. Still gets the same results on your sample data.

[me@host]$ perl -ne 'push(@list, "($1) $2-$3") while /\(?(\d{3})?\)?[-.\s]*(\d{3})[-.\s]*(\d{4})/g; END{print join("\n",@list)."\n";}' data.txt
() 209-4044
(440) 248-2700
(440) 975-0537
() 213-0518
(330) 836-9300
(330) 864-5741
(313) 884-6400
(313) 884-0600
(313) 590-9253
[me@host]$

------------
:Wq
Not an editor command: Wq
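
A possible variation on the loop above, sketched as a small script: it requires the area code and refuses to match inside a longer run of digits, so seven-digit MLS numbers no longer show up as "() nnn-nnnn". The sub name extract_phones and the sample lines are made up for illustration; the %seen de-duplication follows the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every phone number found in one line,
# normalised to "(aaa) eee-nnnn".
sub extract_phones {
    my ($line) = @_;
    my @found;
    # The area code is mandatory here, and the lookarounds keep the match
    # from starting or ending inside a longer run of digits.
    while ( $line =~ /(?<!\d)\(?(\d{3})\)?[-.\s]*(\d{3})[-.\s]*(\d{4})(?!\d)/g ) {
        push @found, "($1) $2-$3";
    }
    return @found;
}

# Made-up lines in the shape of the sample data.
my @sample = (
    'Residential  MLS #: 1234567  Status: Active  LP: $125,204',
    'Office: (555) 555-0100, Mobile: (555)555-0199, Email: someone@example.com',
    'LO #/Name: 2380 / Some Realty  (555) 555-0100  Office Web Site:',
);

my %seen;
for my $line (@sample) {
    $seen{$_}++ for extract_phones($line);   # several numbers per line are fine
}
print "$_\n" for sort keys %seen;
```

On these three lines it prints (555) 555-0100 and (555) 555-0199 once each. The trade-off is that numbers written without an area code are skipped.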
Re: phone number parsing refuses to work by graff (Chancellor) on Mar 14, 2004 at 00:33 UTC
Those input files are pretty noisy. If all you need to do is extract and print the phone numbers -- that is, if you don't need to associate each phone number with some name and/or address that's next to it in the data -- then it would help to pre-condition the text so as to eliminate all the stuff you know you don't need, and isolate the potential phone numbers to make them easier to pick out.

Perhaps you can take it for granted that a phone number will never be broken up by a line break (a single line contains one or more complete phone numbers, or contains no relevant data at all). You could also take for granted that all phone numbers use a limited set of punctuation patterns. Here is one possible way to handle the preconditioning:

while (<>)   # read one line at a time
{
    s/[a-z;:\@]+//gi;        # these aren't used for numbers
    s/(?<=\d\)) (?=\d)//g;   # remove space in "\d) \d"

    # split the line on whitespace (that's why we got rid of
    # any spaces that might be within a given phone number);
    # for each thing coming out of the split, print it if it
    # looks like a phone number:
    for my $num ( split /\s+/ ) {
        # (?!\d): the last four digits must not be followed by another digit
        next unless ( $num =~ /\D(\d{3})\D(\d{3})-(\d{4})(?!\d)/ );
        print "$1-$2-$3\n";
    }
}

That won't be much use if you do have to preserve information about each phone number along with the number itself -- given the nature of the data, that's a slightly more tricky problem. (But not too tricky... your data is messy, but there are patterns in it that can be used to guide a more intelligent form of data extraction; you use the same sort of approach -- skip or remove things that are not relevant, and use simple patterns to isolate the things that are relevant.)
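
One way the "more intelligent" extraction might look for the FILE1 layout, as a sketch only: it leans on the "LO #/Name:" and "LA #/Name:" labels to keep the name next to its number. The sub name agent_and_phone and the sample lines are made up, and the pattern assumes the name is followed by whitespace and then a parenthesised area code, as in the posted data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: for an "LO #/Name:" or "LA #/Name:" line, return
# (name, phone); return an empty list for anything else.
sub agent_and_phone {
    my ($line) = @_;
    if ( $line =~ m!^L[AO] \s \#/Name: \s* \d+ \s* / \s*   # label and id number
                    (.+?) \s+                              # name, up to the gap
                    \( (\d{3}) \) \s* (\d{3}) - (\d{4})    # (aaa) eee-nnnn
                   !x ) {
        return ( $1, "$2-$3-$4" );
    }
    return;
}

# Made-up lines in the shape of the sample data.
my @sample = (
    'LO #/Name: 2380 / Some Realty  (555) 555-0100  Office Web Site: www.example.com',
    'LA #/Name: 123456 / Jane Q. Example  (555) 555-0199  LA Email: jane@example.com',
    'LA 2 #/Name: /   LA 2 Email:  ',
);

for my $line (@sample) {
    my @pair = agent_and_phone($line);
    next unless @pair;              # skip lines without a name/number pair
    print "$pair[0]: $pair[1]\n";
}
```

The empty "LA 2" line is skipped because it doesn't fit the label pattern. FILE2 uses a different layout and would need its own pattern.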
Re: phone number parsing refuses to work by Anonymous Monk on Mar 13, 2004 at 23:19 UTC
The link disappeared; the module I want to use is Number/Phone/US.pm, but there's no documentation for what I need to do. I need to parse an entire junk file and take all valid numbers OUT of it.
Re: Re: phone number parsing refuses to work by Happy-the-monk (Canon) on Mar 13, 2004 at 23:29 UTC
use strict;
use Number::Phone::US qw(is_valid_number);

my $data = <<'EOF';
all your data goes here
EOF

my @results = ( $data =~ m/
  (          # start capture
    \(       # open parenthesis
    \d{3}    # 3 digits
    \)       # close parenthesis
    \s?      # 0 or 1 whitespace
    \d{3}    # 3 digits
    \-       # 1 dash
    \d{4}    # 4 digits
  )          # end capture
/xg );

foreach ( @results ) {
    print "valid: $_\n" if is_valid_number( $_ );
}

Sören
Re: phone number parsing refuses to work by converter (Priest) on Mar 14, 2004 at 15:54 UTC
This may be "off topic", but are these real data you've posted here? If so, bad form. I don't think the folks whose information appears here would appreciate it at all. I never allow customers' data to escape my network if they include information about specific companies or people, even if it's information that could be found in any phone book. In the future, you should take a few minutes to create dummy data for any examples you want to include in your posts.

converter
Re: phone number parsing refuses to work by mojotoad (Monsignor) on Mar 23, 2004 at 00:35 UTC
This won't necessarily help much with extracting phone numbers from surrounding prose, but the following may give you some ideas for parsing the numbers once you think you have one: "Beast of the Number: Parsing the Feral Phone"

Matt

