PerlMonks  

Re: Pull 3-digit and 4-digit numbers from string

by bitingduck (Chaplain)
on Apr 10, 2015 at 06:07 UTC ( [id://1123012]=note )


in reply to Pull 3-digit and 4-digit numbers from string

When I do stuff like this I like to regularize the data by stripping out the punctuation that makes things more complicated. In most of the US it's not too hard to determine whether something is a phone number: it will generally have 7, 10, or 11 digits (except inside companies' private exchanges and in a few small towns like Volcano Village, HI), plus separators that depend on where the writer is from and what mood they were in when they wrote it. I included a little twist for extensions, which are usually appended as x\d+, with or without a space before the x.

The example below strips out the punctuation around the numbers, then checks the length of the resulting digit runs. If a run is in the 7 to 11 digit range I declare the string to be a phone number; anything else is treated as part of an address.

#!/usr/bin/perl
use strict;
use warnings;
use v5.10;

my @numbers=('(123)456-7890', "222.222.2222", "1-313-345-6798",
             "23-35 Baker St. Apt 6", "666 666 6666", "123-345.5678",
             "45 elm street", "123-345.5678x999", "666 666 6666 x233");

foreach my $number (@numbers){
    #strip phone number punctuation:
    my $address=$number;
    $number =~ s/\(?(\d+)[-(). ](\d|x\d)/$1$2/g;
    if ($number=~m/\d{7,11}/){
        # you could regularize phone number formatting in here
        say $number." Phone number";
    } else {
        say $address." Address";
        # process the number as an address
        $address =~ m/(\d+)/;
        say "address number $1";
    }
}

with output

1234567890 Phone number
2222222222 Phone number
13133456798 Phone number
23-35 Baker St. Apt 6 Address
address number 23
6666666666 Phone number
1233455678 Phone number
45 elm street Address
address number 45
1233455678x999 Phone number
6666666666x233 Phone number
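The comment in the phone-number branch says the formatting could be regularized at that point. Here is a minimal sketch of one way to do it, assuming you want a "(123) 456-7890" style. The helper name format_us_phone and the output layout are my own choices for illustration, not something from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper: takes an already-stripped string such as
# "1233455678x999" and returns one consistent display form.
# Returns undef when the input is not 7, 10, or 11 digits
# with an optional xNNN extension.
sub format_us_phone {
    my ($stripped) = @_;

    # Split the digits from an optional extension.
    my ($digits, $ext) = $stripped =~ /^(\d+)(?:x(\d+))?$/
        or return;

    my $out;
    if ($digits =~ /^(\d{3})(\d{4})$/) {
        $out = "$1-$2";                  # local 7-digit number
    }
    elsif ($digits =~ /^(\d{3})(\d{3})(\d{4})$/) {
        $out = "($1) $2-$3";             # area code + number
    }
    elsif ($digits =~ /^(\d)(\d{3})(\d{3})(\d{4})$/) {
        $out = "$1 ($2) $3-$4";          # leading 1 + area code + number
    }
    else {
        return;                          # some other digit count
    }

    $out .= " x$ext" if defined $ext;
    return $out;
}

for my $stripped (qw(1234567890 13133456798 6666666666x233)) {
    my $pretty = format_us_phone($stripped);
    say defined $pretty ? $pretty : "$stripped: unexpected digit count";
}
```

Calling it on $number inside the phone-number branch would be enough; anything it rejects can fall through to whatever error handling you prefer.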

Note that I got lazy and didn't bother pulling out all the numbers within an address string, and I let those numbers be lengths other than just your 3- and 4-digit runs. I also miss on numbers like 1-(800)-222-2222, but that's just a little more regex tweaking. I don't strip commas, since I don't think I've ever seen commas used to punctuate a US phone number. They might also be your big flag for lists of apartment numbers. If you're dealing with phone numbers in Europe you're probably doomed: they seem to have random numbers of digits over a very large range.
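For the two loose ends above, here is a rough sketch of one possible tweak, not the only way to do it. The substitution is the same one used in the script except for a "+" on the separator class, so that doubled-up separators like "-(" and ")-" go away in one pass. The sub names strip_phone_punct and address_numbers are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;

# Same substitution as in the script above, except that the separator
# class takes a "+", so runs such as "-(" or ")-" are removed together.
sub strip_phone_punct {
    my ($text) = @_;
    $text =~ s/\(?(\d+)[-(). ]+(\d|x\d)/$1$2/g;
    return $text;
}

# Return every run of digits in an address string, in order of appearance.
sub address_numbers {
    my ($address) = @_;
    my @runs = $address =~ /(\d+)/g;
    return @runs;
}

say strip_phone_punct('1-(800)-222-2222');               # 18002222222
say join ',', address_numbers('23-35 Baker St. Apt 6');  # 23,35,6
```

If you only want the 3- and 4-digit runs from the original question, a grep on length over the list returned by address_numbers would narrow it down.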
