http://qs321.pair.com?node_id=157483

sevrin has asked for the wisdom of the Perl Monks concerning the following question:

I'm new to Perl, and am under the gun to get something figured out. If you'd like to increase your karma, help a novice out.

I need to parse phone numbers from a database. Since there was no input validation in the first place, all kinds of crazy things are in there. To wit:

(555) 555-5555
555.555.5555
555-555-5555
(555)555.5555

Here's where it gets fun:
(555) 555-5555 x.555
555.555.5555 Ext. 555
555-555-5555 ext.555

And so on.

My database has a table for phone numbers which has, among other fields, one for the number (tbl_phone.number), and one for the extension (tbl_phone.extension). It seems like what I want to do to grab this data is to put the first 7 things that are digits into $1, and anything that follows that is a digit into $2. This allows me to stuff $1 into tbl_phone.number using a standardized format, as well as putting $2 into tbl_phone.extension with numerals only.

The trouble is that I do not have much experience with regular expressions, so I am reading the camel and ram books to figure it out.

Thanks in advance. I really appreciate the help.

/Scott

Replies are listed 'Best First'.
Re: Simple Regex
by Juerd (Abbot) on Apr 08, 2002 at 17:12 UTC

    You love "5", don't you? :)

    I'd probably filter all non-alphanumerics, and then split on any string of letters.

    while (<DATA>) {
        chomp;
        tr/A-Za-z0-9//cd;
        my ($number, $extension, $overflow) = split /[A-Za-z]+/;
        if ($overflow) {
            warn "Don't know how to handle number '$_'.\n";
            next;
        }
        print "Number: $number";
        print ", extension: $extension" if defined $extension and length $extension;
        print "\n";
    }
    __DATA__
    (555) 555-5555
    555.555.5555
    555-555-5555
    (555)555.5555
    (555) 555-5555 x.555
    555.555.5555 Ext. 555
    555-555-5555 ext.555

    U28geW91IGNhbiBhbGwgcm90MTMgY
    W5kIHBhY2soKS4gQnV0IGRvIHlvdS
    ByZWNvZ25pc2UgQmFzZTY0IHdoZW4
    geW91IHNlZSBpdD8gIC0tIEp1ZXJk
    

      Good solution, cutting to the chase.

      However, you still have to worry about malformations, such as phone numbers that aren't 7 or 10 digits. Oftentimes, people will want to have the area code somewhere else, too. *shrugs*

      ------
      We are the carpenters and bricklayers of the Information Age.

      Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

        However, you still have to worry about malformations, such as phone numbers that aren't 7 or 10 digits.

        So every database has numbers only within a single country? Not like any database I've ever used. I even thought about not filtering out leading plusses, but didn't do so, because I think this was homework anyway - and there should still be a challenge.

        As for 7 or 10 digits, I had no idea about how other countries have their telephone numbers, and think I should not guess.

        The fix: check length.
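
A minimal sketch of that length check, built on the same tr/split idea as the code earlier in this thread. The name split_phone and the default accepted lengths (7 and 10, i.e. US-style numbers) are assumptions for illustration; pass whatever lengths your data should allow.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (number, extension), or an empty list
# when the value cannot be handled or the number has an unexpected length.
sub split_phone {
    my ($raw, @ok_lengths) = @_;
    @ok_lengths = (7, 10) unless @ok_lengths;    # assumed defaults

    (my $clean = $raw) =~ tr/A-Za-z0-9//cd;      # keep only letters and digits
    my ($number, $extension, $overflow) = split /[A-Za-z]+/, $clean;

    # Nothing usable, or more than two groups of digits.
    return if !defined($number) || (defined($overflow) && length($overflow));

    # The length check itself.
    return unless grep { length($number) == $_ } @ok_lengths;

    return ($number, defined($extension) ? $extension : '');
}
```

Called as split_phone('555.555.5555 Ext. 555'), this should give back the number 5555555555 and the extension 555; a value like '55-5555' comes back as an empty list, so the caller can warn and skip it.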

        U28geW91IGNhbiBhbGwgcm90MTMgY
        W5kIHBhY2soKS4gQnV0IGRvIHlvdS
        ByZWNvZ25pc2UgQmFzZTY0IHdoZW4
        geW91IHNlZSBpdD8gIC0tIEp1ZXJk
        

      You love "5", don't you? :)

      The 555 area code is a well-known area code that appears only in Hollywood movies and other fiction stuff. There is even a web page about that: http://home.earthlink.net/~mthyen/

      So, if you want to "sanitize" a piece of code containing phone numbers (for privacy reasons and to fight spam... er telemarketing), you replace these by phone numbers from the 555-area.

      Yet, maybe Sevrin could have said 555-1234-5678 or 555-2002-0408 :-)

      update

      I have forgotten the following example. In Mac Perl, Power and Ease (published by Prime Time Freeware http://www.ptf.com/), both authors (Vicki Brown and Chris Nandor) give their phone numbers:

      $phone{"Vicki"} = "555-1234";
      $phone{"Chris"} = "555-4321"; 
      
      You can read it on-line at http://ptf.com/macperl/ptf_book/r/MP/120.SS.html#03

      Another update. Sevrin, maybe you could look at some of the modules listed at http://search.cpan.org/search?mode=module&query=phone

Re: Simple Regex
by buckaduck (Chaplain) on Apr 08, 2002 at 17:12 UTC
    I think you mean the first 10 digits are the phone number, not the first 7.

    If you can assume that the first 10 digits are the phone number, and the remaining digits are the extension (which is a big assumption):

    # Get rid of all non-digits
    $number =~ s/\D//g;

    # Break the number into groups of digits
    my $phone     = substr($number, 0, 10);
    my $extension = substr($number, 10);

    buckaduck

Re: Simple Regex
by ilcylic (Scribe) on Apr 08, 2002 at 17:29 UTC
    Another thing you might want to consider is looking in the string for an "x" somewhere and grabbing the characters after it, up to the first space you see, as long as they include at least one digit. Move that whole string (x 534, ext 611, xt9411, etc.) to the back of the overall string, so that the digit string left over after your s/\D//g has the area code and phone number at the beginning.

    If you have ext 433 (505)666-7777, you don't want to just strip the non digit chars and substr the first 10 digits as the phone number.

    Of course, if you know that the phone numbers always have the extension second (because of the way they were put into the database) then you don't have to worry about it.

    Good luck.

    -il cylic
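
A rough sketch of the reordering idea, under a narrower assumption than the description above: it only handles an extension that sits at the very start of the string, introduced by x, xt or ext (optionally followed by a dot). The names extension_last and phone_and_extension are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: if the string starts with an extension marker
# followed by digits, move that chunk to the end of the string.
sub extension_last {
    my ($raw) = @_;
    if ($raw =~ s/^\s*(e?xt?\.?\s*\d+)\s+//i) {
        $raw .= " $1";
    }
    return $raw;
}

# Strip non-digits after reordering, then take the first 10 digits as
# the number and whatever follows as the extension (US-style assumption).
sub phone_and_extension {
    my ($raw) = @_;
    (my $digits = extension_last($raw)) =~ s/\D//g;
    my $phone     = substr($digits, 0, 10);
    my $extension = length($digits) > 10 ? substr($digits, 10) : '';
    return ($phone, $extension);
}
```

With this, 'ext 433 (505)666-7777' should come out as number 5056667777 and extension 433, while strings that already have the extension last pass through untouched.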
Re: Simple Regex
by jwest (Friar) on Apr 08, 2002 at 17:19 UTC
    One way to do it is to eliminate all of your problem characters first. This way, you won't have to think through a more complex RE:
    s/\D//g; /(\d{10})(\d*)/;

    Of course, this assumes that all phone numbers have the correct numbers in all the right places (three digit area code, seven digit number, and possibly an extension). It'll probably be right for a good number of rows, but the only way to be sure is to eyeball the output and compare it.
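
One way to narrow down what needs eyeballing, as a sketch only: run the same strip-and-match over every value and set aside the ones that do not yield 10 digits. The name triage and the hash keys are assumptions for this example, not anything from the original schema.

```perl
use strict;
use warnings;

# Hypothetical helper: returns two array refs, one with parsed rows
# and one with the raw values that need a manual look.
sub triage {
    my (@raw) = @_;
    my (@parsed, @review);
    for my $value (@raw) {
        (my $digits = $value) =~ s/\D//g;          # drop the problem characters
        if ($digits =~ /^(\d{10})(\d*)\z/) {
            push @parsed, { number => $1, extension => $2 };
        }
        else {
            push @review, $value;                  # fewer than 10 digits
        }
    }
    return (\@parsed, \@review);
}
```

Note that this only catches values that are too short; a row with the extension typed first would still parse "successfully" and give the wrong number, so a sample of the parsed rows is worth checking as well.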

    Hope this helps!

    --jwest

    -><- -><- -><- -><- -><-
    All things are Perfect
        To every last Flaw
        And bound in accord
             With Eris's Law
     - HBT; The Book of Advice, 1:7
    
      And then there are the local versions of non-U.S. numbers, and the long distance versions of dialing them... Unless the application is itself going to be involved in dialing, it may be permissible to leave them as is, and ask database users to correct the fields on the fly as they view them.
Re: Simple Regex
by sevrin (Initiate) on Apr 08, 2002 at 18:05 UTC

    Thanks, everyone. Combined, you've given me enough to go on to solve the problem myself, which is the way it should be.

    /Scott
Re: Simple Regex
by mrbbking (Hermit) on Apr 08, 2002 at 18:34 UTC
    I use this to format US phone numbers as 10 straight digits. Once you do this, you can use substr to split it up and insert parens or dots or hyphens or what have you...
    sub format_phone {
        my @out = @_;
        foreach (@out) {
            tr/a-cA-C/2/;  tr/d-fD-F/3/;   # change letters to digits.
            tr/g-iG-I/4/;  tr/j-lJ-L/5/;
            tr/m-oM-O/6/;  tr/p-sP-S/7/;
            tr/t-vT-V/8/;  tr/w-zW-Z/9/;
            s/[^\d]//g;                    # remove non-digits.
            s/^1//g;                       # remove first digit if it's a one.
            $_ = pack( 'A10', $_ );        # Only take the first ten digits.
        }
        return wantarray ? @out : $out[0];
    }
    Since there seems to be no standard place for the 'Q' or the 'Z' on the numeric keypad, I put them where I like them; sequentially. I've seen some phones that put them both on the nine or the zero, for some reason passing understanding. If you want them somewhere else, I won't complain.
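
For the substr step mentioned above, something along these lines should do. It is a sketch that assumes the input is already exactly 10 digits; the name pretty_phone and the (NNN) NNN-NNNN layout are just one choice for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: formats a 10-digit string as (NNN) NNN-NNNN.
# Returns undef (or an empty list) for anything that is not 10 digits.
sub pretty_phone {
    my ($digits) = @_;
    return unless defined $digits && $digits =~ /^\d{10}\z/;
    return sprintf '(%s) %s-%s',
        substr($digits, 0, 3),    # area code
        substr($digits, 3, 3),    # exchange
        substr($digits, 6, 4);    # line number
}
```

So pretty_phone('5555551234') gives (555) 555-1234. Swapping the sprintf template for '%s.%s.%s' or '%s-%s-%s' covers the dotted and hyphenated styles.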