Cleaning Phone Numbers with Regular Expressions

August 30th, 2009 / Code

I was once handed an unnormalized database with virtually no data validation or standardization in place, and had to migrate it to a normalized schema. Regular expressions helped me through the process.

This post deals specifically with phone numbers. The data I was importing had two main problems. First, there was no standard formatting: some numbers were stored as (xxx) xxx-xxxx, some as xxx-xxx-xxxx, some as xxx.xxx.xxxx, and so on. Second, there was no separate field for extensions: they were just tacked on the end with ext., EXT, x, Ex, or some variation. With only 20 numbers or so you could fix them by hand, but you need an automated process to deal with, say, 15,000.

The Function

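function phone_clean($string){
  // Cleans a phone number and splits off any extension
  // Returns array [number,extension]; either element is NULL if not found
  $pattern = '/\D*\(?(\d{3})?\)?\D*(\d{3})\D*(\d{4})\D*(\d{1,8})?/';
  $num = NULL;
  $ext = NULL;
  if (preg_match($pattern, $string, $match))
  {
    // $match[3] is a required group, so it is set whenever the pattern matches
    if ($match[3] !== '')
    {
      if ($match[1] !== '')
      {
        // Area code present: xxx-xxx-xxxx
        $num = $match[1].'-'.$match[2].'-'.$match[3];
      }
      else
      {
        // No area code: xxx-xxxx
        $num = $match[2].'-'.$match[3];
      }
    }
    // preg_match() drops trailing groups that did not participate,
    // so $match[4] may not exist at all when there is no extension
    $ext = isset($match[4]) ? $match[4] : NULL;
  }
  return array($num,$ext);
}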

Sample I/O:

Original String           Captured Number  Extension
(555)123-1234             555-123-1234
(555)123.1234             555-123-1234
555.123.1234              555-123-1234
5551231234                555-123-1234
(555)123-1234 Ext.9876    555-123-1234     9876
555-123-1234x9876         555-123-1234     9876
555.123.1234.9876         555-123-1234     9876
123-1234                  123-1234
123 - 1234 x 9876         123-1234         9876
123 1234x9876             123-1234         9876
ph:555-123-1234 ex:9876   555-123-1234     9876
phone:5551231234          555-123-1234
55512312349876            555-123-1234     9876
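
To reproduce a few of these rows yourself, something like the following should work. It is only a sketch: phone_clean() is restated here in condensed form so the snippet runs on its own, and the $samples array (including the last entry, which contains no phone number) is made up for illustration.

```php
<?php
// Condensed restatement of phone_clean() so this snippet is self-contained.
function phone_clean($string) {
    $pattern = '/\D*\(?(\d{3})?\)?\D*(\d{3})\D*(\d{4})\D*(\d{1,8})?/';
    if (!preg_match($pattern, $string, $match)) {
        return array(NULL, NULL);
    }
    // $match[1] is '' when the optional area code group did not participate.
    $num = ($match[1] !== '')
        ? $match[1].'-'.$match[2].'-'.$match[3]
        : $match[2].'-'.$match[3];
    // Trailing unmatched groups are absent from $match, hence isset().
    $ext = isset($match[4]) ? $match[4] : NULL;
    return array($num, $ext);
}

// Hypothetical sample input; the first four strings come from the table.
$samples = array(
    '(555)123-1234',
    '555.123.1234.9876',
    '123 - 1234 x 9876',
    'ph:555-123-1234 ex:9876',
    'no digits here',
);

foreach ($samples as $raw) {
    list($num, $ext) = phone_clean($raw);
    printf("%-26s %-16s %s\n", $raw, var_export($num, true), var_export($ext, true));
}
```

Strings with no usable digits come back as array(NULL, NULL), which makes them easy to flag for manual review during an import.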

Caveats

These are the rules of the regex pattern:

  1. An area code is not required, and may or may not be enclosed in parentheses
  2. Digit groups may or may not be delimited
  3. If there is no area code, the extension must be delimited somehow
  4. An extension must be 1-8 digits in length
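
Rules 3 and 4 are the ones most likely to bite, so here is a small sketch of what happens when they are broken. The three input strings are invented for the demonstration; the pattern is the one from the function.

```php
<?php
$pattern = '/\D*\(?(\d{3})?\)?\D*(\d{3})\D*(\d{4})\D*(\d{1,8})?/';

// Rule 3 respected: seven digits, then a delimited extension.
preg_match($pattern, '123-1234 x9876', $ok);
// $ok[1] is '' (no area code), $ok[2] '123', $ok[3] '1234', $ok[4] '9876'

// Rule 3 broken: same digits with nothing before the extension.
// The first three digits are taken as an area code and everything shifts.
preg_match($pattern, '12312349876', $bad);
// $bad[1] '123', $bad[2] '123', $bad[3] '4987', $bad[4] '6'

// Rule 4: only the first eight digits of a longer extension are captured.
preg_match($pattern, '555-123-1234 x123456789', $long);
// $long[4] '12345678'

print_r($ok);
print_r($bad);
print_r($long);
```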

Pattern Breakdown

'/\D*\(?(\d{3})?\)?\D*(\d{3})\D*(\d{4})\D*(\d{1,8})?/'

The parenthesized sub-expressions are captured into the $match array in the function above. For example, given the string "(555) 111-2222 Ext. 3333": $match[1] = '555', $match[2] = '111', $match[3] = '2222' and $match[4] = '3333'. The first element of the match array, $match[0], always holds the portion of the input that matched the entire pattern, in this case '(555) 111-2222 Ext. 3333'.
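
One way to see the pieces is to write the same pattern in PCRE's extended (x) mode, where whitespace is ignored and each part can carry a comment. This is just a sketch for reading the pattern, not the form used in the function; the ~ delimiter is my choice here so that the comments are free to contain other punctuation.

```php
<?php
// Same pattern, spread out with the x modifier so each piece is annotated.
$pattern = '~
    \D*          # any leading non-digits, e.g. "ph:"
    \(?          # optional opening parenthesis
    (\d{3})?     # group 1: optional area code
    \)?          # optional closing parenthesis
    \D*          # delimiter
    (\d{3})      # group 2: first three digits of the number
    \D*          # delimiter
    (\d{4})      # group 3: last four digits of the number
    \D*          # delimiter before the extension
    (\d{1,8})?   # group 4: optional extension, 1 to 8 digits
~x';

preg_match($pattern, '(555) 111-2222 Ext. 3333', $match);

// $match[0] is the full matched text; $match[1] to $match[4] are the groups.
print_r($match);
```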

The conditional statements within the function first check whether $match[3] is set; if so, we have a phone number. (That group is required, so it is set whenever the pattern matches at all.) The function then checks whether an area code exists in $match[1] and formats the number accordingly. If $match[4] exists, we have an extension.

Special note about \D*

This permits zero or more characters that are not digits (0-9). It allows things like 'ph:' to prefix the number, just in case. I used it for each delimiter as well, so it catches anything used between digit groupings. I had originally used [-\s.]? as my delimiter check and \s*(e?xt?)?[-\s.]* for the extension delimiter, because I knew that covered all the data I'd be processing. For this post, I changed both to \D* because it covers all the same bases and also allows multi-character delimiters not confined to dash, space or period.
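
To make the difference concrete, here is a rough comparison. The $strict pattern is my reconstruction from the fragments mentioned above, not the exact pattern from the original import: in particular, the non-capturing group (?:...) and the i modifier are assumptions, added so the group numbers line up and EXT/ext both match.

```php
<?php
// Assumed reconstruction of the earlier, stricter pattern.
$strict = '/\(?(\d{3})?\)?[-\s.]?(\d{3})[-\s.]?(\d{4})\s*(?:e?xt?)?[-\s.]*(\d{1,8})?/i';
// The pattern used in this post.
$loose  = '/\D*\(?(\d{3})?\)?\D*(\d{3})\D*(\d{4})\D*(\d{1,8})?/';

// A colon is not in [-\s.], so the strict pattern finds the number
// but stops before the extension.
$input = '555-123-1234 ex:9876';
preg_match($strict, $input, $s);
preg_match($loose, $input, $l);
var_dump(isset($s[4]));   // bool(false)
var_dump($l[4]);          // string(4) "9876"

// A three-character delimiter defeats the single-character [-\s.]? check.
$spaced = '555 - 123 - 1234';
var_dump(preg_match($strict, $spaced));   // int(0)
var_dump(preg_match($loose, $spaced));    // int(1)
```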

This would not be a good regex pattern to use for validating data from an online form. Rather, it does the job on existing data that I already knew were phone numbers and that just needed to be cleaned.
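
A quick sketch of why: the cleaning pattern is unanchored and forgiving, so it will happily pull a "phone number" out of text that is not one. The $validate pattern below is a hypothetical, deliberately simple anchored alternative for ten-digit numbers, shown only for contrast; real form validation would need more thought than this.

```php
<?php
$clean = '/\D*\(?(\d{3})?\)?\D*(\d{3})\D*(\d{4})\D*(\d{1,8})?/';
// Hypothetical anchored pattern: the whole string must be a ten-digit number.
$validate = '/^\(?\d{3}\)?[-\s.]?\d{3}[-\s.]?\d{4}$/';

$junk  = 'order #1234567 shipped';
$phone = '(555) 123-1234';

// The cleaning pattern matches the junk string and captures 123 and 4567.
var_dump(preg_match($clean, $junk, $m));     // int(1)

// The anchored pattern rejects the junk and accepts the real number.
var_dump(preg_match($validate, $junk));      // int(0)
var_dump(preg_match($validate, $phone));     // int(1)
```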