Cleaning Phone Numbers with Regular Expressions
August 30th, 2009
I’ve been in the position of having to take an unnormalized database that had virtually no data validation or standardization in place, and migrating it to a completely different, normalized schema. I wielded the power of regular expressions (regex) to help me through the process.
This post will deal specifically with phone numbers. The data I was importing had many problems: First, there was no standard formatting—some numbers were stored (xxx) xxx-xxxx, some xxx-xxx-xxxx, some xxx.xxx.xxxx, etc. Second, there wasn’t a separate field for extensions—they were just tacked on the end by either ext., EXT, x, Ex, or some variation. If there were only 20 numbers or so you could just fix them by hand, but you need an automated process to deal with say, 15,000.
The Function
function phone_clean($data){ // Cleans phone numbers and strips extensions // Returns array [number,extension] $pattern = '/\D*\(?(\d{3})?\)?\D*(\d{3})\D*(\d{4})\D*(\d{1,8})?/'; if (preg_match($pattern, $data, $match)){ if ($match[3]){ if ($match[1]){ $num = $match[1].'-'.$match[2].'-'.$match[3]; } else{ $num = $match[2].'-'.$match[3]; } } else{ $num = NULL; } $match[4] ? $ext = $match[4] : $ext = NULL; } else{ $num = NULL; $ext = NULL; } $clean = array($num,$ext); return $clean; }
Sample use, utilizing the db_prep function I wrote about earlier:
$original_string = '(555)555.5555 Ext. 1234' $clean = phone_clean($original_string); // The array $clean will now have the stripped number and extension // If no match, will be NULL $number = db_prep($clean[0]); $ext = db_prep($clean[1]); // $number and $ext are ready to be inserted into DB
Sample I/O:
| Original String | Captured Number | Extension |
|---|---|---|
| (555)123-1234 | 555-123-1234 | |
| (555)123.1234 | 555-123-1234 | |
| 555.123.1234 | 555-123-1234 | |
| 5551231234 | 555-123-1234 | |
| (555)123-1234 Ext.9876 | 555-123-1234 | 9876 |
| 555-123-1234×9876 | 555-123-1234 | 9876 |
| 555.123.1234.9876 | 555-123-1234 | 9876 |
| 123-1234 | 123-1234 | |
| 123 – 1234 x 9876 | 123-1234 | 9876 |
| 123 1234×9876 | 123-1234 | 9876 |
| ph:555-123-1234 ex:9876 | 555-123-1234 | 9876 |
| phone:5551231234 | 555-123-1234 | |
| 55512312349876 | 555-123-1234 | 9876 |
Caveats
These are the rules of the regex pattern:
- An area code is not required, and may or may not be enclosed by parenthesis
- Digit groups may or may not be delimited
- If there is no area code, the extension must be delimited somehow
- An extension must be 1-8 digits in length
Pattern Breakdown
'/\D*\(?(\d{3})?\)?\D*(\d{3})\D*(\d{4})\D*(\d{1,8})?/'
The underlined parts are the parenthesized sub-expressions that are outputted to the $match array in the above function. For example: Given a phone number (555) 111-2222 Ext. 3333; $match[1] = '555', $match[2] = '111', $match[3] = '2222' and $match[4] = '3333'. The first value of a match array, $match[0], is always the original string that matched the pattern, in this case '(555) 111-2222 Ext. 3333'.
The conditional statements within the function first check to see if $match[3] exists, if so—we have a phone number. It then checks to see if an area code exists, $match[1], then formats the number accordingly. If $match[4] exists, we have an extension.
Special note about \D*
This permits zero or more of anything that is not a digit (0-9). This allows for things like ‘ph:’ to prefix the number, just in case. I used this for each delimiter as well, which will catch anything used in between digit groupings. I had originally used [-\s.]? as my delimiter check, and \s*(e?xt?)?[-\s.]* for the extension delimiter because I knew that encompassed all the data I’d be processing. For this post, I changed it to \D* because that covers all the same bases, plus allows for multi-character delimiters not confined to dash, space or period.
This would not be a good regex pattern to use for validating data from an online form. Rather, this does the job on already existing data that I knew were phone numbers, and just needed to be cleaned.
There are many ways to skin a cat… I wrote this regex pattern to fit my own parameters, and it worked perfectly for me. Your own circumstances may require you to change the regex pattern to fit your needs.
—
If you have any questions, comments or suggestions; please comment or email me.
Tweet