Filtering Out the Noise with Regex

July 13th, 2010 / Code

I rely heavily on regular expressions when processing old data. Like my post about cleaning phone numbers, I’ll demonstrate another example of how useful regex can be when dealing with unruly data.

The situation

I’ve got hundreds of products that each have unique attributes, e.g. a laptop would have attributes such as: processor, memory, drives, ports, etc. In the new database schema, the attributes are normalized so that they can be used to sort and filter the products. You’re ready to start importing the data but there’s one big problem—in the old database, attributes are stored as HTML in one big text field.

After some obligatory cursing, you take a close look at what you have to work with:

<strong><span style="FONT-SIZE: 8.5pt; FONT-FAMILY: Arial; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">Processor:</span></strong><span style="FONT-SIZE: 8.5pt; FONT-FAMILY: Arial; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"> 2.4Ghz Pentium IV,<strong> Memory:</strong> 256MB DDR SDRAM Expandable to 1GB, <strong>Display:</strong> 15" TFT Active Matrix, <strong>Video:</strong> ATI Mobility Radeon, 16MB VRAM, <strong>Sound:</strong> 16Bit Sound Blaster Pro, <strong>Battery:</strong> Rechargeable Li-ION 2 Hrs Run, 3 Hrs Charge,<strong> Drives:</strong> 3.5" Diskette, 40GB Hard Disk, 8X DVD, 24X CDRW, <strong>Ports:</strong> PS/2 (1), External VGA (1), S-Video, USB (2), 10/100 LAN, <font></font><strong>Size:</strong> 12.96" X 10.72" x 1.73" (WxDxH), <strong>Weight:</strong>7.69lbs, <strong>Security:</strong> Kensington</span>

Not pretty. It’s got HTML tags strewn about with an assortment of Microsoft “mso-” garbage. We need to take this and turn it into nice key => value associations, like “Processsor” => “2.4Ghz Pentium IV”. You can just tough it out and do some hardcore copy and paste, which would be the easiest option if you only had a few items to go through. But that isn’t practical when dealing with hundreds, plus it is more fun to try and solve the problem with regular expressions.

Plan of attack

In order to use regex to strip out the attributes and their values, you first need to look for patterns in the data. Notice how each attribute is enclosed in <strong> tags? Using that pattern, we can write some pseudo-regex that will help us later on:

Attributes: Always preceded by <strong> and followed by </strong>.
Values: Always preceded by </strong> and followed by <strong> or the end of the data.

Step 1: Filter

Since all we need to get our data are the <strong> tags, the first step is to get rid of all other tags:

$fix = preg_replace('/<(?!strong|\/strong)([^>]*)>/i', '', $original);

This regular expression says to look for the character <, and then any text that is not “strong” or “/strong” until it finds >, and replace it all with '' (deleting it).

Explain regex in detail.

This gets rid of all opening and closing tags that aren’t <strong> or </strong>, and saves that new string to the variable $fix. Our data already looks a lot cleaner:

<strong>Processor:</strong> 2.4Ghz Pentium IV,<strong> Memory:</strong> 256MB DDR SDRAM Expandable to 1GB, <strong>Display:</strong> 15" TFT Active Matrix, <strong>Video:</strong> ATI Mobility Radeon, 16MB VRAM, <strong>Sound:</strong> 16Bit Sound Blaster Pro, <strong>Battery:</strong> Rechargeable Li-ION 2 Hrs Run, 3 Hrs Charge,<strong> Drives:</strong> 3.5" Diskette, 40GB Hard Disk, 8X DVD, 24X CDRW, <strong>Ports:</strong> PS/2 (1), External VGA (1), S-Video, USB (2), 10/100 LAN, <strong>Size:</strong> 12.96" X 10.72" x 1.73" (WxDxH), <strong>Weight:</strong>7.69lbs, <strong>Security:</strong> Kensington

Step 2: Extract

Now we can extract the individual attribute names and their values. First we’ll get the attributes:

preg_match_all('/<strong>([^<]*)<\/strong>/i', $fix, $attributes);

Here we look for the string <strong> and then capture everything that isn’t a <, until we match the string </strong>. This repeats until the end of the data and sends all matches to an array called $attributes.

Explain regex in detail.

Now we’ll pull out the attribute values:

preg_match_all('/<\/strong>([^<]*)(?:<[^>]*>|$)/i', $fix, $values);

This looks for a </strong> tag and captures anything that is not a <, until we match any tag or the end of the string. This repeats until the end of the data and sends all matches to an array called $values.

Explain regex in detail.

We now have our attribute names and their values in seperate arrays, let's see what the array for attribute names looks like:

(
  [0] => Array
    (
      [0] => <strong>Processor:</strong>
      [1] => <strong> Memory:</strong>
      [2] => <strong>Display:</strong>
      [3] => <strong>Video:</strong>
      [4] => <strong>Sound:</strong>
      [5] => <strong>Battery:</strong>
      [6] => <strong> Drives:</strong>
      [7] => <strong>Ports:</strong>
      [8] => <strong>Size:</strong>
      [9] => <strong>Weight:</strong>
      [10] => <strong>Security:</strong>
    )
  [1] => Array
    (
      [0] => Processor:
      [1] =>  Memory:
      [2] => Display:
      [3] => Video:
      [4] => Sound:
      [5] => Battery:
      [6] =>  Drives:
      [7] => Ports:
      [8] => Size:
      [9] => Weight:
      [10] => Security:
    )
)   

You can see that preg_match_all creates a multidimensional array, the first sub-array contains the original strings that matched the regular expression, and the second contains the data we told it to extract. So, $attributes[1] and $values[1] has all the information we want.

Step 3: Clean

Our data is riddled with commas, colons, and spaces at their beginning and ends. This regex can clean it up:

preg_replace('/(^[,:\s]*|[,:\s]*$)/', '', $data);

Explain regex in detail.

Now we have two arrays of clean data, which when combined gives us the data we wanted from that ugly chunk of HTML we started with.

Array
(
  [Processor] => 2.4Ghz Pentium IV
  [Memory] => 256MB DDR SDRAM Expandable to 1GB
  [Display] => 15" TFT Active Matrix
  [Video] => ATI Mobility Radeon, 16MB VRAM
  [Sound] => 16Bit Sound Blaster Pro
  [Battery] => Rechargeable Li-ION 2 Hrs Run, 3 Hrs Charge
  [Drives] => 3.5" Diskette, 40GB Hard Disk, 8X DVD, 24X CDRW
  [Ports] => PS/2 (1), External VGA (1), S-Video, USB (2), 10/100 LAN
  [Size] => 12.96" X 10.72" x 1.73" (WxDxH)
  [Weight] => 7.69lbs
  [Security] => Kensington
)