Monday, December 05, 2005

Yonah's SQL Tip #4 - SOUNDEX and DIFFERENCE

What if your boss turned to you this week and asked you to write a simple search that looked up a contact in a database table by first or last name. On the surface it sounds easy. If your last name field was called LAST_NAME your query might look like this


SELECT *
FROM CONTACTS
WHERE LAST_NAME='theName'



Seems simple enough, but not without problems. Take my last name - Wolf - for example. Let's say you knew what my last name was, but you didn't know how it was spelled - so it could be Wolf, Wolfe, Wolff, Woolf, etc. The above query would require me to try every possible spelling to find a result. So maybe as a precaution, you would try to match the beginning of the word using like:


SELECT *
FROM CONTACTS
WHERE LAST_NAME like 'wolf%'


This would do a little better, but it has two major shortcomings - It will exclude names that should be included like Woolf, for example, and will include names that shouldn't like Wolfish or Wolfson. Of course, my last name is a limited example. If we were searching for, say, Smith, searching under
Smith will also return Smithson but not alternative spellings of Smith, such as Smythe.
Fortunately, there are two SQL functions that can help in this endeavour: SOUNDEX and DIFFERENCE.

SOUNDEX takes a string as an argument and returns a value that is based on the way that the string sounds (I am not 100% how this is caluculated, but it still seems to work). So re-writing our earlier query:


SELECT *
FROM CONTACTS
WHERE SOUNDEX(LAST_NAME)=SOUNDEX('WOLF')


This query will give us the desired effect of matching similarly sounding names with different spellings without excluding most spellings and without including different names. DIFFERENCE takes this one step further. It essentially takes two strings as arguments and then returns the difference of their SOUNDEX values. The difference is between 0-4, with 4 being the closest match. Using difference we can re-write the query one more time, expanding our horizons to include not-so-exact matches, but also ranking them by closeness:



SELECT *,DIFFERENCE(LAST_NAME) as RANK
FROM CONTACTS
WHERE DIFFERENCE(LAST_NAME,'WOLF')>=3
ORDER BY DIFFERENCE(LAST_NAME) DESC


What this query does is it will find names that are not quite 'Wolf', but closely related and then rank them in order of closeness (i.e. it will match, Wolf, Wolfe, and Wolfson and they will be ranked in that order respectively).

So the next time you're in a matching bind, give DIFFERENCE and SOUNDEXa try.

No comments: