Java Developer

Fast Fuzzy String Matching with Grails/Hibernate and MySQL

The popular fuzzy string matching methods are Levenshtein distance, Soundex, Metaphone and Double Metaphone. I created a post on Levenshtein distance here. But the performance is very disappointing. Scanning 40,000 records took around 25 seconds.
An alternative is Soundex. It has some deficiencies in terms of accuracy, specially if the language is not English. However, since this is the most widely known phonetic algorithm, it has native support in MySQL. It is also extremely fast. When I switched algorithms, the query ran from 25 seconds from Levenshtein, down to a fraction of a second for Soundex.

If you will use it in MySQL directly, the syntax is like this:

select * from person where last_name sounds like 'Doe' and first_name sounds like 'John';

However, we can't do this in hibernate query language:
from Person where p.lastName sounds like 'Doe' and p.firstName sounds like 'John';
HQL does not understand sounds and like together.

A workaround is to write a stored function. Create sp.sql with contents:

CREATE FUNCTION fuzzy_match (s1 VARCHAR(255), s2 VARCHAR(255))
    RETURN s1 sounds like s2;
END $$

Save to your database instance:

$ mysql -u root -ppassword myschema < sp.sql 

And now you can revise your HQL in Grails or Hibernate project:

def lastName = 'Doe'
def firstName = 'John'
def list = Person.executeQuery("from Person p where fuzzy_match(p.lastName, :lastName) > 0 and fuzzy_match(p.firstName, :firstName) > 0 ",
                    [lastName:lastName, firstName:firstName])

Simple and fast!
Tags: fuzzy, gorm, grails, hql, levenshtein distance, mysql, search, soundex, stored function, stored procedure