Grails fuzzy string matching with MySQL and Levenshtein distance

This is a hackish approach for fuzzy string matching in Grails, as an alternative to searchable plugin, assuming MySQL database is used.

As an introduction, the levenshtein distance is used on how similar two strings are. It computes the minimum number of character substitution that has to be done, in order to convert one string from the other. For example, if the levenshtein distance is 0, it means the two strings are exactly the same. If levenshtein distance is 1, it means there is only 1 character differing the two strings. We can use this function for fuzzy matching using a very small treshold for the distance.

For illustration, assume we have a Person domain:

package example.fuzzy
class Person {
    String firstName
    String lastName
    String middleName
    String nickName
}

We can use a MySQL stored procedure to compute for Levenshtein distance. It would be much slower if you will fetch all records from the database and do the matching inside groovy/java. It is much more efficient if all computation are done on the database side. I’m using a stored procedure borrowed from here. Create an sql file ( E.g. sp.sql ) with the contents below.

DELIMITER $$

DROP FUNCTION IF EXISTS levenshtein_distance $$

CREATE FUNCTION levenshtein_distance (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
  DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
  DECLARE s1_char CHAR;
  DECLARE cv0, cv1 VARBINARY(256);
  SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = '\0', j = 1, i = 1, c = 0;
  IF s1 = s2 THEN
    RETURN 0;
  ELSEIF s1_len = 0 THEN
    RETURN s2_len;
  ELSEIF s2_len = 0 THEN
    RETURN s1_len;
  ELSE
    WHILE j <= s2_len DO
      SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
    END WHILE;
    WHILE i <= s1_len DO
      SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
      WHILE j <= s2_len DO
        SET c = c + 1;
        IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
      END WHILE;
      SET cv1 = cv0, i = i + 1;
    END WHILE;
  END IF;
  RETURN c;
END $$

DELIMITER ;

Note that this function seems to be case insensitive.

Create this to your MySQL schema:

$ mysql -u root -ppassword myschema < sp.sql 

Now inside your controller or service, you can do the fuzzy matching via HQL:

    def firstNameToMatch = 'john'
    def lastNameToMatch = 'Doe'
    def peopleWithSimilarName = Person.executeQuery("from Person p where levenshtein_distance (p.lastName,:firstNameToMatch) + levenshtein_distance (p.firstName, :lastNameToMatch) <= 2", [firstNameToMatch:firstNameToMatch, lastNameToMatch:lastNameToMatch])

And of course, since Grails is based on Hibernate, you should be able to do the same in straight Hibernate.

Based on my testing, this function is not suitable for large tables. In my i7 Ivy Bridge laptop, it takes 25 seconds to search through 40,000 records. Not that practical unless used on very small table.

Comments

comments