The popular fuzzy string matching methods are Levenshtein distance, Soundex, Metaphone and Double Metaphone. I created a post on Levenshtein distance here. But the performance is very disappointing. Scanning 40,000 records took around 25 seconds.
An alternative is Soundex. It has some deficiencies in terms of accuracy, specially if the language is not English. However, since this is the most widely known phonetic algorithm, it has native support in MySQL. It is also extremely fast. When I switched algorithms, the query ran from 25 seconds from Levenshtein, down to a fraction of a second for Soundex.
This is a hackish approach for fuzzy string matching in Grails, as an alternative to searchable plugin, assuming MySQL database is used.
As an introduction, the levenshtein distance is used on how similar two strings are. It computes the minimum number of character substitution that has to be done, in order to convert one string from the other. For example, if the levenshtein distance is 0, it means the two strings are exactly the same. If levenshtein distance is 1, it means there is only 1 character differing the two strings. We can use this function for fuzzy matching using a very small treshold for the distance.
MySQL is a popular choice of database for many developers. It's open source, easy to install, stable, and very fast. However, MySQL comes with several storage engine and you need to decide which one to choose. The two most popular are:
MyISAM - this is the default storage engine for MySQL prior to version 5.5. I believe this is the fastest storage engine of all, but the drawside is it does not support transactions.
InnoDB - this is not as fast as MyISAM, but it support transactions. This is the most popular among Java/Grails developers.
During development, I recommend using a standalone database server like MySQL, instead of using the default HSQLDB or H2 database. Because you can open a client and inspect the table structure created, and also perform queries outside the Grails application.
To configure MySQL in your datasource, first download the driver from: http://dev.mysql.com/downloads/connector/j/
I wanted to restore a large production database before in my home computer, to replicate a particular bug. Since the database dump is so large, it took a lot of time. Around 30 minutes. The issue with such operation is that you don't know if the MySQL process hanged-up, or if it is still normally progressing. A good solution is to have a progress bar.
The default configuration of MySQL is not very fast. They made it generic so that it can run on servers with small memory (For example, a VPS with 256Mb RAM). But desktop nowadays have so much memory and disk space that we can trade off for faster performance.
Performance is important because it affects your productivity. I have experimented with many configurations, but below is the simplest that I came up with that gave me good results.