SQL Searching with Relevance and Natural Language Processing

Working at a start-up in the early stages things are often built to solve a problem at the time, more often than not with a specific client in mind and time-frame in mind. As the company grows and becomes more established sometimes this very same code needs to scale or change based on features or performance original anticipated.

One example of this is if your building your own search queries in SQL.

We’ve all done it. They start out all pretty like this:

SELECT *
FROM clients
WHERE email LIKE :query

And evolve:

SELECT *
FROM clients
WHERE (
	first_name LIKE :query
	OR last_name LIKE :query
	OR email LIKE :query
)

Maybe this:

SELECT *, CONTACT(this, ' ', that, IF(selected=1, ' selected ', ''), ' etc') as search
FROM clients
JOIN addresses ON address_id=id
HAVING search LIKE :query

One of the problems we found with using LIKE is that it is a binary row selection, that is to say something is either included or not included. As well as this, if the search returns 100 results, the one you are looking for might be down there in the 90’s even though you matched it exactly – but it was found.

To try to combat this we moved to RLIKE and broke the words into different OR words using a pipe (|). This also fell short, in fairness for pretty much the same reason. It did find the results, but with no particular order. As well as that, if you searched within an address with the query, ‘beach road’, it would find every street with the word road in it (which is quite a few).

Our Solution (for the moment anyway)#

After doing some digging around on the internet and coming up with results like Soundex, fuzzy matching and Elasticsearch. Nothing seemed a perfect blend of simple to implement, fast and didn’t require us shipping off our data to a third-party.

That is until we found Full Text searching using natural language. What this is is a search method found within MySQL which allows you to search within some data and a score based on relevance will be returned. This means that if you match the query exactly you will get a perfect score, but the natural of natural language searches mean that certain words like ‘a’ and ‘the’ will get removed (as they are deemed irrelevant).

Here’s an example:

SELECT *, MATCH(first_name, last_name, email) AGAINST (:query IN NATURAL LANGUAGE MODE) as score
FROM clients
HAVING score > 1
ORDER BY score DESC

In the example we put the search within the select then filter out irrelevant results within the HAVING – but it would be fine to duplicate to the match and put it into the WHERE because the optimizer figures out they are the same bit of code and runs them only once.

It’s important to note that the only way this query above will run is if all the columns are contained within a single FULLTEXT index (note: don’t make the same mistake I originally made by putting three separate indexes on the columns).

CREATE FULLTEXT INDEX index_name
ON table_name (col1, col2, col3)

Searching Across Tables#

Because the columns we are searching must have an index against them, it makes searching data across tables (with join) very difficult – because an index cannot span two or more tables. We can, however, avoid this problem. If we treat them as two different searches, then add them together, the total score is all that matters. Using this technique we can all give certain columns or searches bias to emphasize some data over other data.

SELECT *, 
       SUM(
           (MATCH(first_name, last_name, email) AGAINST (:query IN NATURAL LANGUAGE MODE) * 2)
           + MATCH(number) AGAINST (:query IN NATURAL LANGUAGE MODE)
       ) as score
FROM clients
JOIN phone_numbers on phone_numbers.client_id=clients.id
HAVING score > 1
ORDER BY score DESC

In the example above we also allow searching on phone numbers, but we say that the name and email are twice as important compared with the phone number. This becomes especially useful when searching on various addresses or names at the same time – as it perhaps isn’t so important when working with the example above.

That’s about all there is to it, I’m sure we’ll likely need to move our search beyond this method in time but it seems to be working well at the moment. Here are two StackOverflow Q/A’s which helped me out: