TDD Exercise: Build a Fuzzy Search Library! 🔎
Fuzzy search finds matches even when the search term is misspelled or not exactly correct.
Inspiration
Recently, at work, I had the opportunity to explore use-cases for Fuzzy Search. All along, Fuzzy Search was a topic that I vaguely knew about. I hadn't explored it. This time, the situation was different. I understood what Fuzzy Search is, it's use-cases, and it's limitations. On the other hand, I wanted a fun exercise to practise my TDD skills to become sharper at the craft of Programming. Building a simple, Open Source Fuzzy Search library seemed to be an ideal weekend exercise!
I love creating custom, self-tailored assignments for myself which help me learn new concepts / topics / skills. That's predominantly how I've been able to keep my skills sharp being a self-taught programmer over the years. This TDD exercise is one of those custom self-assignments which I'm sharing with you. If this one is a success, I'll create more assignments for you in the times to come.
What is Fuzzy Search?
Fuzzy search finds matches even when the search term is misspelled or not exactly correct. It works by comparing words based on their similarity, rather than exact matches. For example, searching for "apple" might also show results for "appl" or "applle." It’s useful when you want to find things that are close but not identical.
Explain like I'm 5: If you're looking for your favorite toy, but you call it "bunny" instead of "bunny rabbit," fuzzy search will still help you find it because it understands your mistake!
Fuzzy search plays a crucial role in Information Retrieval by improving search results when there are variations in the input queries. It finds relevant information even when there are typos, misspellings, or different word forms in the search term. This increases the accuracy and usability of search engines or databases, especially when users make common mistakes or are unsure about the exact phrasing. (If you're a Computer / IT engineering student, this article should be super useful to you)
For example, when searching for "computer" but typing "comuter," fuzzy search ensures that relevant results for "computer" are still shown, even though the search term is not an exact match. This is especially important in environments with large datasets or in cases where the data is not perfectly structured. By accommodating such discrepancies, fuzzy search enhances user experience and retrieval quality in dynamic and diverse data sources.
What is TDD (Test Driven Development)?
Test-Driven Development (TDD) is a software development practice that emphasises writing tests before writing the actual code. The rules of TDD are as follows:
Write a Failing Test First: Start by writing a test that defines the functionality or behavior you expect from the code. Initially, this test will fail because the functionality hasn't been implemented yet.
Write the Minimum Code to Pass the Test: Write just enough code to make the test pass. The goal is not to solve the entire problem but to implement the smallest piece of code required to pass the test.
Refactor the Code: Once the test passes, refactor the code to improve its structure or efficiency without changing its functionality. The tests should still pass after the refactor.
Repeat: Repeat this cycle (write a test, write code, refactor) for every new piece of functionality, ensuring that your code remains clean and well-tested.
This cycle helps ensure that code is developed with a clear understanding of the requirements, maintains high test coverage, and stays modular and maintainable.
There's a lot that can be talked about TDD, but for now, this is all you need to know to proceed. Just remember these 4 rules well, imprint them into your mind and recollect them as you work on the project.
How to use this guide?
This is unlike regular guides on the Internet. Clone my GitHub repository and follow along from the first commit. I committed my code in such a way that it's easy to reference in this tutorial. I will explain each commit and the underlying thought process. As you work your way through the commits, feel free to make your own changes or tweak the tests or implementation things as per your liking. Tinkering, by making changes to source code is one of the best ways to learn and I highly encourage it!
After you complete all the steps, feel free to resume the project from scratch and make all the changes yourself without looking at my implementation. This will reinforce the ideas that you have learnt and will help you get your hands dirty with some real code. Take a look at the last section for some "bonus tasks". After I complete this article, I'll continue making improvements to the project, so feel free to star the repo.
Steps: Let's Start Building!
Let the journey commence!
We start by adding a vitest config, a tsconfig, a package json file and some dummy tests. These are some configuration files required for your TypeScript / Node environment, both, for build and test phases. The package json file includes npm scripts for the same.
As per the API defined in the README, we setup the tests and corresponding function signatures.
We define the inputs and outputs to the core logic that will be called by our consumers (by consumers we mean developers who will use the library).
Ideally this should have been committed with tests first, which would indicate the failing step, and then the code, but I forgot to do that and committed both together.
We add some simple tests to check the correctness of the string similarity score calculator that we will implement in the next step. This will lead to failing tests.
Implement the function: We implement Levenshtein distance for now.
We want to make sure that the threshold that's input isn't outside the permissible bounds (0 to 1, inclusive). 0 indicates complete mismatch and 1 indicates exactly similar strings. We add the tests which will lead to the failing step. In the case where the threshold is invalid, an error is thrown.
Implement the function: Implement threshold validation. Simple conditional does the trick.
Null and undefined checks: I like being defensive and hence added these tests for completeness. Nothing special here, feel free to skip this step.
For the similarity score to be accurate, we need to consider multiple types of distance metrics. Certain distance metrics like Levenshtein distance is less effective as compared to Jaro Winkler when there is a prefix match between the two strings. This commit will lead to failing tests which will be fixed in the next step.
Implement the function: We update our similarity score calculator to use the average of Levenshtein distance and Jaro Winkler distance. For both these metrics, we use a third-party library although it's easy to implement them yourself. Feel free to take that up as a bonus exercise.
Again, this commit contains both the tests and implementation. Ideally, they should be separate but I forgot that again. Another point to note, the updated similarity score calculation is included in this commit and not the previous one.
Logic: We simply filter out all the records (strings) who's similarity score is less than the threshold and return the result.
We will need a utility function to check if the records contain all array of objects. You will soon understand why, follow along. This commit includes the failing tests.
Implement the function: Just a simple loop and conditional. Some JS functional magic too. I'll leave it up to you to figure out what that is.
To make our Fuzzy Search module complete, we need to make it work with an array of objects. For this, we need to find a similarity score between the query string and the object being checked during the filter process.
Implement the function: We compute the similarity scores of all key-value pairs, including recursive objects and use the largest value as the final score. For instance, if the object has 7 key value pairs in total with scores 0.2, 0.3, 0.7, 0.4, 0.2, 0.4, 0.4 then 0.7 (the highest value) is taken into consideration. While I won't go into the details of whether this is the best approach or not because I haven't done enough research there, I'll leave it to you as a homework exercise to experiment and find out a better approach (and then let me know in the comments section!).
We add a simple failing test to run an object search with a string query.
Implement the function: We check if the record contains an array of objects (now you see what I was saying earlier?) and then simply compute the object similarity score that we built in the previous step. Easy.
Refactors
Validate Threshold Refactor: We refactor the logic that validates and retrieves the threshold. We also set a constant value (in another constants file) for the default threshold incase it is not provided in the input.
eFuzz Refactor: Separate the logic that handles string records and object records into two different functions.
Note: Ideally, after every passing test, we should refactor. However, out of convenience, I skipped this step and did a big refactor at the end. I don't recommend this. If you're following this article, follow the 4 steps of TDD exactly as prescribed for best results.
Chores
Cleanup Styling: Change import styles. Use default imports everywhere. (optional)
Negative Test For No Matches: Negative test for completeness of eFuzz.
General Naming Fixes: Nothing fancy here.
Design Decisions
Distance Score (Average): We use the average of Levenshtein distance and Jaro Winkler distance to calculate the final score. This is to account for variation in the effectiveness of a distance metric across different nature of input strings. I haven't experimented with any other methods, so feel free to think of alternatives and let me know.
Computing Similarity Score of String and Object: We calculate the similarity score for every field (including fields within nested objects) and pick the highest computed value. Check the previous section for an example. Again, I have not experimented with alternatives. This seemed like a good practical solution so I went ahead with it. I'm sure there's a more accurate way to do this.
Deciding the right threshold values for tests: Slight TDD violation, but I had to play around with the threshold values to decide what's a good range. There's no pragmatism in selecting a precise value because implementations and formulas could change, but a range is a good measure for what the behaviour of the algorithm should be. Probably for another library, 0.8 at certain places would be better than 0.6.
Benchmarks
Comparison with other Libraries: I ran a simple benchmark with fuzzysort (a popular fuzzy search library) just to be "data driven" and know where the current capabilities of this project are. I used benchmarkjs and faker for running the benchmarks and generating the fake data. You can refer to this commit for the exact implementation.
As you can see, fuzzysort is able to perform 105,783 operations per second on 1000 records, making it much faster than efuzz, which can only perform 675 operations per second. The margin of error for both benchmarks is very low, indicating that the results are reliable. Overall, fuzzysort is the faster algorithm for the given test case.
Conclusion & Next Steps
Bonus Tasks: explore alternatives for similarity score, find better ways to calculate similarity between string and object, refactor regularly (follow 4 steps of TDD strictly), aand, lastly, publish your own library to NPM registry with something uniquely your own!
Plans for the future: As you can see, the benchmarks don't look great. I will be working on performance optimisations and then add more advanced features.
Contributions: Feel free to fork, modify, and submit a PR! If you want to copy the code and use it in your own projects, please include the MIT LICENSE and add appropriate credits. Thank you!
Hope you find this useful! Let me know what you think in the comments section, and don't forget to try out the bonus tasks.
— Aditya Patange (AdiPat)
I actively work on Open Source Software, check out my GitHub Profile. ✨
Follow me on Instagram (@adityapatange), I talk about tech, meditation, startups and hip hop! ⚡️
I write byte-sized insights on Threads to supercharge your day. 💡