Recently I had a request to match some counterparty names from different files.
I decided to script it quickly in a python notebook to validate the approach.
I created some test data with mockaroo to simulate the use-case.
As one of the files had around 500K entries, the Levenshtein Distance algorithm, which I initially tried quite successfully, took nearly 5 minutes per record on my laptop.
The Levenshtein distance is a string metric for ranking the difference between two sequences. Informally, the Levenshtein distance between two strings is the minimum amount of single-character modifications.
I had to do the same for several thousands of counterparty names. It could take weeks to execute.
We all know that Python is not the best language for out-of-the-box performance and multithreading.
And I matched it against some random combinations of names and surnames taken from NameDatabases .
I decided to try ray and go full multithread! And also spread the workload over a cluster.
Install it with pip:
pip install ray
It has a simple API:
- ray.init(): Used to initialize the context, you can specify the number of CPUs to use.
- @ray.remote: Is a decorator that makes your class or function a ray actor.
- .remote: Invokes an asynchronous remote call to a method.
- ray.put(): Passes an object to a remote function synchronously. It returns its ID.
- ray.get(ids): Also synchronous returns and object or list of objects from the ID or IDs.
- ray.wait(ids): Returns the lists of objects that are ready and not ready.
It is a simple example of how to use ray to parallelize tasks; it easily can be expanded to many other use cases.
The obvious one is hyper-parameter optimization; it will allow you to spread the workload across multiple GPUs.
If you want to dig deeper into this topic, I suggest you review the design patterns for ray.
There are many integrations with ray, for example spacy:
pip install spacy-ray
You have a list of integrations here.
As a bonus, you can also use ray on the JVM; find some Java examples in the documentation. Not only that, you can create cross-language execution with Python calling Java actors and vice-versa.