Joining two tables on string-likeness

2 Answers

Best answer

Equality checks have a linear cost of O(min(len1,len2)) and can make use of indexes, too, while Levenshtein cost is is almost quadratic O(len1*len2) and can't make any good use of indexes ... even using a C UDF would help only so far with this kind of complexity. It will increase performance by a constant factor, but given long enough input strings the len1*len2 factor will still account for the majority
of the run time increase over simple equality comparions

there are a few possible points of optimization though, first of all you can cut off equal start and end sequences (linear complexity for that part instead of quadratic). You can also add a few more tricks
if you are only interested in matches below a certain distance threshold:

if string lengths differ by more than the threshold value you can rule out this pair of strings as being "similar" right away
while iterating over the distance array keep track of the min. distance value of the current row ... if at the end of a row is larger than the threshold distance you can terminate right away
only calculate operation cost, not operation type
do not maintain a full len1*len2 array, having only the previous and current row in two one dimensional arrays is sufficient (this esp. helps in C implementation as the functions working set is more likely to fit into CPU caches)

answer Jun 3, 2013 by anonymous

Looks like Sphinx might be a helpful addition to your stack. Way too much to explain how to do so here, but it's a proper fulltext engine, and has a MySQL-compatible interface.

answer Jun 3, 2013 by anonymous

Similar Questions

0 votes

How to compare two tables in MYSQL?

0 votes

How can I insert data into a table from two other tables in mysql?

How can I insert data into one table from two other tables where i have three tables namely users, role and userrole.
Now I want to insert the data into userrole table from users table and role table with a single statement.

0 votes

How to add two tables together with same table name but different content in MYSQL?

0 votes

How to make relationship between two tables in MYSQL?

+1 vote

Two hadoop nodes on same machine while a second machine not joining the cluster

I have a test cluster of two machines, on both of them hadoop is installed. I have configured the hadoop cluster but on admin UI (as in the below picture) I see that two nodes are running on the same master machine, and that the other machine has no Hadoop node.

On master machine following services are running:

~$ jps 26310 ResourceManager 27593 Jps 26216 DataNode 26135 NameNode 26557 NodeManager 26701 JobHistoryServer

On the slave machine:

~$ jps 2614 DataNode 2920 Jps 2707 NodeManager

I don't why the slave is not joining the cluster (It was before). I tried to shutdown all servers on both machines and format HDFS then restarting everything but that didnot help. Any help to figure whats causing that behavior is appreciated.

Joining two tables on string-likeness

Your comment on this post:

2 Answers

Your comment on this answer:

Your comment on this answer:

Your answer

Preview