top button
Flag Notify
    Connect to us
      Facebook Login
      Site Registration Why to Join

Facebook Login
Site Registration
Print Preview

Joining two tables on string-likeness

+1 vote

I wish to join two tables on likeness, not equality, of character strings. Soundex does not work. I am using the Levenstein edit distance, written in SQL, a very costly test, and I am in no position to write it in C and link it to MySQL--and joining on equality takes a fraction of a second, and this takes hours. Any good ideas?

posted Jun 3, 2013 by anonymous

Share this question
Facebook Share Button Twitter Share Button Google+ Share Button LinkedIn Share Button Multiple Social Share Button

2 Answers

+1 vote
Best answer

Equality checks have a linear cost of O(min(len1,len2)) and can make use of indexes, too, while Levenshtein cost is is almost quadratic O(len1*len2) and can't make any good use of indexes ... even using a C UDF would help only so far with this kind of complexity. It will increase performance by a constant factor, but given long enough input strings the len1*len2 factor will still account for the majority
of the run time increase over simple equality comparions

there are a few possible points of optimization though, first of all you can cut off equal start and end sequences (linear complexity for that part instead of quadratic). You can also add a few more tricks
if you are only interested in matches below a certain distance threshold:

  • if string lengths differ by more than the threshold value you can rule out this pair of strings as being "similar" right away

  • while iterating over the distance array keep track of the min. distance value of the current row ... if at the end of a row is larger than the threshold distance you can terminate right away

  • only calculate operation cost, not operation type

  • do not maintain a full len1*len2 array, having only the previous and current row in two one dimensional arrays is sufficient (this esp. helps in C implementation as the functions working set is more likely to fit into CPU caches)

answer Jun 3, 2013 by anonymous
+1 vote

Looks like Sphinx might be a helpful addition to your stack. Way too much to explain how to do so here, but it's a proper fulltext engine, and has a MySQL-compatible interface.

answer Jun 3, 2013 by anonymous
Similar Questions
0 votes

How can I insert data into one table from two other tables where i have three tables namely users, role and userrole.
Now I want to insert the data into userrole table from users table and role table with a single statement.

+1 vote

I have a test cluster of two machines, on both of them hadoop is installed. I have configured the hadoop cluster but on admin UI (as in the below picture) I see that two nodes are running on the same master machine, and that the other machine has no Hadoop node.

On master machine following services are running:

~$ jps 26310 ResourceManager 27593 Jps 26216 DataNode 26135 NameNode 26557 NodeManager 26701 JobHistoryServer 

On the slave machine:

~$ jps 2614 DataNode 2920 Jps 2707 NodeManager 

I don't why the slave is not joining the cluster (It was before). I tried to shutdown all servers on both machines and format HDFS then restarting everything but that didnot help. Any help to figure whats causing that behavior is appreciated.

Useful Links with Similar Problem
Contact Us
+91 9880187415
#470/147, 3rd Floor, 5th Main,
HSR Layout Sector 7,
Bangalore - 560102,
Karnataka INDIA.