Check the similarity of two columns in a dataframe (fuzz matching); drop missings

I merged two dataframes based on variable names, but i want to double check to maker sure the definition of each variable name is the same. But the definitions in different years use slightly different texts.

df

I want to check the similarity between the column “Definition” and “Definition2015”.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

df_concordance_2015_matched['Partio_ratio']= df_concordance_2015_matched.apply(lambda x: fuzz.partial_ratio(x['Definition'],x['Definition2015']),axis=1)

results:

fuzz results

caveat:
If there are missing values in the two columns, the syntax will fail. So I drop the missing first before using the fuzz function. In my case, the missings are caused by failed merges, and I don’t really care about the not matched variables:

df_concordance_2015_matched = df_concordance_2015[df_concordance_2015['_merge']=="both"]

or you can use the following code to drop missings from a specific columns:

df_concordance_2010_2015_temp = df_concordance_2010_2015.dropna(subset=['Defintion_2015'])

Reference:

http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/fuzzing-matching-in-pandas-with-fuzzywuzzy/

Leave a Reply

Your email address will not be published. Required fields are marked *