I merged two dataframes based on variable names, but i want to double check to maker sure the definition of each variable name is the same. But the definitions in different years use slightly different texts.
I want to check the similarity between the column “Definition” and “Definition2015”.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df_concordance_2015_matched['Partio_ratio']= df_concordance_2015_matched.apply(lambda x: fuzz.partial_ratio(x['Definition'],x['Definition2015']),axis=1)
results:
caveat:
If there are missing values in the two columns, the syntax will fail. So I drop the missing first before using the fuzz function. In my case, the missings are caused by failed merges, and I don’t really care about the not matched variables:
df_concordance_2015_matched = df_concordance_2015[df_concordance_2015['_merge']=="both"]
or you can use the following code to drop missings from a specific columns:
df_concordance_2010_2015_temp = df_concordance_2010_2015.dropna(subset=['Defintion_2015'])
Reference:
http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/fuzzing-matching-in-pandas-with-fuzzywuzzy/