Python merge two dataframes based on multiple columns

first dataframe df has 7 columns, including county and state.
second dataframe temp_fips has 5 colums, including county and state.
I would like to merge them based on county and state.

df = df.merge(temp_fips, left_on=['County','State' ], right_on=['County','State' ], how='left' )

In the beginning, the merge function failed and returned an empty dataframe. I found that my State column in the second dataframe has extra spaces, which caused the failure. I used the following code to remove extra spaces, then merged them again.

df['State'] = df['State'].str.replace(' ', '')

Here are some problems I had before when using the merge functions:

1. ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat

df_import_month_DESC_pop = df_import_month_DESC.merge(df_pop, left_on='stat_year', right_on='Year', how='left', indicator=True)

ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat

The problem is caused by different data types. The key variable could be string in one dataframe, and int64 in another one.
Solution:
df_pop = pd.DataFrame({'Year':['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019'],
'Population':['309321666', '311556874', '313830990', '315993715', '318301008', '320635163', '322941311', '324985539', '326687501', '328239523']})
df_pop['Year']=df_pop['Year'].astype(int)
df_import_month_DESC.shape
df_import_month_DESC_pop = df_import_month_DESC.merge(df_pop, left_on='stat_year', right_on='Year', how='left', indicator=True)

2. ValueError: Cannot use name of an existing column for indicator column

It’s because “_merge” already exists in the dataframe. You can change the “indicator=True” clause to another string, such as indicator=’Check’

Definition of the indicator variable in the document:

indicator: bool or str, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.

Leave a Reply

Your email address will not be published. Required fields are marked *