It's a tale as old as time: your company has multiple instances of the same dataset which should match, but for whatever reason, don't. An obvious example that comes to mind might be discrepancies between environments: your staging environment probably falls out of sync with production often. This is an easy problem to solve because it's clear why these types of systems get out of sync. Solving the problem is simple as syncing one system to the other, usually where one system serves as the "source of truth." But what about discrepancies between systems without a source of truth? What do we do when parallel systems should have comparable data, but don't? Lastly, how can we figure out what went wrong?

Companies are prone to tons of scenarios for this exact problem, and they're usually the kind of situations that are deadly when unresolved. For example, let's say your company tracks new hires by adding them to an HR system (such as Bamboo or Workday). These new hires need access to internal company systems (like Confluence, or anything behind your company's VPN). Your company prides itself on being "scrappy," thus depends on an IT department to provision these users across systems manually. Assuming nobody has made a user error, these systems will have matching user information... until those hires are terminated.

No matter the situation, you're nearly guaranteed to run into this issue at some point. Luckily for us, the solution isn't too complicated.

Setting the Stage

The assumption here is that we’re comparing the rows in our data. We need two datasets which have matching columns, but different entries. Sticking to our employee example, I'm going to use two fake datasets containing employee information as such:

id first_name last_name email gender department
424 Regine Aberkirder raberkirderbr@shop-pro.jp Female Marketing
989 Jessa Acome jacomerg@ustream.tv Female Business Development
318 Marci Acres macres8t@paypal.com Female Marketing
996 Olympe Acutt oacuttrn@flickr.com Female Support
516 Baxy Adamides badamideseb@google.pl Male Research and Development
743 Melba Addey maddeykm@vimeo.com Female Accounting
207 Willie Addionisio waddionisio5q@hhs.gov Male Human Resources
764 Noby Agutter nagutterl7@ow.ly Male Accounting
950 Avery Ahlin aahlinqd@ihg.com Male Research and Development
434 Richart Akid rakidc1@ow.ly Male Engineering
863 Kalindi Alebrooke kalebrookeny@biglobe.ne.jp Female Product Management
250 Marley Alen malen6x@fema.gov Female Legal
481 Wilfrid Algie walgiedc@paypal.com Male Product Management

We'll make two Pandas DataFrames from these similar data sets:

df1 = pd.read_csv('data/employees1.csv')
df2 = pd.read_csv('data/employees2.csv')

Now let's get to work.

Pandas Merge With Indicators

The first piece of magic is as simple as adding a keyword argument to a Pandas "merge." When merging two DataFrames in Pandas, setting indicator=True adds a column to the merged DataFame where the value of each row can be one of three possible values: left_only, right_only, or both:

id first_name last_name email gender department _merge
990 Yanaton Tumioto ytumiotol@ehow.com Male Training right_only
962 Lara Minichillo lminichillom@photobucket.com Female Accounting right_only
976 Reine Haldon rhaldonn@bbc.co.uk Female Services right_only
961 Lyman Marcq lmarcqp@merriam-webster.com Male Engineering right_only
221 Findlay Danett fdanettr@ibm.com Male Human Resources both
675 Muhammad Rimbault mrimbaults@feedburner.com Male Research and Development left_only
692 Mindy Ruggen mruggent@unc.edu Female Marketing left_only
669 Isidoro Reedie ireedieu@so-net.ne.jp Male Business Development left_only
783 Niels Strase nstrasev@utexas.edu Male Legal both
120 Jock Bucktharp jbucktharpw@stanford.edu Male Research and Development both
734 Leanora Sigert lsigertx@so-net.ne.jp Female Sales both
52 Bea Behneke bbehnekey@163.com Female Support both
563 Cindee Moorcroft cmoorcroftz@hao123.com Female Support both
172 Kelvin Clubb kclubb10@gmpg.org Male Product Management both
362 Stella Hirtz shirtz11@hibu.com Female Legal both
273 Archibold Eager aeager12@macromedia.com Male Services both

As you might imagine, rows marked with a value of "both" in the merge column denotes rows which are common to both DataFrames. left_only and right_only mark rows which were present in either the left or right DataFrame, respectively.

We already have a big piece of the puzzle solved. Let's build off of this to create a reusable function which returns exactly what we're looking for each time.

Creating a Code Snippet

Let's build a function called dataframe_difference() which answers any of 4 questions:

  1. Which rows were only present in the first DataFrame?
  2. Which rows were only present in the second DataFrame?
  3. Which rows were present in both DataFrames?
  4. Which rows were not present in both DataFrames, but present in one of them?

For starters, our function dataframe_difference() will need to be passed two DataFrames to compare. To make things interesting, let's add an optional keyword argument which allows us to return rows for each of the four scenarios above:

def dataframe_difference(df1, df2, which=None):
    """Find rows which are different between two DataFrames."""
    comparison_df = df1.merge(df2,
                              indicator=True,
                              how='outer')
    if which is None:
        diff_df = comparison_df[comparison_df['_merge'] != 'both']
    else:
        diff_df = comparison_df[comparison_df['_merge'] == which]
    diff_df.to_csv('data/diff.csv')
    return diff_df

If we call our function without a keyword argument (ie: dataframe_difference(df1, df2) ), our function will answer question #4:

id first_name last_name email gender department _merge
990 Yanaton Tumioto ytumiotol@ehow.com Male Training right_only
962 Lara Minichillo lminichillom@photobucket.com Female Accounting right_only
976 Reine Haldon rhaldonn@bbc.co.uk Female Services right_only
961 Lyman Marcq lmarcqp@merriam-webster.com Male Engineering right_only
675 Muhammad Rimbault mrimbaults@feedburner.com Male Research and Development left_only
692 Mindy Ruggen mruggent@unc.edu Female Marketing left_only
669 Isidoro Reedie ireedieu@so-net.ne.jp Male Business Development left_only
746 Rhianon Sloan rsloan16@ovh.net Female Sales left_only
952 Jennilee Palle jpalle19@state.gov Female Marketing right_only
970 Nedda Grieger ngrieger1c@quantcast.com Female Marketing right_only

As expected, values in the merge column contain left_only and right_only, but not both. The opposite of this would be answering question #3: which rows were present in both DataFrames? To accomplish this, we can pass which='both' to our function. dataframe_difference(df1, df2, which='both') would look something like this:

id first_name last_name email gender department _merge
863 Marcia Whalebelly mwhalebelly1f@jugem.jp Female Accounting both
470 Averell Letteresse aletteresse1g@slashdot.org Male Accounting both
125 Berri Bullough bbullough1h@comsenz.com Female Product Management both
258 Vassili Drinkwater vdrinkwater1i@posterous.com Male Legal both
701 Laetitia Sarah lsarah1j@nationalgeographic.com Female Human Resources both
507 Hendrika Malec hmalec1k@ifeng.com Female Support both
163 Haskell Chiplen hchiplen1l@house.gov Male Engineering both
90 Gardiner Bonhome gbonhome1m@163.com Male Human Resources both
705 Arley Scapens ascapens1n@sitemeter.com Male Product Management both
127 Dacey Burrett dburrett1o@nydailynews.com Female Training both

Passing which='left_only'  and which='right_only' do exactly what you'd expect.

Feel free to take this snippet and use it as your own. I've added it to Github as a Gist here.

BONUS ROUND: Uploading Missing Rows to SQL Table

We can easily apply the function we just created to help us sync rows between two database tables. If we want to make sure rows between two SQL tables match, we can do something like this:

df1 = pd.read_sql_table('table1',
                        con=engine,
                        index_col='id')
df2 = pd.read_sql_table('table2',
                        con=engine,
                        index_col='id')
diff_df = dataframe_difference(df1, df2)
diff_df.to_sql('table2',
               con=self.engine,
               index=True,
               index_label='id',
               if_exists='append')

Pandas' to_sql() method has a nifty keyword argument called if_exists. If we set if_exists='append' when using to_sql(), Pandas will append rows from a DataFrame to an existing SQL table. Since we used our dataframe_difference() function to find which rows were different, we were able to ensure that we only uploaded rows that were different.