I’m using the HubSpot integration with BigQuery and having trouble with deleted contacts showing up. I understand Stitch doesn’t delete the rows when rows in the data source are hard-deleted, but is there a way for me to filter out those deleted contacts in SQL?
I recently went through this process with Stitch. The root cause of Stitch not cleaning up deleted contacts is Hubspot's lack of an API to pull or identify deleted contacts.
Because the Hubspot tap is a 'Paid' tap, I was hoping this could be resolved, but there are currently no plans to address it. There are various ways Stitch could solve this, but most of the solutions I've proposed do not fit into the Singer specification. Again, Hubspot's API is the problem.
Here is our current solution for orphaned Hubspot data in the Stitch destination:
- We have a Zato ESB instance (https://zato.io/) that is the center of our data-integration universe; it consolidates all of our scattered PHP scripts, shell scripts, Zaps, Integromat scenarios, etc. into one spot. It's Python based, with pretty good docs and a good community. For you this could be some PHP/Python code running in a cron job, etc.
- We wrote some code that pulls all Company, Contact, and Deal IDs once a day via the Hubspot API and loads them into three tables that sit next to the Hubspot tables in the Stitch destination database. Those tables are then used to identify any orphaned (deleted) rows in the various Stitch Hubspot tables and delete them. The queries are run via the BigQuery Python library, which takes only a handful of lines of code.
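A minimal sketch of the daily ID-snapshot step for contacts, assuming HubSpot's v3 CRM list endpoint and a private-app token in a `HUBSPOT_TOKEN` environment variable. The `vid` column name and the table names passed to `orphan_delete_sql` are illustrative assumptions about the Stitch destination schema, not details from this thread; the generated SQL would be run via the BigQuery Python client as described above.

```python
import json
import os
import urllib.parse
import urllib.request

# Assumption: a HubSpot private-app token is available in the environment.
HUBSPOT_TOKEN = os.environ.get("HUBSPOT_TOKEN", "")

def fetch_all_contact_ids():
    """Page through HubSpot contacts 100 at a time, collecting every ID."""
    ids, after = [], None
    while True:
        params = {"limit": "100"}
        if after:
            params["after"] = after
        url = ("https://api.hubapi.com/crm/v3/objects/contacts?"
               + urllib.parse.urlencode(params))
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {HUBSPOT_TOKEN}"})
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        ids.extend(r["id"] for r in body["results"])
        paging = body.get("paging")  # absent on the last page
        if not paging:
            return ids
        after = paging["next"]["after"]

def orphan_delete_sql(snapshot_table, stitch_table):
    """Build a DELETE for rows absent from the fresh ID snapshot.
    'vid' as the contact-ID column is an assumption about Stitch's schema."""
    return (f"DELETE FROM `{stitch_table}` "
            f"WHERE vid NOT IN (SELECT id FROM `{snapshot_table}`)")
```

After loading the snapshot IDs into their table, the cleanup is a single statement per object type, e.g. `orphan_delete_sql("proj.hubspot.contact_ids", "proj.hubspot.contacts")`.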
Another approach is to use Hubspot webhooks. These are configured via a developer portal, which is not the easiest thing to set up (a clunky manual process for "installing" the "app" into your "portal"). With webhooks you basically get an event notification from Hubspot when objects are deleted. This of course requires you to have infrastructure in place to receive the webhooks so the deletes can be processed on your end.
We do have everything in place in Zato to process these webhooks. Once our full Hubspot-to-on-premise data integration is live, we will turn this functionality on.
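The webhook receiver described above could be sketched with nothing but the standard library. This assumes HubSpot posts a JSON array of events whose `subscriptionType` ends in `.deletion` for deletes (e.g. `contact.deletion`); the endpoint path and what you do with the IDs are up to your own infrastructure.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def deleted_object_ids(events):
    """Pull the objectIds out of any *.deletion events in a webhook payload."""
    return [e["objectId"] for e in events
            if e.get("subscriptionType", "").endswith(".deletion")]

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        events = json.loads(self.rfile.read(length))
        for object_id in deleted_object_ids(events):
            # Here you would queue object_id for deletion from the
            # corresponding Stitch destination table.
            pass
        self.send_response(204)
        self.end_headers()

# To run the receiver (blocking):
# HTTPServer(("", 8080), WebhookHandler).serve_forever()
```

In practice you would also verify HubSpot's request signature before trusting the payload.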
There are other areas of the Hubspot tap that have been making my life difficult…
I’ve been asking myself why I am paying for Stitch right now?
The answer to this question is as follows:
- It generally meets my ‘it just works’ criteria.
- As we grow our Hubspot implementation and usage, we will need to keep an eye on our API usage (max 40K calls per day). Going above 40K means $500 a month to Hubspot for additional API usage. Stitch helps us mitigate that, since their API usage does not count against ours. (Pretty sure that is the case.)
- The Hubspot tap is dynamic and adapts to our ever changing Hubspot properties.
- As long as we can manage our row counts in the $100 a month tier it’s a great value add.
- It is absolutely critical to have a full copy of Hubspot data in a relational database when doing complex data integration with other information systems, e.g. SQL queries to identify large datasets for cleanup or updates.
Thanks for the really comprehensive reply!
You’re welcome Bert.
Now that I’ve had this up and running for a few weeks, I’ve identified some issues with this approach.
Since Stitch has no customer-facing API or webhooks that let me, on the destination end, know when a Stitch data load is happening, it's very difficult to sync up my ID-pulling code with the data loads coming from Stitch. The process to pull the IDs takes about 20 minutes (Hubspot only allows retrieving 100 contacts at a time!), so by the time I have a current copy of all IDs from Hubspot, Stitch could have loaded newer data into the destination. That would mean there are IDs in the Stitch destination tables that are not in my master ID list, and the result would be deleting rows in the Stitch tables that should not be deleted. If the processes overlap, or something fails, we can inadvertently delete rows from the destination tables that were not actually deleted in Hubspot.
Even using Stitch 'anchor times', I'm finding it very difficult to ensure these processes do not overlap. I've asked Stitch for some simple tables in the destination that get a row inserted when a data load is running/completed/failed/etc., so that DBAs on the destination end know what the heck is going on. They were sort of like 'meh.'
I'm back to being frustrated that Stitch has a product that claims to replicate your data from Hubspot, but it comes with a huge asterisk. Not removing rows in the destination that were deleted in the source is just plain broken. I'm also frustrated that a company with enterprise pricing levels doesn't have simple features that allow DBAs on the destination end to know that Stitch is doing something.
I get it, Stitch - Hubspot's APIs suck. I've raised this issue with our Customer Success Manager, on the Hubspot forums, etc. It's falling on deaf ears.
Has stitch tried using some of their clout with Hubspot to get some improvements to their api?
SUMMARY: Hubspot needs an api endpoint to return deleted contacts, companies, deals, tickets, etc.
Thanks for the thoughtful replies on this topic. The approach sketched out here of periodically comparing a fresh snapshot of all Hubspot contacts to the contacts in the warehouse and deleting any that are not in the snapshot is the best option at the moment given the limitations of Hubspot’s API.
One suggestion that may help with your timing issue: Stitch appends an “_sdc_batched_at” column to all of the data it replicates, which will contain the timestamp that each datapoint was copied into the warehouse. During your deletion routine, you could use this column to exclude data that was recently loaded, following a process like this:
- Record the maximum value of _sdc_batched_at in your contacts table
- Then run the routine to pull a fresh contacts snapshot from Hubspot
- Delete records in your contacts table that are both not contained in the snapshot and have an _sdc_batched_at value less than the maximum you recorded
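The three steps above can be sketched as plain SQL built in Python. `_sdc_batched_at` is the real Stitch system column; the project/dataset/table names and the `vid` ID column are illustrative assumptions, and the statements would be executed via the BigQuery Python client mentioned earlier in the thread.

```python
# Step 1: record the current high-water mark BEFORE pulling the snapshot.
# (Run this query first and keep its result as the cutoff.)
CUTOFF_SQL = "SELECT MAX(_sdc_batched_at) FROM `myproject.hubspot.contacts`"

# Step 2: pull the fresh contact-ID snapshot from Hubspot and load it into
# the snapshot table (see the daily ID-pull routine earlier in the thread).

def guarded_delete_sql(stitch_table, snapshot_table, cutoff):
    """Step 3: delete only rows that are BOTH absent from the snapshot AND
    were batched before the snapshot pull began, so anything Stitch loaded
    mid-pull survives until the next run."""
    return (f"DELETE FROM `{stitch_table}` "
            f"WHERE vid NOT IN (SELECT id FROM `{snapshot_table}`) "
            f"AND _sdc_batched_at < TIMESTAMP('{cutoff}')")
```

The timestamp guard is what closes the race condition described above: a contact created in HubSpot after the snapshot pull started will have an `_sdc_batched_at` newer than the cutoff, so it is excluded from deletion even though it is missing from the snapshot.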