Where to learn about creating my first data product


#1

As a data analyst/business analyst, I have some very basic data engineering experience (e.g., I set up our analytics warehouse, maintain the database models, and write Python scripts on my laptop to ETL some Google Sheets/Google Forms datasets), but I would love to figure out how to move to the “next level” and learn how to make proper “data products” that live on their own and that other business users can interact with easily.

For example, at my current company (Mac & Mia), I have a hacky way to calculate a user “style score” based on the photos a user likes on our website. I grab the user data from our production database, run it through a little Python script on my laptop, and spit out the results into a table in our analytics warehouse. Then a dbt model runs on top of that data in the warehouse to connect it to the rest of the user data, which we can see in Looker around 4 hours later (or more, if my laptop isn’t on to run the cron jobs). The delay has been acceptable so far, but I would love to learn how to set up a little web server that can accept POST requests, do some calculations or transformations on the data, and return the result to the requesting client, for when speed does become more mission-critical.
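For concreteness, the sort of thing I picture is a tiny Flask app along these lines (the scoring function and the payload shape here are just placeholders for my real logic):

```python
# Minimal sketch of a scoring endpoint in Flask.
# compute_style_score and the request/response fields are placeholders
# for the logic that currently lives in my laptop cron script.
from flask import Flask, jsonify, request

app = Flask(__name__)


def compute_style_score(liked_photo_ids):
    # Placeholder scoring logic.
    return len(liked_photo_ids) / 100.0


@app.route("/style-score", methods=["POST"])
def style_score():
    payload = request.get_json(force=True)
    score = compute_style_score(payload.get("liked_photo_ids", []))
    return jsonify({"user_id": payload.get("user_id"), "style_score": score})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```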

I think I have a general grasp of the concept, but I start to get overwhelmed by the vast array of options (where do I host it? what’s the right API endpoint structure? do I need a message broker?) as soon as I try to set up a proper web app for it.

Anyone have recommendations for where to get started? Is this something that’s best learned on a place like Udemy?


#2

Hey! That’s a great question. For scheduling and workflow purposes, I know a lot of dbt users use Airflow (it has good documentation too). Fishtown Analytics also has a workflow tool called Sinter.
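As a rough illustration (not a copy-paste recipe), a minimal Airflow DAG that shells out to dbt on a schedule could look something like this; the dbt project path and the schedule are placeholders, and it assumes the Airflow 1.x BashOperator import path:

```python
# Minimal Airflow DAG that runs `dbt run` once a day.
# Uses the Airflow 1.x BashOperator import path; the dbt project path
# and the schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="dbt_daily_run",
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 6 * * *",  # 06:00 UTC every day
    catchup=False,
)

dbt_run = BashOperator(
    task_id="dbt_run",
    bash_command="cd /opt/analytics/dbt_project && dbt run",
    dag=dag,
)
```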

If you want to stick with cron jobs, perhaps you could host them on an EC2 instance instead of your local machine. That way there won’t be a delay when your machine isn’t on.

Hope this helps!


#3

At RealSelf I built a framework in AWS to do what I want. I have a CloudWatch rule that executes a Lambda function on a schedule (it uses a cron expression). The Lambda function sends a message to a Simple Queue Service (SQS) queue containing the command I want to run (a bash script, Python, or whatever). That’s the task-queueing side.
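The Lambda side of this is tiny. A sketch of what such a function could look like (the queue URL and the command are placeholders):

```python
# Sketch of the Lambda that the CloudWatch rule triggers: its only job
# is to drop the command to run onto an SQS queue. The queue URL and
# the command are placeholders.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/task-queue"


def handler(event, context):
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"command": "python etl/style_score.py"}),
    )
```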

On the task execution side, I have an Elastic Container Service (ECS) cluster containing one Elastic Compute Cloud (EC2) instance. That ECS cluster hosts a service that runs a task definition using a Docker image I have pushed to Elastic Container Registry (ECR). That Docker image is built to have all the environment variables and credentials files I need to do my tasks. The entrypoint of the Docker image is a Python script containing a while True: loop. That loop checks GitHub to see if it needs to pull fresh code from the repository where my scripts live (and does so if needed), then checks SQS to see if there are any messages waiting. If there are messages in the queue, it polls them and executes each one in its own subprocess. It then sleeps for 60 seconds before starting the loop over again.
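A simplified sketch of that worker loop is below. The queue URL and repo path are placeholders, and where my real script only pulls from GitHub when there is new code, this version just pulls every cycle:

```python
# Simplified sketch of the worker loop: refresh code, poll SQS,
# run each message's command in its own subprocess, sleep, repeat.
# The queue URL and repo path are placeholders.
import json
import subprocess
import time

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/task-queue"
REPO_DIR = "/srv/analytics-scripts"

while True:
    # Keep the scripts current (the real version only pulls when needed).
    subprocess.run(["git", "-C", REPO_DIR, "pull"], check=False)

    # Check SQS for waiting messages.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
    for msg in resp.get("Messages", []):
        command = json.loads(msg["Body"])["command"]
        # Run each task in its own subprocess.
        subprocess.Popen(command, shell=True, cwd=REPO_DIR)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    time.sleep(60)
```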

It sounds complicated, and it’s true that there are many moving parts, but each moving part really only has one thing it’s trying to do, which makes it remarkably stable. CloudWatch’s only job is scheduling. Lambda’s only job is sending messages to SQS. SQS’s only job is to act as a task broker. The looping Python script’s only jobs (okay, it has more than one) are to keep the code current and to execute the scripts that do the actual work. As long as I don’t mess with the Docker container (changing versions of pip packages, for instance), everything just works. And if the Docker container does exit (which it will do if/when the looping Python script exits for any reason), the service config in ECS will just start up a new container.

I’m sure there’s probably a better way to do what I’m doing, but this has worked pretty well for us.

This isn’t the most beautiful diagram of it, but it might help to visualize it:

It’s worth noting that in this environment, most of the Python scripts we run use Luigi for orchestration, but you don’t have to do that. You should be able to call dbt from within Python, either in a subprocess or by importing the dbt package (this has been discussed several times over on the dbt Slack group, but isn’t well documented). If you’re a Sinter user, you can also use the pysinter package (which Connor @ Fishtown and I have collaborated on) to call the Sinter API, and if you use Airflow, you can use the Sinter operator (which itself uses pysinter).
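For the subprocess route, a minimal sketch would be something like this (the project directory is a placeholder):

```python
# Call dbt from Python by shelling out to it; the project directory
# is a placeholder.
import subprocess

subprocess.run(
    ["dbt", "run"],
    cwd="/srv/analytics-scripts/dbt_project",
    check=True,
)
```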


#4

Thanks, Michael! This is a very helpful overview of the whole framework/ecosystem. Really like how independent all the parts are in your setup. I’ll have to work my way up to this but it’s a really good view of all the parts I might need.


#5

Hey @akmohr … your question is interesting because it points out that solving the problem you’re describing is more complicated than it should be!

Couple ideas:

  1. Use Stitch to push data into your data warehouse, then run dbt transforms on a schedule via the Sinter tool (rather than running Python).
  2. If you’re set on using Python, as @akash says, a lot of teams are using Airflow for this sort of workflow management. There are probably some good tutorials for setting this up.
  3. @michael.dunn 's solution is a solid one when not using either of the above. As a variant on that idea that’s easier to get started with (but not a long-term solution): use an AWS CloudWatch rule to trigger a Lambda on a schedule, but instead of using it to push into a queue, just run your Python script in that Lambda function. You can have that function hit your db, ETL some data, and push it into the data warehouse. It’s not ideal long term to be doing database work from a Lambda, but it’s a way to iterate on the idea quickly (there’s a sketch of this after the list below).
  4. Set up a web server as you describe, with endpoints for your code. I’d probably use something lightweight like Python’s Flask.
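For idea 3, a rough sketch of an “everything in one Lambda” handler might look like the following. It assumes Postgres on both ends via psycopg2; the connection strings, table names, and the toy “count likes per user” transform are all placeholders:

```python
# Sketch of the "run the whole ETL inside one Lambda" variant from idea 3.
# Assumes Postgres on both ends via psycopg2; connection strings, table
# names, and the toy transform are all placeholders.
import os

import psycopg2


def handler(event, context):
    prod = psycopg2.connect(os.environ["PROD_DB_URL"])
    warehouse = psycopg2.connect(os.environ["WAREHOUSE_URL"])

    # Extract: pull raw likes from the production database.
    with prod.cursor() as cur:
        cur.execute("SELECT user_id, photo_id FROM photo_likes")
        rows = cur.fetchall()

    # Transform: toy example, count likes per user.
    counts = {}
    for user_id, _photo_id in rows:
        counts[user_id] = counts.get(user_id, 0) + 1

    # Load: push the results into the warehouse.
    with warehouse.cursor() as cur:
        cur.executemany(
            "INSERT INTO analytics.style_scores (user_id, like_count) VALUES (%s, %s)",
            list(counts.items()),
        )
    warehouse.commit()

    prod.close()
    warehouse.close()
```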

If you can express your logic as SQL queries in dbt, I think that’s the most straightforward option when coupled with Sinter. Failing that, Airflow is probably the next best option to explore, since it was built with this purpose in mind. All in all, I’d pick a path that feels approachable and take it piece by piece (vs. trying to get it all working end to end right away).

Good luck!


#6

I second @jeffthink 's response above: if you can go with #1, that is the way to go, and it seems like it will accomplish what you need. I use Stitch to pull all the data from our production databases into our warehouse, then rip through it using dbt. Sinter schedules the job, the archiving, the data testing, all of it, and I get the data I need to give to others or for my BI tool to query.