Modernizing your data analytics infrastructure
Dan Murphy, June 15 2023
An example business intelligence interface. Image courtesy of Metabase.
Great expectations
It’s easy to get lost in the buzzword mania surrounding “data analytics”. There are data warehouses and their much more fun sounding cousin, data lakes. Snowflake invites you to “Join the Data Cloud”.
Amazon Redshift seems to promise quite a lot when it says it will enable you to “Break through data silos and gain real time and predictive insights on all your data in a few clicks, with no data movement or data transformation.” Ah yes, adopt Redshift and the insights just ✨ appear ✨ as if by magic.
I don’t mean to imply that these concepts and services aren’t useful. They certainly are under the right circumstances. But my unscientific opinion is that people at smaller organizations are sometimes daunted or turned off by the grand promise of “data analytics”, when in fact there are some relatively simple things they can and probably should to do start getting value out of data they already have.
The problem (and opportunity)
The problem that I’ve seen at a number of companies over the years is that while people in the business do want and need data to do their jobs, the only means for getting them that data is a not-very-helpful reporting system that was created as an afterthought.
A few real use cases from companies I’ve worked at:
- Accountants at a fintech need a list of all the transactions processed for a given time period, grouped by the entity associated with the transaction.
- The CEO of a SaSS platform needs subscription and churn info for all customers.
- The head of marketing needs to know how prospects on the sign up page respond to different pricing structures.
- A product manager wants to know how long tasks within the app take to complete, grouped by certain categories and segments.
Fill in the blank for your company. People need or want data like this to make informed product roadmap and pricing decisions, they need it to provide to auditors or to present to investors, or they want it to keep everyone informed and aligned at the company All-Hands.
Unfortunately, it’s commonly the case that the reporting system is very limited because it is some one-off code that an engineer wrote to satisfy a specific request, and that code was then added to over time. The code fetches some data from the company’s main application database and maybe one or two integrated services and is then rendered on a page in the company’s app, or exportable as a CSV for use in a spreadsheet. If you’re lucky maybe it automatically uploads the data to Google Sheets for you.
The consequences of this are that:
- The system is essentially hardcoded to provide a specific set of data. The accounting team asked for transaction data, so you made a page that pulls accounting data from the database, formats it, and let’s you view or download that specific data.
- As a result, if anyone wants new data or changes to the existing format they need to go through the product and engineering teams. They file a ticket or bug someone in Slack. There’s a conversation about prioritization (suprise! It’s not a priority). Someone makes a PR, gets it reviewed, deploys it. A few weeks after the request the new data is live.
- The data is not flexible. Even though you only want three data points you have to download the entire gigantic CSV, and sometimes the download times out. You can’t combine the accounting data set with the subscription data set.
So what to do? Is it time for a data warehouse? A data lake? Does the team need to learn Snowflake or Redshift? As with most things in software development, it’s wise to first consider what small incremental steps you can take to improve your situation.
The good news here is that in my experience there is a tremendous opportunity to provide real value to the organization with relatively low effort.
Answer your own question
A great starting point is to set up a “business intelligence” (BI) interface. Basically an app that provides a read-only connection to your database along with conveniences such as graphing, setting permissions for who can read what tables, and a query GUI for folks that aren’t comfortable with SQL.
Metabase is a great option here. Affordable, simple to set up, and very feature rich as you start to lean into it.
Setting up such an interface provides many benefits. The immediate win is that people can start answering their own questions without making a request to product and engineering.
They can see or search for what tables exist in the database, make queries with the GUI, and get the info they need on screen or exported to a CSV or XLSX.
No waiting on a ticket to get filed, prioritized and completed, and no extra work for product and engineering. Some sophisticated users will figure out how to JOIN tables together. Grouping, filtering and sorting large data sets is much faster than it was in Excel or GSheets. As many analysts know, with enough rows in a sheet you often can’t do these things at all because the program will freeze.
That’s all well and good, but there is a less obvious secondary benefit to doing this. In many cases the user won’t be able to answer their own question.
They’re having trouble understanding the database schema - who could blame them? Even the engineers find it confusing.
Or they need a very complex query to achieve what they want (I’m looking at you JSON columns and six table JOINs). You may see queries that are attempting, likely with great effort and levels of error, to duplicate some of your application’s business logic in SQL.
Maybe the data they want isn’t in the primary application database but in a third-party provider’s.
These problems coming to the surface is a good thing because it provides signal on how you can make your data even more useful.
It tells you what data is missing, what data is hard to find and what data is hard to query.
Moving upstream
Now that you have some sense of what your data shortcomings are, you can come up with a plan to address them. Typically this is done via some kind of ETL (Extract, Transform, Load) process that takes data from one place, reformats it (for ease of querying in this case), and uploads it to another place.
Rather than building a page and CSV export for a particular query, your ETL system should endeavor to build flexible data sets that can be used to satisfy many different types of questions.
To continue the accounting example, maybe instead of publishing a data set that answers the specific question “how much revenue have we recognized for each entity year-to-date?”, the data set you’d produce would be useful for answering any accounting related question. It might take the three or four key tables involved, JOIN them together, filter out superfluous fields, flatten out some JSON columns, and rename columns so they’re easier for non-engineers to understand.
This process essentially takes a very complex end user query and moves it upstream, so that an automated system is doing the work of combining and formatting data rather than business users who may have limited knowledge of the data and the query language they’re using.
For a small company it’s likely possible to satisfy the majority of query requests by building a small number of such general purpose tables.
One very flexible data format here is an event log. Rather than a static view of some data at a given point in time you can record immutable events for the key actions in your system: “Thing A happened at Time B, with Value C”. These events can be rolled up into nearly any variation of the data you can imagine, and for any time period.
When I worked at Wunder Capital we got a lot of mileage (many different questions answered) out of two event logs that tracked all activity in our underwriting and task management systems.
Great, we know we can add value by providing more data and by making it easier to query. We can do so flexibly via general purpose tables or event logs, rather than highly specific data sets for a particular question.
But how should we go about producing these flexible data sets?
Data pipelines
Sweet, sweet data pipelines. Image courtesy of Hevo.
There are plenty of companies that provide ETL pipelines as a service, such as Hevo and Stitch, and you’ll hear them refer to these systems as data pipelines.
The big value adds from services like this are:
- They integrate with a ton of systems. Different types of databases (SQL, noSQL, data warehouses) and popular services you may be using such Google Analytics or GitHub.
- They handle all of the infrastructure and processing you need to do do the ETL. No need to spin up infrastructure on demand to handle the workloads, manage service rate limits, or write code to make API calls to get data from your service providers.
- They have built-in support for transforming the data. Given your input data’s schema you can rename, remove, add or combine fields for the output schema.
- In the case of importing data from databases (as opposed to from third-party service APIs) you can get near real-time results. Updates from the input database are continuously streamed to the data pipeline service, which feeds them into your output database¹. My experience is that end users always want the data to show up faster. If you’re writing your own code do the ETL and running that code every hour or two, then the data will likely be fresh enough for most reporting and analytics use cases. But there will be exceptions to this. Undoubtedly someone will change some data in your application and expect it to show up in your query interface seconds later so that they can present to investors/the leadership team/customers. And so this is a very nice feature.
For a lot of companies I think adopting one of these tools can be a big win. They aren’t hard to set up, are priced very reasonably, and can save you significant engineering time building a similar (probably lower quality) ETL system on your own.
Roll, roll, roll your own?
Is your best option necessarily to use a data pipeline service? I think in the majority of cases, yes, you should use an off the shelf solution.
But a real shortcoming of doing your ETL through data pipeline services is that if you want to leverage your application’s business logic for transformations you’ll need to duplicate that logic in the service.
The more complex this logic is and the more frequently it changes the harder it’s going to be to faithfully replicate it in the data pipeline service, which will have you specify transformations in a GUI or via a query language like SQL.
Some services, like Snowflake’s Snowpark, let you specify transformations in languages like Python and Java. But even if your application is written in one of those languages you’d need to duplicate any business logic you want to use.
When I worked at Wunder Capital we had a rules engine written in Ruby that users could configure at runtime. We also had complex set operations and graph traversal written in Ruby.
To replicate those transformations outside of Ruby in a third party platform would’ve been very difficult, and one of the main things we cared about was consistency between the way our main application presented data and the way it showed up in our query interface.
If you need to perform transformations like this and your application already has some of the key ingredients for an ETL system, such as:
- Background job processing infrastucture. Ideally this infrastructure supports retrying failures and processing big workloads in parallel.
- Integrations with the APIs and data stores you care about.
- An ability to auto-scale infrastructure to quickly process heavy workloads.
- Read-replica databases so that your auto-scaling workers don’t overwhelm your primary database.
Then it isn’t so difficult to set up some scheduled jobs that take your data, perform the complex transformations you need using your existing business logic code, and bulk insert/upsert the transformed data to an analytics database connected to your query interface.
This is exactly what we ended up doing at Wunder. Maintaining our own ETL code so that we could leverage our main application logic to do the complex transformations we needed.
There are big downsides to doing this! The cost in terms of engineering time will be high. You may incur significant infrastructure costs to process the data quickly. Running all of those background jobs may trip your service rate limits or negatively impact primary application performance. But for our organization those downsides were worth the ability to make the results of those complex computations easily queryable.
Do you need a data warehouse?
When setting up your BI interface and data pipelines what type of data store should you connect to? Can you simply connect to your primary database? Do you need to adopt a data warehouse? What even is a data warehouse?
The main options for your analytics data store are:
- Your primary database.
- A read-replica of your primary database.
- A separate database of the same type as your primary.
- A data warehouse.
Which option you should go with depends on how much data you have, what kind of load you put on your primary database, and what data stores you already have set up.
If the only database your organization currently has is your primary then clearly the easiest thing to do is simply to connect to that database. The BI connection will be read-only and so there’s no risk of users mutating data as they ask questions.
The big risk there is performance. Will data pipelines or users hammering your primary database with slow analytics queries impact the performance of your primary application? Will the analytics queries themselves be untenably slow?
Using a read-replica insulates your primary database from the impact of the analytics queries, without your team needing to get familiar with a new technology like a data warehouse. Setting up a read-replica is a common practice that can generally help your application scale database reads without overwhelming the primary, so you may want to do this even absent a need for supporting analytics infrastructure like BI tools and data pipelines.
If you have set up data pipelines then you’ll need to be able to write their output somewhere. In that case you can’t rely solely on a read-replica of the primary. You’ll either need to write to the primary or to a separate database. There’s no reason that separate database can’t be of the same type as your primary. When I worked at Wunder Capital we set up a new Postgres instance for our data pipelines to write to, which served as an easy way to insulate the primary from all of those writes while not having to adopt a new technology like a data warehouse.
If the analytics queries themselves are intractably slow or if your storage costs on a traditional database are very high, then a data warehouse can help you out.
A data warehouse is just a database specialized for analytics workloads. Typically they offer:
- Very cheap storage. So if you’re storing petabytes of data this will be a lot cheaper than storage in your typical RDBMS.
- Columnar rather than row storage, which can greatly speed up analytics queries.
- Parallelized processing of queries, which again can greatly speed up analytics queries.
- Compute infrastructure for performing transformations on your data.
In practice though I think most organizations don’t need a data warehouse to start making use of their data. If you have a well-indexed SQL database there is often no reason you can’t comfortably run your analytics queries on tables with many millions of rows.
Metabase has a good post that goes into more detail on the considerations for choosing an analytics data store.
Pragmatism
At the end of the previous blog post I said “Radiant Software’s primary concern is always pragmatic problem solving”. Hopefully this post has offered some pragmatic ways in which you can improve your organization’s ability to leverage data, while also surfacing some of the nuances in a build-vs-buy decision for data pipelines.
There rarely are choices in software development as black and white as “yes, *all* organizations should do X”, and so the key thing is to think critically about your specific situation and needs before adopting the most common solution or tool.
Could your business benefit from expert software engineering, product development or team building? Get in touch at hello@rdnt.software or visit https://rdnt.software/ to get started.
Radiant Software builds exceptional digital products using Ruby on Rails, React, Node.js, Python or your technology of choice. We can accelerate your product development process with rapid prototypes, A/B testing, and analysis of technical feasibility. Or leverage our experience to level up your engineering team with technical interviewing, mentorship, and training in modern software development practices.
Footnotes
- In the case of a SQL database this is going to leverage the same Write Ahead Log (WAL) that is used to support read-replicas.