With the right DataOps architecture, you can get much more out of the investment you’re making in analytics by addressing how you deliver data to the organization.
When it comes to using data to improve your business or find new markets, the real value lies in the insights you derive from your data. The quest for insights drives investment beyond traditional business intelligence, to analytics, machine learning and artificial intelligence. But because your investment is only as good as your data, it’s important to understand the details of where the data came from and how it has been processed.
Otherwise, you risk applying the wrong data to the right question and getting the wrong answer. DataOps is about ensuring that you apply the right data to the right question and get the right answer.
What is a DataOps architecture?
Organizations are investing heavily in modern analytics and AI and machine learning analysis models. However, according to a recent study on the state of data governance and data empowerment, 50 to 70 percent of enterprise data is classified as “dark data,” otherwise defined as data that is stored but not used. Because this data is not understood or used, the complex analysis tools that are meant to add clarity and ROI to this data are at risk of being used incorrectly simply because the underlying data isn’t ready for analysis.
It’s useful to clarify the difference between DataOps and “data operations,” a term that has been around since the dawn of IT. The latter refers to managing data processing and the daily operations of databases, whereas the former refers to how you provide data to analysts and decision makers. DataOps applies to the last mile, where people begin consuming data with little real understanding of where it comes from and what has gone into it. They don’t work with the data regularly, so they have only a sort of tribal knowledge of it.
A DataOps architecture applies agile and DevOps principles to data. It brings modern software development paradigms to the way you provide data to the organization.
Legacy data architectures vs. DataOps architectures
Legacy data architectures were typically designed and built for specific organizational needs.
A typical example is the traditional data warehouse and data mart architecture. It was designed for data aggregation to answer typical BI questions, such as year-over-year sales performance, but that approach comes with many constraints.
Typically, this data was transformed prior to loading to ensure consistency in the data warehouse tables. This would potentially degrade the wider applicability of that data as it was no longer in its raw form.
For example, sales data from different sources may have different levels of precision. When aggregating data from these sources, architects will often use the lowest common level of precision (the fewest decimal places), losing detail in the process.
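To make that concrete, here is a minimal sketch, with made-up figures, of how harmonizing two sources to the coarser precision loses detail:

```python
from decimal import Decimal

# Source A records sales to 4 decimal places; source B only to 2.
source_a = [Decimal("10.1234"), Decimal("20.5678")]
source_b = [Decimal("10.12"), Decimal("20.57")]

# A legacy warehouse load might quantize everything to source B's precision
# so the aggregated tables stay consistent.
harmonized = [x.quantize(Decimal("0.01")) for x in source_a] + source_b

print(sum(source_a) + sum(source_b))  # 61.3812 -- full detail preserved
print(sum(harmonized))                # 61.38   -- detail lost before aggregation
```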
Additionally, deploying new data and capabilities in the data warehouse was cumbersome. In many cases, historical data would need to be retrofitted to these new structures, again changing the raw value of the data. These requirements meant that the adoption of new use cases and data represented significant risk and latency in meeting the needs of the business.
Because the data required heavy processing before people could finally use it, by the time it was consumed it was no longer in its true and original form. Legacy data architectures were well-suited to their purpose, but they did not readily accommodate change. As a result, they have a reputation for being brittle and inflexible.
A modern DataOps architecture allows for new data and requirements — even in real time — to be added or modified with a minimum of interruptions and latency in the data flow. It also allows for the concept of a fabric, which makes it clear what that data is, what its quality is and how you should and should not use it.
With a fabric, you have different options for data sources for a specific use case and the data is left in its raw state. That way, you can repeatedly query the data in new ways without restarting the process. Your data is always there, waiting.
What are the key components of DataOps architecture?
1. Data sources
Data sources form the core of a DataOps architecture, encompassing databases, applications, APIs, and external systems with structured or unstructured data, located on-premises or in the cloud.
A robust DataOps architecture must overcome integration challenges, ensuring clean, consistent, and accurate data. Essential practices like data quality checks, profiling, and cataloging maintain an accurate and up-to-date view of the organization’s data assets.
2. Data ingestion and collection
Data ingestion and collection involve the process of acquiring data from various sources and bringing it into the DataOps environment. This process can be carried out using a variety of tools and techniques, such as batch processing, streaming, or real-time ingestion.
In a DataOps architecture, it’s crucial to have an efficient and scalable data ingestion process that can handle data from diverse sources and formats. This requires implementing strong data integration tools and practices, such as data validation, data cleansing, and metadata management. These practices help ensure that the data being ingested is accurate, complete, and consistent across all sources.
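As a rough illustration, the sketch below shows a batch ingestion step that validates and cleanses rows and attaches ingestion metadata. The file layout, field names and rules are hypothetical, not a prescribed design:

```python
import csv
from datetime import datetime, timezone

REQUIRED_FIELDS = {"order_id", "customer_id", "amount"}

def ingest_orders(path):
    """Read a CSV batch, keep valid rows, reject the rest."""
    valid, rejected = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Validate: required fields present and amount is numeric.
            if not REQUIRED_FIELDS.issubset(row) or not _is_number(row["amount"]):
                rejected.append(row)
                continue
            # Cleanse and attach ingestion metadata for later lineage.
            row["amount"] = float(row["amount"])
            row["_ingested_at"] = datetime.now(timezone.utc).isoformat()
            row["_source"] = path
            valid.append(row)
    return valid, rejected

def _is_number(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False
```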
3. Data storage
Following data ingestion, it is necessary to store the data in a suitable platform capable of accommodating the volume, variety, and velocity of the processed data. Potential data storage platforms include traditional relational databases, NoSQL databases, data lakes, or cloud-based storage services.
In the development of a DataOps architecture, attention must be paid to the performance, scalability, and cost implications of the chosen data storage platform. Furthermore, considerations regarding data security, privacy, and compliance become crucial, particularly in the management of sensitive or regulated data.
4. Data processing and transformation
Processing and transforming data entail manipulating and converting raw data into a format suitable for analysis, modeling, and visualization. Operations in this phase include filtering, aggregation, normalization, enrichment, as well as advanced techniques like machine learning and natural language processing.
Within a DataOps framework, it is advisable to automate and streamline data processing and transformation using tools and technologies capable of managing substantial data volumes and intricate transformations. This may entail utilizing data pipelines, data integration platforms, or data processing frameworks.
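For instance, a single transformation step often combines several of those operations. The sketch below, with hypothetical field names and lookup values, filters out test records, normalizes country codes, enriches with a region lookup and aggregates revenue per region:

```python
from collections import defaultdict

REGIONS = {"US": "AMER", "CA": "AMER", "DE": "EMEA", "FR": "EMEA"}

def transform(orders):
    revenue_by_region = defaultdict(float)
    for order in orders:
        if order.get("is_test"):                      # filtering
            continue
        country = order["country"].strip().upper()    # normalization
        region = REGIONS.get(country, "OTHER")        # enrichment
        revenue_by_region[region] += order["amount"]  # aggregation
    return dict(revenue_by_region)

orders = [
    {"country": "us", "amount": 120.0, "is_test": False},
    {"country": "de", "amount": 80.0, "is_test": False},
    {"country": "US", "amount": 5.0, "is_test": True},
]
print(transform(orders))  # {'AMER': 120.0, 'EMEA': 80.0}
```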
5. Data modeling and computation
Data modeling and computation involve structuring the prepared data and running analytical workloads against it. A data model captures the types of data available, how they are organized and how they relate to one another, giving analysts, data scientists and BI tools a consistent foundation for queries, reports and machine learning models.
Within a DataOps framework, models and the computations that depend on them should be versioned, tested and deployed with the same automation applied to the rest of the pipeline, so that changes to structures or business logic reach data consumers quickly without breaking existing analyses.
The advantages of DataOps architecture
The main advantages of a DataOps architecture are shorter time to insights, ease of adaptation and enhanced data lineage.
Shorter time to insights
Traditionally, data sources were managed as islands of data (for example, the systems that create or capture transactions). You centralized all of the data from different sources in a data warehouse and massaged it to fit together. And then your decision support (or business intelligence) was a series of standardized processes you ran against the data warehouse to answer questions like “How are sales this year versus last year?” They were well-defined questions applied to the data.
But there was always latency. You pulled all the data from the transactional systems at a given point in the month, massaged it and developed your analysis. Two weeks later, you might update it for month end, then once a year at year end. You never had the data at the moment of availability because the processes took time to extract, transform, load (ETL) and massage. Thus, people were always waiting for data.
Ease of adaptation
Often the process of massaging data introduces change or degradation in some way. For example, instead of individual transactions, you got aggregations or collections of transactions that addressed your first question but were not easily reusable for subsequent questions. To get the raw data, you had to start over with a query of the originating systems, which is troublesome and time-consuming. Besides, it adds to the workload on those systems, slowing them down while your co-workers and customers are trying to use them to conduct live business.
The essence of modern analytics is that you don’t always know the questions you want to ask of the data, but when they occur, you want answers fast. The processes that legacy architectures required for working with data were built for questions that examined the data along the same, predefined dimensions.
That’s why DataOps architectures look at data as a pipeline by considering its sources and where it’s being consumed. The goal is an uninterrupted flow — as close as possible to real-time — to reduce latency. Instead of ETL, they focus processes on ELT (extract, load and transform) so that you can shape the data to fit the specific questions you’re asking, while leaving the raw data in place for subsequent access and subsequent questions.
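A minimal ELT sketch, using SQLite purely for illustration, shows the idea: the raw rows are loaded untouched, today's question is answered by transforming on read, and tomorrow's different question runs against the same raw table. Table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (sold_at TEXT, store TEXT, amount REAL)")

# Extract + Load: land the data as-is, no up-front transformation.
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [("2024-01-03", "north", 120.0),
     ("2024-01-05", "south", 80.0),
     ("2024-02-01", "north", 60.0)],
)

# Transform on read: shape the raw rows to fit today's question ...
monthly = conn.execute(
    "SELECT substr(sold_at, 1, 7) AS month, SUM(amount) "
    "FROM raw_sales GROUP BY month ORDER BY month"
).fetchall()
print(monthly)  # [('2024-01', 200.0), ('2024-02', 60.0)]

# ... while tomorrow's different question runs against the same raw table.
by_store = conn.execute(
    "SELECT store, SUM(amount) FROM raw_sales GROUP BY store ORDER BY store"
).fetchall()
print(by_store)  # [('north', 180.0), ('south', 80.0)]
```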
Enhanced data lineage
In deciding whether to use data or not, people want to know where the data has come from and what has happened to it along the way. That’s the role of data lineage.
DataOps architectures provide intelligence around the data so that the people who consume it trust it and know that they have the right data for the questions they’re asking. Data intelligence leads into data governance, or ensuring that people know how they should and should not use the data. That’s at the heart of protecting privacy and personally identifiable information (PII) to mitigate the company’s risk.
People also want to understand what the data means, how it relates to the business and whether the data source is of high quality. Low-quality data contains defects like duplicate records and dummy records that can skew results.
What to account for when adopting a DataOps architecture
While the DataOps architecture is advantageous to most enterprises, it does not spring up on its own. Companies have to account for several changes in their data landscape as they try to move to a DataOps architecture.
Data governance
A data governance framework enables people to quickly decide whether and how they can use the data. It provides the answers to basic questions:
- What is this data?
- What’s in it?
- What are the organization’s rules around it?
- Who has permission to see it?
- Who does not have permission to see it?
- Is it associated with privacy regulations?
A data catalog is the repository of all that information.
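One way to picture a catalog entry is as a small record that carries the answers to those questions alongside the data asset. The structure and field names below are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str                                           # What is this data?
    description: str                                    # What's in it?
    usage_rules: list = field(default_factory=list)     # Organization's rules
    allowed_roles: list = field(default_factory=list)   # Who may see it
    denied_roles: list = field(default_factory=list)    # Who may not
    privacy_regulations: list = field(default_factory=list)  # GDPR, CCPA, ...

entry = CatalogEntry(
    name="customer_orders",
    description="One row per completed customer order, refreshed hourly",
    usage_rules=["No export outside the analytics environment"],
    allowed_roles=["analytics", "finance"],
    denied_roles=["external_partners"],
    privacy_regulations=["GDPR", "CCPA"],
)
print(entry.name, entry.privacy_regulations)
```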
Regulatory compliance
The insights mentioned above rely on using data for analytics, and some of that data is a by-product of customer transactions. You’re responsible, therefore, for doing the right thing with customer data and respecting the wishes of people who want to be anonymous to or forgotten by your business. That’s becoming a fast-growing consumer right, with laws and regulations that you must observe whenever you analyze data.
Consider the General Data Protection Regulation (GDPR) in the European Union, or the California Consumer Privacy Act (CCPA) and the New York Privacy Act in the U.S. Their goal is to ensure that people in those regions are not personally identifiable through the data your organization is collecting. A big part of that is preventing customers’ identity and behavior from passing through your analytics engines. Misuse of data can lead to financial penalties and loss of reputation. It’s important to ensure that you’re not putting the company at risk by accelerating your analytics program without the required controls.
Data quality/data preparation
If your internal users don’t understand the data, they’re likely to miss important aspects that could potentially skew their results. That also applies to the need to deliver high-quality data or, if you cannot guarantee high quality, at least a clear indication of the quality of the data you’re delivering. Without that, your analysts can’t see how to adjust their algorithms to account for imperfect records in the data.
Visibility, measurement and monitoring for quality should underpin your DataOps architecture so that users know the state of the data they are using to develop analyses. They can then decide whether to clean up the data themselves or adjust their models to compensate.
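A quality report that travels with the data is one lightweight way to provide that indication. The sketch below, with hypothetical field names and defect rules, counts duplicates, missing keys and dummy records rather than silently fixing them:

```python
def quality_report(records, key="customer_id"):
    """Summarize defects so analysts can judge the data before using it."""
    seen, duplicates, missing_key, dummy = set(), 0, 0, 0
    for rec in records:
        value = rec.get(key)
        if value in (None, ""):
            missing_key += 1
        elif value in seen:
            duplicates += 1
        else:
            seen.add(value)
        if rec.get("name", "").strip().lower() in {"test", "dummy", "n/a"}:
            dummy += 1
    return {
        "records": len(records),
        "duplicate_keys": duplicates,
        "missing_keys": missing_key,
        "dummy_records": dummy,
    }

records = [
    {"customer_id": "c1", "name": "Ada"},
    {"customer_id": "c1", "name": "Ada"},    # duplicate key
    {"customer_id": None, "name": "dummy"},  # missing key and dummy record
]
print(quality_report(records))
# {'records': 3, 'duplicate_keys': 1, 'missing_keys': 1, 'dummy_records': 1}
```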
Stakeholder communications through data modeling
The DataOps architecture thrives on communication between the people who provide the data and the people who create algorithms to analyze the data. One of the best vehicles for that communication is data modeling.
A data model is a visual representation of all available data, indicating types of data, how it’s organized and how its components connect and relate to one another. It doesn’t have to be deeply technical as long as it plainly shows users the relationships among data.
For example, suppose that you don’t know your organization’s data very well, but you know that you want a specific subset of customer information. From a data model, you can see the data points you want and take your request to the data administrators. They look at it, compare it to the different models they have of the production data and quickly see the connections and differences. Then they can figure out how to get the customer data from where it is now to a position where it can be better analyzed.
The data model is integral to the requirements piece of DataOps. It helps users see which data they have, which data they need and how they can get it into a useful format.
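The model doesn't have to be a formal diagram to serve that purpose. Even a lightweight, shared description of entities and their relationships, sketched below with hypothetical names, gives users and data administrators a common vocabulary:

```python
DATA_MODEL = {
    "Customer": {
        "fields": ["customer_id", "name", "email", "region"],
        "relationships": {"places": "Order"},
    },
    "Order": {
        "fields": ["order_id", "customer_id", "ordered_at", "amount"],
        "relationships": {"belongs_to": "Customer", "contains": "OrderLine"},
    },
    "OrderLine": {
        "fields": ["order_id", "product_id", "quantity", "unit_price"],
        "relationships": {"belongs_to": "Order"},
    },
}

# A user who wants a specific subset of customer information can see at a
# glance which entities and keys their request touches.
for entity, spec in DATA_MODEL.items():
    print(entity, "->", ", ".join(spec["relationships"].values()))
```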
Mistakes to avoid when implementing a DataOps architecture
The move to a DataOps architecture can be unfamiliar to companies that have more of a legacy data architecture. Keep an eye out for several common pitfalls.
Investing in analysis instead of the data
In moving from the heyday of business intelligence analysts, data analysts and business analysts, some companies will invest heavily in data science teams, technologies and high-end analytics tools. The data scientists bring in statistical analysis capabilities for applying artificial intelligence to the data to find new pockets of value in it.
That new approach to analytics may seem like a silver bullet, but it pays off only if you’ve made a similar investment in your data. You should first address the data problems described above:
- Low quality
- Latency in delivery
- Poor understanding of the data
- Lack of data model for communicating users’ needs
So by all means invest heavily in analytics, but be equally mindful about investing in your data.
Hard-coded data pipelines
With a modern DataOps architecture, it’s tempting to think that nothing can go wrong once you know where your data pipelines are. True, pipelines are still just a combination of databases and processes that move data from one place to another. But it’s a mistake to hard-code them without thinking about, and building in, agility and adaptability.
Suppose you hard-code the points where the data enters and where people consume the data. What if consumption suddenly spikes, or there’s a major disruption like a pandemic that makes your workers switch en masse to working from home?
Managing the pipeline without hard coding involves active performance monitoring, tuning the data through replication across your environment and pivoting based on continually changing requirements of your users.
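One practical way to avoid hard-coding is to keep the routing in configuration so the pipeline code only reads it. The sketch below, with a hypothetical configuration structure, illustrates the idea:

```python
import json

# Sources, targets and refresh rates live in configuration, so a spike in
# consumption or a sudden shift in where people work means editing config,
# not rewriting the pipeline.
PIPELINE_CONFIG = json.loads("""
{
  "sources": [{"name": "orders_db", "kind": "postgres", "refresh_minutes": 15}],
  "targets": [{"name": "analytics_lake", "kind": "object_store", "replicas": 2}]
}
""")

def run_pipeline(config, move_fn):
    # move_fn is whatever actually copies the data; the pipeline logic
    # only reads the routing from configuration.
    for source in config["sources"]:
        for target in config["targets"]:
            move_fn(source["name"], target["name"])

run_pipeline(PIPELINE_CONFIG, lambda src, dst: print(f"moving {src} -> {dst}"))
```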
More data than the business can digest
In most companies, the people on the data management side are overwhelmed by the amount of data they have. But the people on the business side claim that they’re missing opportunities because they don’t have enough data to satisfy all of their use cases.
If the business side of your organization is telling you that data is not there, then they’re unlikely to trust or effectively use the data they do have. Unfortunately, this can undercut any sort of serious analytics and data-driven approach to business. Once people lose trust in the architecture and processes, they tend to go outside and develop back channels to the data they think they need. Following the path of least resistance can lead users unwittingly to data they don’t understand and cause them to make important decisions on faulty insights.
5 best practices for implementing DataOps architecture
What goes into a smooth implementation of a DataOps architecture? Consider how the best practices below fit into your organization so you can get your arms around your data for a smooth handoff.
1. Catalog data assets and pipelines
The data catalog is the key to understanding which data is available, where it’s located and how it flows throughout the organization. Through metadata management at the heart of the data catalog, you can describe your data in different contexts for different use cases and user profiles. Automation driven by metadata also helps deliver data pipelines faster.
2. Apply a data governance framework
As described earlier, a data governance framework holds detailed information about every data element and data source. It should be a single source of all relevant artifacts as the data is developed, stored, managed, transformed, moved, integrated and consumed across your enterprise. A broad framework enables you to use data intelligence to protect the data, mitigate risks associated with it and capture the business value in the data.
3. Monitor, document and remediate data quality
Without acceptable data quality, users will grow to mistrust the data. That undermines the data-driven decision making that is fast becoming table stakes in business. Database professionals are at the forefront of ensuring that the right data is accessible to all users for analysis and reporting.
4. Automate data source and pipeline deployment
Only through automation can you keep up with the pace of new data creation. With automation in place to ingest new data, you don’t have to wait for people to operationalize it — it happens as part of the daily or hourly refresh. It ensures that users can reach and understand the data and decide whether it’s suitable for use and decision making.
5. Monitor and optimize data source and pipeline performance
With monitoring and measuring in place, you can ensure the data is available where and when it is needed. Your goal is to provide an independent view into how data is flowing and whether you’re meeting your service level agreements. Your monitoring structure should include processes for quickly addressing any problems users identify in the data.
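As a simple illustration, a freshness check against a service level agreement might look like the sketch below. The SLA values and dataset names are hypothetical, and in practice the last-loaded timestamps would come from pipeline metadata rather than a literal dict:

```python
from datetime import datetime, timedelta, timezone

SLA = {"customer_orders": timedelta(hours=1), "web_events": timedelta(minutes=15)}

last_loaded = {
    "customer_orders": datetime.now(timezone.utc) - timedelta(minutes=20),
    "web_events": datetime.now(timezone.utc) - timedelta(minutes=45),
}

def check_freshness(sla, last_loaded):
    """Return datasets whose age exceeds the agreed maximum."""
    now = datetime.now(timezone.utc)
    breaches = {}
    for dataset, max_age in sla.items():
        age = now - last_loaded[dataset]
        if age > max_age:
            breaches[dataset] = age
    return breaches

for dataset, age in check_freshness(SLA, last_loaded).items():
    print(f"SLA breach: {dataset} is {age} old")  # flags web_events here
```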
Conclusion
The essence of a DataOps architecture is collaborating and bringing business groups together by breaking down the silos that exist between data and the business that uses it. This flexible architecture offers a common ground where people with different perspectives, data needs and use cases can work together on a level playing field and get the greatest value out of your organization’s data.