The massive data volumes that organizations use and produce are becoming increasingly valuable. However, businesses face key challenges turning that data into actionable information. A big part of that challenge is making sure organizational data is in the right place at the right time. All too frequently, making that a reality requires data movement.
Let’s set the stage. What is data movement?
Data movement is the ability for data to move from one place to a different location. The main goal of data movement is to ensure that the right data is in the right place at the right time. While that may sound simple, it is not, especially as your company’s data grows in size and value. IT professionals often find themselves spending more of their time looking to move data so that decision makers can use it for profitable insights.
Why is data movement important?
If data movement is taking data you have in one location or data store and making an identical copy of it to be used elsewhere, then the premise is if something happens to data in the first location, your data should also be available to your business from the secondary location.
Data movement is not a new concept, but the reasons for moving data have evolved quickly. Most organizations turn to data movement to address slow-moving but consequential use cases in their data landscape.
- IT Infrastructure– Your organization begins to see the end of useful life to its software applications or hardware infrastructure. Your data cannot stay there forever, so you decide to modernize, and migration to new assets becomes a priority.
- Volume – The amount of inbound data and your capacity to store and process it may exceed what you have on hand. It becomes necessary to move to different, higher-capacity storage.
- Value – New analytics tools and practices hold the promise of getting more value from data in unprecedented ways. You can use those tools to find customers, determine what they’re doing, engage with them more effectively and further optimize spending. But your organization has to get the data in front of new types of users without impairing performance, so you turn to data movement.
Methods of moving data
Data movement methods have evolved to keep up with changing business needs and improvements in database performance.
Extract, transform, load (ETL)
ETL, the method that has been around the longest, starts with a source of persistent data storage. A program or script extracts data from that source, then transforms it; for example, by converting the M’s and F’s in the Gender field to zeros and ones, or vice versa. The program then loads the transformed data into a different data store elsewhere for a specific use.
Extract, load, transform (ELT)
The problem with ETL is that it entails touching the data multiple times, with each touch taking time and processing power. Worse yet, if the ETL takes place on the same hardware used for the production application – say, your customer relationship management (CRM) or point-of-sale system – at that point it’s consuming precious resources.
That’s where ELT comes in. Rather than transforming the data before you copy it, you extract it from the source and load it into another data store, then transform it later. Why? Because if the extract is huge and you have no efficient way of searching through it, ELT can reduce the impact on scarce computing resources like memory and network.
Reverse ETL goes in the other direction: from the target system back into the source. It’s useful if you extract the data to add to, edit or delete some of it, or if you combine data for new insights. After transformation, you load the changed data back into your main data store.
Change data capture
Suppose an organization needs to move operational data from multiple sources, like Oracle and PostrgreSQL into a centralized location for use by a business intelligence platform. You can wait for the data to land in your storage bucket, search through it and extract just the data you want. Or, with change data capture, capture only the changes you need. The result is higher performance that can get you much closer to real-time processing.
Change data capture can go on continuously and keep the target data store up to date. ETL and ELT, on the other hand, are more useful as intermittent methods, say, every eight or 24 hours. Capturing change data is useful when the transactions directly affect availability, as in student registration systems, airline reservations and stock trading. Updating only once or twice a day would result in a distorted, inaccurate view of the true inventory.
Synchronization is a step beyond change data capture. Synchronization ensures all changes happen in as close to real time as possible in which adds, edits and deletions go to both source and target systems as close to simultaneously as practical.
Consider shoppers on a trip far from home. They try to make a purchase with a credit card, but the purchase is declined because the credit card company doesn’t know they’re traveling, or because it’s an unusual purchase. If it takes the company more than a few seconds to figure that out, the information will be useless because most shoppers will have abandoned the transaction. But if the company’s authorization system is synchronized with, say, its CRM, the company can promptly send a text message. “We noticed an attempt to make a purchase on your credit card from a store in Prague. Is this you?” The shoppers confirm it and the sale proceeds. It would be difficult to attain that level of responsiveness with ETL, or even with ELT.
Data movement markets
IDC defines two different sub-markets for data movement.
- Database replication has traditionally been used to make complete copies of databases for backup/recovery, disaster recovery or high availability. The source- and target-databases were usually of the same type and structure, so data movement was relatively easy. But with the evolving need for extensive insight into the data, replication has become only part of the story.
- Dynamic data movement is for movement between (mostly) dissimilar types and structures. As the amount of data increases and new, broader uses are identified, dynamic data movement is overtaking traditional database replication. Your organization may embrace open-source and NoSQL databases, and data streaming technologies like Kafka. That almost always requires moving data selectively, in real time between different platforms; for example, from Oracle to PostgreSQL, or to MariaDB.X
Achieve database replication for high availability, scalability and interoperability between platforms with SharePlexExplore Now
The obstacles to achieving dynamic data movement
As the amount of data being collected increases, and new and broader uses are identified, traditional database replication is being overtaken by dynamic data movement. Companies are adopting new ways to store and manage data, which almost always requires selectively moving data in real time between heterogeneous platforms. However, this introduces a wide variety of obstacles.
Understanding the data
Smooth data movement requires that you understand the data. In the journey from raw data to actionable information, the actual movement of the data is only part of the process. Making the best use of that data requires that you understand it. That includes data types, data usage and data quality.
All of the previously mentioned data movement methods require an understanding of the data. That understanding extends to the structure of the target system and the ways in which the data may need transformation. At the heart of understanding is one question: “What is the decision maker looking for and trying to accomplish?”
The most effective way to understand that data is to model it. Data modeling and the discipline it entails ensure that the data you’re moving has the quality required to support your business initiatives.
To realize the promise of the value in your data, the quality of that data must be exceptionally high. The data must remain accurate and true despite any data movement.
One variant of data quality is missing data. One database may tolerate an empty field where another does not.
Another example, different source systems may use different values as attributes, but accepted values in one system may cause data quality problems in another. Some systems use numeric values to represent marital status: 1 for single, 2 for married and so on. Other systems use characters like “S” or “M.” Combining rows from those systems will result in inoperable values for marital status (if it doesn’t cause a data-type mismatch first).
Consider a bank customer with both a business checking account and a personal checking account. One day, the customer asks to change the accountholder’s name on the personal account, maybe because of marriage or legal name change – or perhaps for illegitimate reasons. The business account remains unchanged. Now you’ve got the same accountholder in your system with two different names. How would anybody else in the bank discover that discrepancy? How would they reconcile the two names? To the bank it looks like two different people living at the same address. Is that the case? These types of scenarios aren’t uncommon, and further highlight the necessity of data quality.
Moving data between different databases often involves converting data types. A column of type NUMBER in an Oracle database, for example, could be converted to a column of several types in SQL Server, including TINYINT, INT, BIGINT, DECIMAL, NUMERIC, MONEY, SMALLMONEY or FLOAT. The same column being moved to PostgreSQL could become SMALLINT, INTEGER, BIGINT, DECIMAL, NUMERIC, REAL or DOUBLE PRECISION. Storage costs can increase if the target structure requires more space. Allowable ranges for each data type may differ between databases, which could lead to loss of precision or even an outright failure to migrate the data.
Beyond understanding the data, it’s important to understand how the data will be used. For example, eCommerce and banking may require physical, legal and mailing addresses to be collected; while for a CRM system a name and email for a prospect may suffice.
Whenever you move data from one system to another, questions arise around the source of the data, and how it may have been changed or transformed on the journey. Data lineage provides a full picture of the route a given dataset has followed into, across and out of your organization. It plays an important role in resolving differences between systems and in facilitating matters as simple as rounding floating-point .
Semantic mapping, or the definition of the data, is even more nuanced than data quality. If you sell to companies (B2B) and to individuals (B2C), how do you differentiate between different types of customers? For that matter, is it important for you to differentiate? If you’re a reseller or a manufacturer, who is your customer: the entity that purchases from you or the entity that consumes/uses the product? Can it be both?
In short, stumbling on these obstacles to data movement can leave your organization subject to fines, or to lost customers because you’re missing insights and trends.
Data modeling and data governance overcomes obstacles of data movement
Data governance paired with data modeling provides the full context needed to overcome obstacles of data movement. As a visual representation of data elements and the relationships between them, data models help business and technical stakeholders collaborate in the design of information systems and the databases that power them. Data governance aims to inventory, govern and protect data while enabling that information to be shared.
When done correctly, this forces you to understand your data by answering questions like “How is this data stored?” “Why are we storing it?” and “What does it mean?”
A data modeling tool lets you scan your data to understand it better. For example, it can go through your databases and scan for a Name field and an Address field, then highlight any anomalies for you to resolve. How much overlap between Name and Address indicates that it’s the same person? How many addresses can each person have? How do you make sure you associate the address with the right person?
Or, you can use the tool to search your data for strings of nine digits – maybe separated by dashes after the third and fifth digits, maybe not. In the U.S., that would likely correspond to a Social Security number, which is sensitive data that needs to be protected. How can you find all occurrences of that string of sensitive information? Data governance helps to ensure that data is protected.
The domain of electronic medical records is thickly populated with personally identifiable information (PII) like patient name, medical history and prescription records. How do you find all the places that PII has been stored and ensure you’re protecting it adequately? You don’t want to run the risk of a HIPAA violation. Your effort may either include or exclude that sensitive data, depending on the people who will work with the data after it has been moved. That’s why data governance plays an important role in ensuring that people have access to only the right data to perform their job, and no more.
When you use data modeling and data governance tools to understand and model your data, you take a big step toward smooth data movement.
Tips to get started with data movement
Review data structure
Use a cross-platform data modeling tool to reverse-engineer your source data store. That will help you identify data types that may cause problems down the road, and generate compatible data description language (DDL) for the target systems. Advanced data modeling tools are also designed to copy data from one system to another.
Get a handle on data usage
Catalog your data elements and identify valid usage for those elements. That ensures that the data will be suitable for new usage as it arises. A data catalog can also contribute to reducing your data bloat by identifying unused and unnecessary data elements such as multiple columns containing the same data.
Explore data lineage
Data lineage helps to identify all sources and targets of the data your organization touches, and all the transformations performed by each system. That will help you find, track down and deal with data discrepancies between systems.
Suppose your investors or industry regulators want to audit your stated sales. Can you trace the figures in your annual report or your Form-10K all the way back to your point-of-sale system? Can you demonstrate that the data wasn’t changed or manipulated on the journey, and that you’re not stating your financials incorrectly? Data lineage helps ensure that even if data is transported across systems, there is a trail of what was transformed in each step of the process.
Consider data quality
Perform data analysis and modeling to locate quality problems like missing or outlying values in your data fields, and discrepancies in the data definitions between systems.
Review semantic mapping
Mapping your data semantically shines a light on data entities that are duplicates or near-duplicates. With data modeling at the semantic level you can eliminate redundancies and clear up any confusion around data definition.
Data movement is inevitable; data governance is necessary and data modeling is optional, but it shouldn’t be. As part of an overall data governance program, data modeling is a valuable but often-overlooked step on the path to smooth data movement. By modeling data as a precursor or prerequisite to data movement, you identify important data, establish where to find it and ensure you know where it needs to be. As a result, it is much easier to get the right data to the right place at the right time to help stakeholders make data-driven decisions.