Data gravity: What it is and how to manage it
When it comes to enterprise applications, access to data – and lots of it – is usually a good thing. And the greater the volume of required data held locally to where it is processed, the better for the business, its applications, decision-making and, in some cases, compliance.
But the need to store and manage data brings its own problems too, including higher costs, lower system performance, and management overheads. This is where the idea of data gravity comes in.
There is growing evidence that data-rich systems attract more data. This, in turn, attracts even more data-dependent applications, which then bring in yet more.
The term data gravity was coined by IT researcher Dave McCrory in 2010. He argued that as organizations gather data in one place, it “builds mass”. That mass attracts services and applications, because the closer they are to the data, the better the latency and throughput.
As more data comes together, the process accelerates. Eventually, you arrive at a situation where it becomes difficult or impossible to move data and applications elsewhere to meet the business’s workflow needs.
As a result, costs rise, workflows become less effective, and firms can encounter compliance problems. McCrory, now at Digital Realty, publishes a data gravity index. He expects data gravity, measured in gigabytes per second, to grow by 139% between 2020 and 2024. This will put strain on IT infrastructure, he says.
At Forrester, researchers describe data gravity as a “chicken and egg” phenomenon. A recent report on data center trends sets out the problem.
“The concept states that as data grows at a specific location, it is inevitable that additional services and applications will be attracted to the data due to latency and throughput requirements,” it says. “This, in effect, grows the mass of data at the original location.”
Harder to scale
Examples of data gravity include applications and datasets moving to be closer to a central data store, which could be on-premise or co-located. This makes best use of existing bandwidth and reduces latency. But it also begins to limit flexibility, and can make it harder to scale to deal with new datasets or adopt new applications.
Data gravity occurs in the cloud, too. As cloud data stores increase in size, analytics and other applications move towards them. This takes advantage of the cloud’s ability to scale quickly, and minimizes performance problems.
But it perpetuates the data gravity issue. Cloud storage egress fees are often high and the more data an organization stores, the more expensive it is to move it, to the point where it can be uneconomical to move between platforms.
McCrory refers to this as “artificial” data gravity, caused by cloud services’ financial models, rather than by technology.
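To make the scaling concrete, the following is a rough sketch of how a one-off egress bill grows with the amount of data stored. The flat per-gigabyte price is a hypothetical figure chosen for illustration, not any provider's actual rate, and real pricing is usually tiered and varies by region.

```python
# Hypothetical flat egress price, for illustration only.
# Real provider pricing is tiered and differs by region and destination.
ASSUMED_EGRESS_USD_PER_GB = 0.09


def egress_cost_usd(stored_tb: float,
                    price_per_gb: float = ASSUMED_EGRESS_USD_PER_GB) -> float:
    """Estimate the one-off fee to move an entire data store out of a cloud."""
    stored_gb = stored_tb * 1000  # decimal terabytes to gigabytes
    return stored_gb * price_per_gb


# The fee grows in step with the volume held, so a larger store is
# proportionally more expensive to relocate.
for tb in (10, 100, 1000):
    print(f"{tb:>5} TB -> ${egress_cost_usd(tb):,.0f}")
```

Under this simplified model the cost is linear in volume, which is enough to show why a store that has grown a hundredfold is a hundred times more expensive to move.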
Forrester points out that new sources and applications, including machine learning/artificial intelligence (AI), edge devices or the internet of things (IoT), risk creating their own data gravity, especially if organizations fail to plan for data growth.
The growth of data at the enterprise edge poses a challenge when locating services and applications unless firms can filter out or analyze data in situ (or possibly in transit). Centralizing that data is likely to be expensive, and wasteful if much of it is not needed.
Impact on storage
The impact of data gravity on storage is essentially twofold – it drives up costs and makes management harder. Costs will increase with capacity requirements, but the increase for on-premise systems is unlikely to be linear.
In practice, firms will find they need to invest in new storage arrays as they reach capacity limits, potentially needing expensive capex spend. But there is a strong chance they will also have to invest in other areas to improve utilization and performance.
This might involve more solid-state storage, tiering to move less-used data off the highest-performance systems, redundant systems to ensure availability, and storage management tools to control the whole process.
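One way to picture the non-linear cost growth described above is as a step function: capacity is bought in whole arrays, so spending jumps each time a limit is crossed. The sketch below assumes a hypothetical array size and price, and it deliberately leaves out the further spending on solid-state storage, redundancy and management tools.

```python
import math

# Hypothetical figures for illustration only.
ASSUMED_ARRAY_CAPACITY_TB = 500
ASSUMED_ARRAY_PRICE_USD = 200_000


def on_prem_capex_usd(required_tb: float) -> int:
    """Capex when capacity can only be bought in whole arrays."""
    arrays = math.ceil(required_tb / ASSUMED_ARRAY_CAPACITY_TB)
    return arrays * ASSUMED_ARRAY_PRICE_USD


# Crossing a capacity limit by a single terabyte triggers a full array purchase.
for tb in (400, 500, 501, 1000, 1001):
    print(f"{tb:>5} TB -> ${on_prem_capex_usd(tb):,}")
```

In this toy model, going from 500TB to 501TB doubles the capital outlay, which is the kind of jump that makes on-premise costs hard to match to data volumes.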
Some suppliers report that firms are turning to hyperconverged systems – which include storage, processing and networking in one box – to handle growing storage demands while balancing performance. By bringing processing and data closer together, hyperconverged systems deliver proximity and cut latency. But again, these systems are harder to scale smoothly.
In the cloud, capacity scales more smoothly, so CIOs should be able to match data storage more closely to data volumes.
However, not all businesses can put all their data into the cloud, and even those whose regulatory and customer requirements allow it will need to look at the cost and the time it takes to move data.
Proximity of data to processing is not guaranteed, so firms need cloud architects who can match compute and storage capacity, as well as ensure cloud storage works with their current analytics applications. They also need to be careful to avoid data egress costs, especially for data that moves frequently to business intelligence and other tools.
Cloud-native applications, such as Amazon QuickSight, are one option. Another is to use cloud gateways and cloud-native technologies, such as object storage, to optimize data between on-premise and cloud locations. For example, Forrester sees firms co-locating critical applications in data centers with direct access to cloud storage.
At the same time, CIOs need to be rigorous on cost management, and ensure that “credit-card cloud” purchases do not create data gravity hotspots of their own. Technologist Chris Swan has developed a cost model of data gravity that can give quite a granular picture for cloud storage.
Dealing with data gravity
CIOs, analysts and suppliers agree that data gravity cannot be eliminated, so it needs to be managed.
For enterprise CIOs and chief data officers, this means striking a balance between too much and too little data. They should challenge the business on the data it collects and the data it holds. Is all that data needed? Could some be analyzed closer to the edge?
Tackling data gravity also means having robust data management and data governance strategies. These should extend to deleting unneeded data, and applying effective tiering and archiving to cut costs.
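As a minimal sketch of what an age-based tiering rule might look like, the function below assigns a dataset to a tier according to how long it has been idle. The tier names and the 30-day and 365-day thresholds are assumptions made for illustration. Deletion is left out because it depends on each organization's retention and compliance rules.

```python
from datetime import date, timedelta

# Hypothetical thresholds; real policies depend on access patterns
# and on retention requirements.
ASSUMED_WARM_AFTER = timedelta(days=30)
ASSUMED_ARCHIVE_AFTER = timedelta(days=365)


def choose_tier(last_accessed: date, today: date) -> str:
    """Return 'hot', 'warm' or 'archive' based on time since last access."""
    idle = today - last_accessed
    if idle >= ASSUMED_ARCHIVE_AFTER:
        return "archive"
    if idle >= ASSUMED_WARM_AFTER:
        return "warm"
    return "hot"


today = date(2024, 1, 10)
for last_used in (date(2024, 1, 1), date(2023, 11, 1), date(2022, 1, 1)):
    print(last_used, "->", choose_tier(last_used, today))
```

A rule this simple only looks at idle time. A production policy would probably also weigh dataset size, retrieval cost and regulatory holds.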
Cloud will play its part, but costs need to be controlled. Firms are likely to use multiple clouds, and data gravity can cause costly data movements if application and storage architectures are not designed well. Analytics applications, in particular, can create silos. Firms need to look at the datasets they hold and ask which are prone to data gravity. The applications that depend on those datasets need to be hosted where storage can be designed to scale.
Tools that can analyze data in situ and remove the need to move large volumes can reduce the impact of data gravity and also some of the cost disadvantages of the cloud. This approach comes into its own where organizations need to look at datasets across multiple cloud regions, software-as-a-service (SaaS) applications, or even cloud providers.
Organizations should also look at the network edge to see whether they can reduce volumes of data moving to the center and use real-time analytics on data flows instead.
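A simple illustration of reducing volumes at the edge is to summarize raw readings in fixed-size windows and send only the summaries to the center. The sketch below is a hypothetical example rather than a reference to any particular edge product. The sample readings and window size are invented for illustration.

```python
from statistics import mean


def summarize_window(readings: list[float]) -> dict[str, float]:
    """Reduce a window of raw readings to a small summary record."""
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": mean(readings),
    }


def summarize_stream(readings: list[float],
                     window_size: int) -> list[dict[str, float]]:
    """Split readings into fixed-size windows and summarize each one."""
    return [
        summarize_window(readings[i:i + window_size])
        for i in range(0, len(readings), window_size)
    ]


# Invented sample data: six raw sensor readings collected at the edge.
raw = [20.1, 20.3, 20.2, 25.7, 20.0, 19.9]
summaries = summarize_stream(raw, window_size=3)
print(len(raw), "readings reduced to", len(summaries), "summary records")
```

Keeping the minimum and maximum alongside the mean means an outlier, such as the 25.7 reading here, is still visible centrally even though the raw data never leaves the edge.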
With ever-growing demand for business data and analytics, CIOs and CDOs are unlikely to be able to eliminate data gravity. But with new and emerging data sources such as AI and IoT, they at least have the chance to design an architecture that can control it.