Tito George is the co-founder of logiq.ai. He has over 14 years of experience in software systems development. Based in Bangalore, India, Tito enjoys developing distributed systems with a strong focus on DevOps and observability. He enjoys talking about cloud native and how businesses are adopting cloud native strategies to stay ahead of the competition. Follow him on Twitter at @titogeo.
Most log management solutions store log data in a database and make it searchable by maintaining an index of that data. As the database grows, so does the cost of maintaining the index. At small scale this is not a problem, but in large-scale deployments organizations end up spending significant compute, storage, and human resources just to manage their indexes, on top of the data itself. When a business processes terabytes of data every day, a database-backed log management system becomes untenable.
Another common problem is that most logging solutions do not store just a single copy of the data. Many DIY log management implementations use popular databases such as MongoDB, Elasticsearch, and Cassandra. Take Elasticsearch as an example: an Elasticsearch cluster runs multiple replicas of the data in the hot storage tier to ensure high availability. Even with data compression, the replication required to keep the data available dramatically increases the total amount of storage needed, and the problem is magnified when you factor in the storage required for the indexes.
Clustering also increases management complexity and requires users to understand how to handle node failures and data recovery. Even with replication, a new instance cannot be launched immediately when an instance crashes; in most cases there is a window of downtime during which the log analysis system is unavailable. While this is happening, data continues to arrive, since logs are generated in real time, and catching up requires an additional supply of resources. Because real-time data never stops, the log analysis system can struggle to catch up at all. One-click elasticity is key to handling this at scale.
The challenges described above are classic examples of the “tax on storage operations” that any DIY solution has to pay. The bigger the scale, the higher the tax! A business ingesting about a terabyte of data per day would need tens of terabytes of storage and a commensurate amount of RAM if it wanted to keep 30 days of searchable log data.
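The retention arithmetic can be sketched as follows. The replica count, compression ratio, and index overhead below are illustrative assumptions, not measured figures for any particular database:

```python
# Back-of-the-envelope estimate of the hot-storage footprint for a
# database-backed log cluster. All overhead factors are assumptions
# chosen only to make the scaling behavior concrete.

def hot_storage_needed_tb(daily_ingest_tb, retention_days,
                          replicas=1,            # assumed replica count
                          index_overhead=0.3,    # assumed index size vs. data
                          compression_ratio=0.5):  # assumed on-disk ratio
    """Return total storage (TB) across all copies, indexes included."""
    raw = daily_ingest_tb * retention_days
    compressed = raw * compression_ratio           # data after compression
    with_index = compressed * (1 + index_overhead)  # inverted index adds space
    copies = 1 + replicas                           # primary + replica shards
    return with_index * copies

# 1 TB/day with 30-day retention and one replica:
print(round(hot_storage_needed_tb(1.0, 30), 2))
```

Even under generous compression assumptions, the replicas and indexes multiply a 30 TB month of raw logs into tens of terabytes of hot storage.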
The way to solve this problem is to move away from databases and use a scalable API storage layer. An API storage layer like Amazon Web Services’ S3, traditionally used for cold storage, meets this requirement quite well: it offers high availability and durability, near-infinite scale, the lowest price per GB, and effectively reduces the tax on storage operations to zero. For this to work, however, applications must not suffer the higher latency typical of cold storage.
Do you keep 30 days of data?
Companies think they keep 30 days of log data in their hot storage, but in practice they rarely use most of it. Most queries come from periodically run reports, not from a user sitting interactively at a console. This is especially true at large scale, where ingesting hundreds of megabytes or even gigabytes of log data per minute is not uncommon. Interactive workflows in such environments focus on identifying relevant events and data patterns, which are then codified into rules and converted into real-time notifications to administrators. This means that most of the data does not need to sit in hot storage at all; it can instead be processed inline during ingestion or asynchronously at a later time.
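As a minimal sketch of that idea, the rules identified interactively can be compiled into an ingest-time pipeline that flags matching lines as they arrive. The class, rule patterns, and log lines below are all hypothetical:

```python
# Sketch: evaluate alert rules during ingestion instead of keeping all
# data hot for later interactive search. Rules are plain regexes here;
# a real system would use richer event models.
import re

class IngestPipeline:
    def __init__(self):
        self.rules = []  # list of (compiled pattern, notify callback)

    def add_rule(self, pattern, notify):
        self.rules.append((re.compile(pattern), notify))

    def ingest(self, line):
        # Matching lines trigger a real-time notification; everything
        # else can go straight to cheap cold storage.
        for pattern, notify in self.rules:
            if pattern.search(line):
                notify(line)

alerts = []
pipeline = IngestPipeline()
pipeline.add_rule(r"ERROR|OOMKilled", alerts.append)
pipeline.ingest("2024-01-01T00:00:00Z INFO request served in 12ms")
pipeline.ingest("2024-01-01T00:00:01Z ERROR payment service timeout")
```

Only the second line matches a rule, so only it generates a notification; neither line ever needed to be indexed in a hot database for this workflow to function.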
There’s another good reason why businesses move data quickly to S3 or other compatible cold storage: shortening how long data lives in a database separates data storage from compute, which makes it easier for organizations to scale storage and to recover failed clusters. It is significantly cheaper to store data in cold storage than in a database, and scaling cold storage is far easier than scaling a database.
This approach, however, creates a new problem: the data is now split across two tiers, hot and cold. Moving and managing data between the two requires expertise. Deciding what to prioritize, how often to move data, and when to rehydrate the hot tier with data from the cold tier becomes routine work. The “tax on storage operations” has just gone up.
What if I need long-term data retention?
In highly regulated environments, short-term retention is usually not an option: companies must store data, index it, and keep it searchable for several years. The same problems exist, only at an even larger scale. The choice is between large amounts of expensive primary storage and a tiered storage architecture. With such requirements, a multi-tier implementation is common, with most of the data in the cold tier but important data still in the hot tier (for example, the last 30 days). The “tax on storage operations” is not going anywhere; it simply keeps growing.
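For illustration, this kind of long-term tiering is typically expressed declaratively, for example with an S3 bucket lifecycle configuration along these lines. The prefix, day counts, and retention period below are placeholder assumptions:

```json
{
  "Rules": [
    {
      "ID": "archive-old-logs",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "Expiration": { "Days": 2555 }
    }
  ]
}
```

The policy itself is simple; the operational burden comes from everything around it: deciding what qualifies for each tier, and rehydrating archived objects when an old investigation needs them.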
Eliminating the old storage architecture and data tiering
Businesses use a tiered approach to storage because they fear losing the ability to find data in cold storage. When an investigation is required, an arduous retrieval process makes accessing those logs slow and difficult, and real-time searches cannot be performed on older data. For some types of applications, this is okay. For critical-path applications that generate revenue, however, having fast, real-time access to logs and the ability to extract information from them at any time is crucial. Keeping data in multiple tiers, a “hot” store and a “cold” store, creates management costs and overhead, especially for day-2 operations, while moving everything to hot storage would be extremely expensive. What if you could make cold storage your primary store?
Making S3 searchable, or a zero “tax on storage operations”
What if we could make S3-compatible storage as searchable as a database? The reason businesses keep their log data in a database is to enable real-time search, yet in practice most organizations do not keep as much historical data in databases as their official data retention policies dictate. If any S3-compatible store could be made as searchable as a database, organizations could dramatically reduce the amount of data stored in databases and the associated compute required to manage it. Only the most recent data – say, a minute’s worth – needs to live on disk; after that, everything goes to S3. Running multiple database instances for high availability is no longer necessary, because if the cluster goes down, a new cluster can be launched and pointed at the same S3-compatible bucket.
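A minimal sketch of that ingestion path, assuming a short on-disk window before each batch becomes an immutable, time-keyed object. A local directory stands in for the S3-compatible bucket here; a real implementation would PUT each object to S3 instead:

```python
# Sketch: buffer log lines for one window, then flush the batch as a
# compressed, time-keyed object. The directory below simulates a bucket.
import gzip
import os
import time

class MinuteBuffer:
    def __init__(self, bucket_dir, window_seconds=60.0):
        self.bucket_dir = bucket_dir
        self.window_seconds = window_seconds
        self.lines = []
        self.window_start = time.time()
        os.makedirs(bucket_dir, exist_ok=True)

    def append(self, line):
        self.lines.append(line)
        # Once the window closes, ship the batch to object storage.
        if time.time() - self.window_start >= self.window_seconds:
            self.flush()

    def flush(self):
        if not self.lines:
            return None
        # Keying objects by window start means a brand-new cluster can be
        # pointed at the same bucket and find every batch with no local state.
        key = os.path.join(self.bucket_dir,
                           f"logs-{int(self.window_start)}.log.gz")
        with gzip.open(key, "wt") as f:
            f.write("\n".join(self.lines))
        self.lines = []
        self.window_start = time.time()
        return key

buf = MinuteBuffer("demo-bucket")
buf.append("INFO service started")
buf.append("ERROR connection reset")
key = buf.flush()  # the batch is now an object in the "bucket"
```

Because every flushed object is immutable and named by its time window, there is nothing to replicate at the compute layer; durability is the object store’s job.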
Moving log data directly to cold storage while preserving real-time search improves scalability, increases the availability of log data, and dramatically reduces costs on both storage and compute resources. When log data is accessed directly in cold storage, users no longer have to manage indexes across hot and cold storage tiers, rehydrate data, or create complex tiering policies. It also means that companies can honor their data retention plans while ensuring that developers can access the logs and use them to debug critical applications.
Photo by Michael Dziedzic on Unsplash.