Data Tiering

Graylog gives you the option to store and manage data in tiers. Each tier is a storage repository for data that shares the same storage and management requirements. Data is grouped by how often it is used and the level of search performance required. Because each tier has its own storage and accessibility criteria, you can move your data to the tier that best aligns with your needs.

Data tiering can help lower storage costs because less frequently accessed data can be stored in a lower cost tier. For example, data that only needs to be retained for compliance checks can be stored in a tier that does not provide high performance storage and is therefore less expensive. You can reserve the costlier, high performance tier for more frequently searched data.

Data may be classified into tiers based on:

  • performance requirements

  • frequency of use

  • cost efficiency

We recommend data tiering as a cost-effective way of storing data for self-managed installations.

Hint: Data tiering is not available for Cloud customers since Graylog handles advanced index configuration and storage for Cloud customers as part of the managed service.

Explore the Tiers

In a data tiering model, there are essentially three tiers for data storage: the hot tier, the warm tier, and the archive.

The Hot Tier

New indices that are part of a data stream are automatically allocated to the hot tier, which in the case of tiering refers to the search backend cluster you use for storage and acts as the default for all incoming data. Data in the hot tier is easy to access and search, but operating costs are generally higher because of the resources that must be allocated to maintain it.

The Warm Tier

The following section exclusively pertains to a Graylog Enterprise feature. To learn more about obtaining an Enterprise license, please contact the Graylog Sales team.

Data in the warm tier is searchable, but search performance is lower compared to the hot tier. Warm data is stored in searchable snapshots and not directly in index sets. Once a search is triggered in the warm tier, the system loads the data from this restored layer into the search backend cluster. The warm tier is suitable for storing data that does not require frequent access, such as logs from recent weeks.

Warning: The warm tier is searchable, but adding any warm indices to a search may slow down the search process. Please note that a lower search speed may also hinder the performance of any search-based features such as widgets and dashboards.

A searchable snapshot index reads from the repository and does not download all data to the cluster at restore time. This method makes the warm tier a cost-effective storage solution. Snapshots are stored in a warm storage repository, which may be either an AWS S3 bucket or a local file system. Searchable snapshots remain in the repository in snapshot format and are read-only.

Snapshots on OpenSearch

Warning: Utilizing an existing OpenSearch node in your cluster as a warm tier search node will impact overall cluster performance!

As warm tier data is stored in snapshots on OpenSearch, if you opt to utilize a warm tier, then backing up your data on OpenSearch will require careful consideration. If you are utilizing snapshots directly on OpenSearch as a backup strategy and you wish to use the warm tier option, then you need to consider using a dedicated search node for warm tier data. This may require you to rearchitect your OpenSearch cluster to create a dedicated search node.

Archiving

The following section exclusively pertains to a Graylog Enterprise feature. To learn more about obtaining an Enterprise license, please contact the Graylog Sales team.

Graylog offers archiving for less critical data, making it a lower cost option for storing compliance and historical data.

The archive stores messages until you need to re-process them into Graylog for analysis. You can instruct Graylog to automatically archive log messages to compressed flat files on the local file system or to an S3-compatible object storage. Messages are stored before retention cleaning begins, and they are not deleted from the search backend.

Hint: Currently, you can utilize both Data Warehouses and archives to preserve your log data long term as both features perform similar functions; however, there are some benefits to utilizing a Data Warehouse for less immediately valuable data. Retrieving logs from a Data Warehouse is a faster process as log retrieval is granular. Additionally, the data in a Data Warehouse is compressed, so it is generally a lower cost option for data storage.

Prepare Your Environment for a Warm Tier

Prerequisites

Warning: The following steps detail how to prepare your environment for data tiering when deploying Graylog with a self-managed OpenSearch cluster. If you instead opt to utilize Data Node, please proceed to the following sections on setting up a warm tier.

Confirm Compatibility

Graylog displays a warning on the Indices and Index Sets page and the Index Set Overview page if data tiering cannot be used because:

  • your version of search backend is not compatible

  • your Security or Enterprise license has expired

Warning: If the warm tier is disabled, you still may be able to perform searches in the warm tier, but there is no rollover from the hot tier to the warm tier. This limitation may cause performance issues.

Install the S3 OpenSearch Plugin and Add Keys to Keystore

If you are using S3 as your data storage, follow OpenSearch guidance on installing the S3 plugin and adding your AWS access and secret keys to the OpenSearch keystore.

Create a Repository

You must create at least one storage repository to store snapshots. We recommend that you locate your warm data in an S3 bucket (please note applicable security settings); however, you may also choose to store this data in any supported file system repository according to your preference.

You may create multiple repositories or split your storage between S3 and the file system. Repositories can be created through Graylog:

  1. Navigate to System > Indices and locate the desired index set.

  2. Click the Edit button found on the right side of the screen.

  3. Toggle to select Data Tiering.

  4. Scroll to the Rotation and Retention section.

  5. Click Create new repository.

  6. Select either S3 or FS (your local file system) as a repository type.

  7. Give your repository a unique name.

  8. Choose a location from the drop-down menu. (The selections are locations that are detected in your OpenSearch configuration file).

  9. Click Create.

For the remainder of this article, we assume that you are utilizing an S3 bucket for data storage. Please follow the recommendations provided by your storage vendor if you opt to use another storage method. Note that for other file system repositories, you must at a minimum add a file system path to the OpenSearch configuration file using the path.repo property: 

path.repo: ["/mnt/snapshots"]

Configure the OpenSearch Configuration File

Assign the search Role

OpenSearch nodes used for the warm tier must have the search role. Assign this role in the OpenSearch configuration file:

node.roles: [search]

If you are not sure whether your configuration file includes this role, you can use the _cat/nodes API endpoint:

curl "http://127.0.0.1:9200/_cat/nodes?v&h=ip,name,node.role,node.roles"

So, for example:

ip            name node.role node.roles
192.168.0.153 glwn s         search
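
To script this check, for example across several nodes, the tabular output can be parsed. The following is a minimal Python sketch, assuming the four-column layout requested above and node names without spaces; the function name nodes_with_search_role is hypothetical and not part of Graylog or OpenSearch.

```python
def nodes_with_search_role(cat_nodes_output: str) -> list[str]:
    """Return the names of nodes whose node.roles column includes 'search'.

    Assumes output from _cat/nodes?v&h=ip,name,node.role,node.roles,
    that is, a header row followed by one whitespace-separated row per node.
    """
    lines = [line for line in cat_nodes_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_col = header.index("name")
    roles_col = header.index("node.roles")
    matches = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= roles_col:
            continue  # skip rows that do not have every expected column
        # node.roles is a comma-separated list such as "data,ingest,search"
        if "search" in fields[roles_col].split(","):
            matches.append(fields[name_col])
    return matches


sample = """\
ip            name node.role node.roles
192.168.0.153 glwn s         search
"""
print(nodes_with_search_role(sample))  # prints ['glwn']
```

An empty result means that no node in the output carries the search role.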

Modify the Node Cache Size for Searchable Snapshots

Verify that the node.search.cache.size parameter is included in the OpenSearch configuration file. If it is not, add it.

Set the value to 10gb:

node.search.cache.size: 10gb

Monitoring the performance of your warm tier is critical to optimizing this value. See the section on monitoring your system performance for more detail.
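
Both settings from this section can be checked together before restarting a node. The following is a rough Python sketch, assuming the configuration file uses flat, dotted keys on single lines (as in the examples above) rather than nested YAML; the helper names are hypothetical, and a real YAML parser would be more robust.

```python
def read_flat_settings(config_text: str) -> dict[str, str]:
    """Collect 'key: value' lines, ignoring comments and blank lines."""
    settings = {}
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_warm_tier_settings(config_text: str) -> list[str]:
    """Return a list of problems; an empty list means both settings are present."""
    settings = read_flat_settings(config_text)
    problems = []
    # node.roles is written as a list, for example [search] or [data, search]
    raw_roles = settings.get("node.roles", "").strip("[]")
    roles = [role.strip().strip("'\"") for role in raw_roles.split(",")]
    if "search" not in roles:
        problems.append("node.roles does not include search")
    if "node.search.cache.size" not in settings:
        problems.append("node.search.cache.size is not set")
    return problems


sample_config = """\
node.roles: [search]
node.search.cache.size: 10gb
"""
print(missing_warm_tier_settings(sample_config))  # prints []
```

In practice, the text passed to the function would be the contents of your OpenSearch configuration file.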

Set Up a Warm Tier

Once your environment is ready for data tiering, you can enable the warm tier for both new and existing index sets.

Enable the Warm Tier for a New Index Set

  1. Navigate to System > Indices and click Create index set.

  2. Scroll down to the Rotation and Retention section and select Data Tiering.

  3. Enter the minimum and maximum number of days you want your data to be stored. You may choose to save an index after the maximum time limit by checking the Archive before deletion box.

  4. Select the Enable warm tier check box and enter the minimum number of days to keep your data in the hot tier. The visual synchronously displays how long your data will be kept in each tier as you make your selections.

  5. Select the repository you want your data stored in from the Repository drop-down menu. The menu includes any repositories you created earlier.

  6. Click Create index set.

Enable the Warm Tier for an Existing Index Set

  1. Navigate to System > Indices and locate the desired index set.

  2. Click the Edit Index Set button.

  3. Scroll down to the Rotation and Retention section and select Data Tiering.

  4. Enter the minimum and maximum number of days you want your data to be stored.

  5. Select the Enable warm tier check box and enter the minimum number of days to keep your data in the hot tier. The visual synchronously displays how long your data will be kept in each tier as you make your selections.

  6. Select the repository you want your data stored in from the Repository drop-down menu. The menu includes any repositories you created earlier.

  7. Click Update index set.

View Data Tiering Configuration

Once you have created or updated your index set:

  1. Navigate to System > Indices and Index Sets.

  2. Click on the desired index set.

Here you can see warm displayed in the index title after the index prefix.

You may perform searches in the warm tier, and you can verify that your search results include the warm tier by checking the Stored in index section. You will see warm in the index set title.

Monitor System Resources of the Warm Tier

After your initial setup of the warm tier, we recommend that you monitor the system resource utilization of the Graylog warm tier nodes to determine the optimal amount of disk space for their file caches. In particular, closely observe the active and used percentage metrics, along with the total, active, used, and evicted bytes of the file cache.

The percentage of used file cache should be less than the percentage of active file cache, and both values are ideally 70% or less. In addition, the number of active bytes in the file cache should be less than the used bytes, and the used bytes should be less than the total bytes of the file cache, so:

Active bytes < used bytes < total bytes
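
These guidelines can be expressed as a small check. The following Python sketch assumes the metrics are supplied as a dictionary that uses the field names returned in the file_cache section of the OpenSearch node stats API; the function name and the warning texts are illustrative only.

```python
def check_file_cache(stats: dict) -> list[str]:
    """Compare file cache metrics against the guidelines described above."""
    warnings = []
    active = stats["active_in_bytes"]
    used = stats["used_in_bytes"]
    total = stats["total_in_bytes"]
    # Expected ordering: active bytes < used bytes < total bytes
    if not active < used:
        warnings.append("active bytes are not below used bytes")
    if not used < total:
        warnings.append("used bytes are not below total bytes")
    # Used percentage should stay below active percentage, ideally 70% or less
    if stats["used_percent"] >= stats["active_percent"]:
        warnings.append("used percent is not below active percent")
    if stats["active_percent"] > 70:
        warnings.append("active percent is above 70")
    return warnings


sample_stats = {
    "active_in_bytes": 50066073022,
    "total_in_bytes": 128849018880,
    "used_in_bytes": 71497646618,
    "active_percent": 70,
    "used_percent": 55,
}
print(check_file_cache(sample_stats))  # prints []
```

An empty list means that none of the guidelines is violated; any returned warning suggests that the file cache size may need to be revisited.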

Use with Graylog Data Node

If you enable data tiering with the Graylog Data Node, then you must first issue a certificate authority for the third-party tool you use to query OpenSearch Node's API. Details on issuing a certificate authority may be found in the Data Node documentation.

If you are using self-managed OpenSearch, proceed to the following section.

Retrieve File Cache Metrics

The OpenSearch node stats API allows you to retrieve statistics about your cluster. Here is an example cURL command that can be used to retrieve the file cache metrics discussed in the previous section. (This may be useful in cases where OpenSearch metrics are not being captured and stored within a time-series datastore such as InfluxDB or Prometheus.)

$ curl -s -XGET "http://admin:password@10.0.1.229:9200/_nodes/stats/file_cache?pretty"

The following is an example excerpt from the response to the above command.

"jW4Q6SuXQASt8lM796CBHg" : {
      "timestamp" : 1712857680055,
      "name" : "10.0.1.229",
      "transport_address" : "10.0.1.229:9300",
      "host" : "10.0.1.229",
      "ip" : "10.0.1.229:9300",
      "roles" : [
        "search"
      ],
      "attributes" : {
        "shard_indexing_pressure_enabled" : "true"
      },
      "file_cache" : {
        "timestamp" : 1712857680055,
        "active_in_bytes" : 50066073022,
        "total_in_bytes" : 128849018880,
        "used_in_bytes" : 71497646618,
        "evictions_in_bytes" : 0,
        "active_percent" : 70,
        "used_percent" : 55,
        "hit_count" : 83077,
        "miss_count" : 877
      }
    }

For your reference, the following are the available OpenSearch metrics that may be retrieved via the node stats API:

  • nodes.stats.file_cache.active_in_bytes

  • nodes.stats.file_cache.total_in_bytes

  • nodes.stats.file_cache.used_in_bytes

  • nodes.stats.file_cache.evictions_in_bytes

  • nodes.stats.file_cache.active_percent

  • nodes.stats.file_cache.used_percent

  • nodes.stats.file_cache.hit_count

  • nodes.stats.file_cache.miss_count
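
If these metrics need to be forwarded to a monitoring system, the JSON response can be flattened into the metric names listed above. The following is a minimal Python sketch, assuming the complete response wraps each node entry in a top-level "nodes" object keyed by node ID; the function name file_cache_metrics is hypothetical.

```python
import json


def file_cache_metrics(node_stats_json: str) -> dict[str, dict[str, int]]:
    """Map each node name to its file cache metrics, keyed by metric name."""
    response = json.loads(node_stats_json)
    metrics = {}
    for node in response.get("nodes", {}).values():
        cache = node.get("file_cache")
        if cache is None:
            continue  # skip nodes whose stats lack a file_cache section
        metrics[node["name"]] = {
            f"nodes.stats.file_cache.{field}": value
            for field, value in cache.items()
            if field != "timestamp"  # the timestamp is not one of the listed metrics
        }
    return metrics


# Trimmed sample that reuses the values from the example response above
sample_response = """
{
  "nodes": {
    "jW4Q6SuXQASt8lM796CBHg": {
      "name": "10.0.1.229",
      "roles": ["search"],
      "file_cache": {
        "timestamp": 1712857680055,
        "active_in_bytes": 50066073022,
        "total_in_bytes": 128849018880,
        "used_in_bytes": 71497646618,
        "evictions_in_bytes": 0,
        "active_percent": 70,
        "used_percent": 55,
        "hit_count": 83077,
        "miss_count": 877
      }
    }
  }
}
"""

for node_name, node_metrics in file_cache_metrics(sample_response).items():
    for metric, value in node_metrics.items():
        print(node_name, metric, value)
```

The string passed to the function can be the body returned by the cURL command shown earlier.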