Archiving

The following article exclusively pertains to a Graylog Enterprise feature or functionality. To learn more about obtaining an Enterprise license, please contact the Graylog Sales team.

Storing extensive amounts of data in an OpenSearch cluster can be costly. Graylog allows you to store inactive data in the Graylog archive to help lower storage costs and maximize retention as needed. When archiving index sets, you can set a retention period based on a variety of factors. Archived indices are then deleted after the retention period is complete.

The Graylog archive may be the best choice for data storage for a variety of reasons, including:

  • It is a more cost-efficient alternative to storing messages in OpenSearch.

  • It helps fulfill compliance regulations such as HIPAA, PCI, and others.

In Graylog, you can archive index sets to compressed flat files on the local file system or to S3-compatible object storage. Note that index sets are archived before retention cleaning begins so that no data is lost.

Hint: Archived data is inactive and cannot be searched until restored. If necessary, archived indices can be re-imported through the user interface. Once the data is restored, you can search and analyze it via the web interface. See Restore an Archive for more information.

Configure the Graylog Archive

You can configure the archive directly via the Graylog web interface:

  1. Navigate to Enterprise > Archives.

  2. Select the Configuration tab.

The following configuration options are available in this menu:

Name                       Description
Backend                    Backend on the node where the archived files are stored.
Enable multithreading      Enables faster archiving by using multiple threads per search node.
Max segment size           Maximum size (in bytes) of archive segment files.
Compression type           Compression type used to compress the archives.
Checksum type              Algorithm used to calculate the checksum for archives.
Restore index batch size   OpenSearch batch size when restoring archive files.
Streams to archive         Streams included in the archive.

Choose a Backend

Archived indices are stored in a backend. You can choose either type, the local file system or S3-compatible object storage, based on your environment or preference.

Enable Multithread Archiving

Multithread archiving allows for the use of more than one Java thread per search node. It is a faster option for creating archives because it expands write capacity by running a thread for each shard.

Multithread archiving is enabled by default on new installations. To enable this feature on existing instances, complete the following steps:

  1. Navigate to Enterprise > Archives.

  2. Select the Configuration tab.

  3. Select the Enable multithreading check box.

Warning: Although multithreading may increase archiving speed, it may also increase the load on your OpenSearch cluster and affect its performance.

Select a Max Segment Size

When you archive an index, the archive job writes the data into segments. The Max segment size setting limits the size of each segment file, which lets you control file sizes and process the files with tools that have a file-size limit.

Once the size limit is reached, a new segment file is started. For example:

/path/to/archive/
  graylog_201/
    archive-metadata.json
    archive-segment-0.gz
    archive-segment-1.gz
    archive-segment-2.gz
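The rollover behavior can be illustrated with a small sketch. This is not Graylog's implementation: the helper name `plan_segments`, the use of uncompressed message sizes, and the rule that a segment is closed when the next message would exceed the limit are all assumptions made for illustration.

```python
from typing import Iterable, List


def plan_segments(message_sizes: Iterable[int], max_segment_size: int) -> List[List[int]]:
    """Group message sizes (in bytes) into segments of at most max_segment_size.

    Hypothetical helper: a single message larger than the limit still gets
    a segment of its own, and compression is ignored entirely.
    """
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")

    segments: List[List[int]] = []
    current: List[int] = []
    current_size = 0

    for size in message_sizes:
        # Close the current segment when the next message would exceed the limit.
        if current and current_size + size > max_segment_size:
            segments.append(current)
            current = []
            current_size = 0
        current.append(size)
        current_size += size

    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    sizes = [400, 400, 400, 400, 400]
    for number, segment in enumerate(plan_segments(sizes, max_segment_size=1000)):
        print(f"archive-segment-{number}.gz would hold {sum(segment)} bytes")
```

With five 400-byte messages and a limit of 1000 bytes, the sketch produces three segments, which mirrors the three segment files in the listing above.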

Select a Compression Type

Archives are compressed with gzip by default, but you can switch to a different compression type.

The selected compression type significantly impacts the time it takes to archive an index. For example, gzip is generally slower but produces smaller archives. Snappy and LZ4 are quicker, but the archives will be larger.

Here is a general comparison between the available compression algorithms with select test data. Results will vary based on your data!

Type        Index Size   Archive Size   Duration
gzip        1 GB         134 MB         15 minutes, 23 seconds
Zstandard   1 GB         225.2 MB       5 minutes, 55 seconds
Snappy      1 GB         291 MB         2 minutes, 31 seconds
LZ4         1 GB         266 MB         2 minutes, 25 seconds

Warning: The current implementation of LZ4 is not compatible with LZ4 CLI tools, so it is currently impossible to decompress LZ4 archives outside of Graylog.

Select a Checksum Type

When Graylog writes archives, it computes a CRC32 checksum over the files by default. You can select a different checksum algorithm if needed.

To find the most appropriate checksum type, consider that CRC32 and MD5 are quick to compute and are a reasonable choice for detecting damaged files, but neither is suitable for protection against malicious changes to the files. Graylog also supports SHA-1 and SHA-256 checksums, which are cryptographic hashes and are therefore better suited to verifying that the files were not modified.

When selecting a checksum type, we recommend that you determine:

  • Whether the necessary system tools to compute them are installed (SHA-256 utility, for example).

  • The speed of checksum calculation for larger files.

  • Security considerations.
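To check the first two points on your own system, a short script can compute each checksum type over a file. The following is a minimal sketch that uses only the Python standard library. The function name `file_checksum` and the lowercase hexadecimal output format are assumptions for illustration and may not match how Graylog records checksums.

```python
import hashlib
import zlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large files do not fill memory
SUPPORTED = ("crc32", "md5", "sha1", "sha256")


def file_checksum(path: Path, algorithm: str) -> str:
    """Return the lowercase hex checksum of a file (hypothetical helper)."""
    name = algorithm.lower()
    if name not in SUPPORTED:
        raise ValueError(f"unsupported algorithm: {algorithm}")

    # CRC32 lives in zlib; the other algorithms are provided by hashlib.
    digest = None if name == "crc32" else hashlib.new(name)
    crc = 0

    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            if digest is None:
                crc = zlib.crc32(chunk, crc)  # carry the running CRC forward
            else:
                digest.update(chunk)

    if digest is None:
        return f"{crc & 0xFFFFFFFF:08x}"
    return digest.hexdigest()
```

For example, `file_checksum(Path("/path/to/archive/graylog_201/archive-segment-0.gz"), "sha256")` returns a value that can be compared with the output of a system tool such as `sha256sum`. Timing the call on a large segment file gives a rough idea of the calculation speed for each algorithm.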

Set the Restore Index Batch Size

The Restore index batch size setting controls the batch size for re-indexing archive data into OpenSearch. When set to 1000, the restore job re-indexes the archived data in batches of 1000 documents.

Use this setting to control the speed of the restore process and the amount of load it generates on the OpenSearch cluster. The higher the batch size, the faster the restore progresses, and the more load is put on your OpenSearch cluster beyond the normal message processing.

Tune this setting carefully to avoid any negative impact on your message indexing throughput and search speed.
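The effect of the batch size can be pictured with a generic batching helper. This is only a sketch of the idea, not the restore job itself: the name `batched` is hypothetical, and no OpenSearch client is involved.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(documents: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size documents (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(documents)
    while True:
        # Pull the next batch_size items; an empty list means the input is exhausted.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    archived_documents = range(2500)  # stand-in for documents read from an archive
    for batch in batched(archived_documents, batch_size=1000):
        print(f"re-indexing a batch of {len(batch)} documents")
```

With 2500 documents and a batch size of 1000, the loop handles batches of 1000, 1000, and 500 documents. A larger batch size means fewer, heavier requests, which is the trade-off between restore speed and cluster load described above.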

Configure Index Retention with Data Tiering

Hint: Our model for rotation and retention of indices is referred to as data tiering. Data tiering is offered as an option to store and manage data in tiers for specific purposes. See Data Tiering for more information.

You can determine how long you want to retain the data in a new or existing index set. To do so:

  1. Navigate to the Rotation and Retention section on the index set configuration page.

  2. Toggle to the Data Tiering option.

  3. Select the minimum and maximum number of days you want to store your data.

Data is deleted once the maximum number of days is reached. If you wish to keep this data, you may select Archive before deletion.

If you have an Enterprise license, you may also select Enable Warm Tier. You can then enter the minimum number of days you want the data to stay in the hot tier before moving it to warm. You can also designate a warm storage repository to store the index set in. If none exist, you can create a new warm storage repository in this menu.

Configure Index Retention (Legacy)

Warning: We strongly recommend using data tiering for the rotation and retention of index sets. The following legacy strategies will be deprecated in the near future.

Graylog uses configurable index retention strategies to delete old indices. By default, indices can be closed or deleted if they exceed the configured limit.

The Graylog archive offers a separate index retention strategy that you can configure to automatically archive an index before closing or deleting it. Select Archive index to enable this feature. (See Index Model for details on these strategies.)

As with the other index retention strategies, you can configure a maximum number of OpenSearch indices. When there are more indices than the configured limit, the oldest indices are archived in the backend and closed or deleted. You can also choose to do nothing after archiving an index by selecting NONE. In that case, no cleanup of old indices will happen, and you will be able to manage the archive yourself.
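The decision described above can be sketched as a small planning function. This is an illustration under stated assumptions rather than Graylog's retention code: the function name `plan_retention`, the action labels `"close"`, `"delete"`, and `"none"`, and the index names in the example are hypothetical.

```python
from typing import List, Tuple

FOLLOW_UP_ACTIONS = ("close", "delete", "none")  # hypothetical labels


def plan_retention(
    indices_oldest_first: List[str], max_indices: int, after_archive: str
) -> List[Tuple[str, str]]:
    """Return (index, action) steps for indices that exceed the configured limit."""
    if max_indices < 0:
        raise ValueError("max_indices must not be negative")
    if after_archive not in FOLLOW_UP_ACTIONS:
        raise ValueError(f"unknown follow-up action: {after_archive}")

    excess = max(len(indices_oldest_first) - max_indices, 0)
    plan: List[Tuple[str, str]] = []
    for name in indices_oldest_first[:excess]:
        # Always archive first so that no data is lost before cleanup.
        plan.append((name, "archive"))
        if after_archive != "none":
            plan.append((name, after_archive))
    return plan


if __name__ == "__main__":
    indices = ["graylog_199", "graylog_200", "graylog_201"]
    for index, action in plan_retention(indices, max_indices=2, after_archive="delete"):
        print(f"{action}: {index}")
```

With three indices and a limit of two, only the oldest index is archived and then deleted. Passing `"none"` as the follow-up action leaves the archived index in place, which corresponds to managing cleanup yourself.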

Select Streams To Archive

The Streams to archive setting determines which streams are included in the archive. It allows you to archive only important data, as determined by your streams, rather than everything that is brought into Graylog.

Hint: New streams are archived automatically. If you create a new stream and don’t want it to be archived, disable it in this configuration dialog.

Next Steps

Once you have completed setup for Graylog archiving, you can archive data for a specified index set!