Archiving
The following article exclusively pertains to a Graylog Enterprise feature or functionality. To learn more about obtaining an Enterprise license, please contact the Graylog Sales team.
Storing extensive amounts of data in an OpenSearch cluster can be costly. Graylog allows you to store inactive data in the Graylog archive to help lower storage costs and maximize retention as needed. When archiving index sets you can set a retention period based on a variety of factors. Archived indices are then deleted after the retention period is complete.
The Graylog archive may be the best choice for data storage for a variety of reasons, including:
- A more cost-efficient alternative to storing messages in OpenSearch
- Compliance with regulations such as HIPAA, PCI, and others
In Graylog you may choose to archive index sets to compressed flat files on the local file system or to S3-compatible object storage. Note that indices are archived before retention cleaning begins so that no data is lost.
Configure the Graylog Archive
You can configure the archive directly via the Graylog web interface:
- Navigate to Enterprise > Archives.
- Select the Configuration tab.
The following configuration options are available in this menu:
Name | Description |
---|---|
Backend | Backend on the node where the archived files are stored. |
Enable multithreading | Enables faster archiving by using multiple threads per search node. |
Max Segment Size | Maximum size (in bytes) of archive segment files. |
Compression Type | Compression type used to compress the archives. |
Checksum Type | Algorithm used to calculate the checksum for archives. |
Restore Index Batch Size | OpenSearch batch size when restoring archive files. |
Streams to Archive | Streams included in the archive. |
Choose a Backend
Archived indices are stored in a backend. You can choose either a file system backend or an S3 backend, depending on your environment or preference.
File system is the default backend. The initial server startup creates a backend, but you may adjust the backend location in your Graylog configuration .yml file.
Graylog supports only a single file system backend type.
The archived indices are stored in the Output base path directory. This directory needs to exist and be writable for the Graylog server process to store the files.
The archiving process runs on the leader node, so only the leader node needs access to the Output base path directory. We recommend placing the Output base path directory on a separate disk or partition so that message processing is not affected if the archive fills the disk.
You may edit archive backend configuration options via Enterprise > Archives under Manage Backends.
File System Backend Configuration Options
The following is a list of configurable properties for file system backends:
Name | Description |
---|---|
Title | A unique title to identify the backend. |
Description | A description for the backend. |
Output base path | Directory path where the archive files are stored. Use a simple directory path string or a template string as the output base path to build dynamic paths. |
Template String for Storage
You can also use a template string to store the archived data in a directory tree, which is based on the archival date. For example:
# Template
/data/graylog-archive/${year}/${month}/${day}
# Result
/data/graylog-archive/2017/04/01/graylog_0
The following variables may be used in the template string as necessary:
Name | Description |
---|---|
${year} | Archival date year. |
${month} | Archival date month. |
${day} | Archival date day. |
${hour} | Archival date hour. |
${minute} | Archival date minute. |
${second} | Archival date second. |
${index-name} | Name of the archived index. |
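To make the template expansion concrete, here is a minimal Python sketch of how such placeholders could be substituted. It is an illustration only, not Graylog's implementation: the function name `expand_output_path` is hypothetical, the placeholder names other than `year`, `month`, and `day` are assumptions, and the zero-padding is inferred from the `2017/04/01` result shown above.

```python
import re
from datetime import datetime

# Matches placeholders such as ${year} or ${index-name}.
_PLACEHOLDER = re.compile(r"\$\{([a-z-]+)\}")


def expand_output_path(template: str, archived_at: datetime, index_name: str) -> str:
    """Expand ${...} placeholders (hypothetical helper, for illustration)."""
    values = {
        "year": f"{archived_at.year:04d}",
        "month": f"{archived_at.month:02d}",
        "day": f"{archived_at.day:02d}",
        "hour": f"{archived_at.hour:02d}",
        "minute": f"{archived_at.minute:02d}",
        "second": f"{archived_at.second:02d}",
        "index-name": index_name,  # assumed placeholder name
    }
    # Unknown placeholders are left untouched rather than silently dropped.
    return _PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), template)


base = expand_output_path(
    "/data/graylog-archive/${year}/${month}/${day}",
    datetime(2017, 4, 1, 23, 24, 59),
    "graylog_0",
)
print(base)
```

In the result shown earlier, the archived index then appears as a `graylog_0` subdirectory below this expanded base path.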
The S3 archiving backend is built to work with AWS and can be used to upload archives to an AWS S3 object storage service. The backend also works with other S3-compatible object storage implementations, such as MinIO, Ceph, and DigitalOcean Spaces.
To configure an S3 backend:
- Navigate to the Archives page and select the Manage Backends tab.
- Click Create Backend.
- Select S3 from the Backend Type drop-down.
- Complete the fields that apply to your archive, as described in the following table.
- Finally, activate the backend as described in this section.
Configuration Options for an S3 Backend
Name | Description |
---|---|
Title | A unique title to identify the backend. |
Description | Description of the backend. |
S3 Endpoint URL | Only configure this if not using AWS. |
AWS Authentication Type | Choose access type from the drop-down menu. |
AWS Assume Role (ARN) | An optional input for alternate authentication mechanisms. |
Bucket Name | The unique name of the S3 bucket. |
Spool Directory | Directory where archiving data is stored before it is uploaded. |
AWS Region | Choose Automatic or configure the appropriate option. |
S3 Output Base Path | Archives are stored under this path. |
Select AWS Authentication Type
Graylog provides two options for granting access. You can:
- Utilize the Automatic authentication mechanism by providing AWS credentials through your file system or process environment.
- Enter credentials manually.
Assign AWS Assume Role (ARN)
An assume role ARN (Amazon Resource Name) is typically used to allow cross-account access to a bucket. See ARN for further details.
Adjust Spool Directory
The archiving process needs the spool directory to store some temporary data before it can be uploaded to S3.
Ensure that the directory is writable and has sufficient space for 10 times the Max Segment Size. You can make adjustments on the Configuration page, as mentioned previously.
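The spool directory requirements can be checked before you activate the backend. The following Python sketch is one way to do so; the helper name `spool_dir_ok` is hypothetical, and the factor of 10 simply mirrors the recommendation above.

```python
import os
import shutil
import tempfile


def spool_dir_ok(path: str, max_segment_size: int, factor: int = 10) -> bool:
    """Return True if path is a writable directory with enough free space."""
    # The process must be able to enter the directory and create files in it.
    if not os.path.isdir(path) or not os.access(path, os.W_OK | os.X_OK):
        return False
    required_bytes = factor * max_segment_size
    return shutil.disk_usage(path).free >= required_bytes


# Example: check the system temp directory against a 500 MiB segment size.
print(spool_dir_ok(tempfile.gettempdir(), 500 * 1024 * 1024))
```

Run the check as the same user that runs the Graylog server process, since write permissions depend on the user.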
Select AWS Region
If you are not using AWS, you do not need to configure the AWS Region.
If you use AWS, select the AWS region where your archiving bucket resides. If you select nothing, Graylog uses the region from your file system or process environment, if available.
Configure the S3 Output Base Path
S3 Output Base Path is a prefix to the file name that works similarly to a directory. Configure this to help organize your data. For example, you can add the month, year, or day along with the index name to create subfolders based on these variables.
You can use the following variables to construct a dynamic value for each archive and give it structure:
Variable | Description |
---|---|
${index-name} | Name of the index that gets archived. |
${year} | Archival date year. |
${month} | Archival date month. |
${day} | Archival date day. |
${hour} | Archival date hour. |
${minute} | Archival date minute. |
${second} | Archival date second. |
Apply AWS Security Permissions
When writing AWS security policies, make them as restrictive as possible. It is best practice to enable specific actions needed by the application rather than allowing all actions.
These permissions are required for Graylog to successfully make use of the S3 bucket:
Permission | Description |
---|---|
CreateBucket | Creates an S3 bucket. |
HeadBucket | Determines whether a bucket exists and whether you have permission to access it. |
PutObject | Adds an object to a bucket. |
CreateMultipartUpload | Initiates a multipart upload and returns an upload ID. |
CompleteMultipartUpload | Completes a multipart upload by assembling previously uploaded parts. |
UploadPart | Uploads a part in a multipart upload. |
AbortMultipartUpload | Aborts a multipart upload. |
GetObject | Retrieves objects from Amazon S3. |
HeadObject | Retrieves metadata from an object without returning the object itself. |
ListObjects | Returns some or all (up to 1,000) of the objects in a bucket with each request. |
DeleteObjects | Enables you to delete multiple objects from a bucket using a single HTTP request. |
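The table above lists S3 API operations, whereas IAM policies grant actions, and the two names do not always match. In AWS, HeadBucket and ListObjects are authorized by `s3:ListBucket`, HeadObject by `s3:GetObject`, the create, upload-part, and complete steps of a multipart upload by `s3:PutObject`, and DeleteObjects by `s3:DeleteObject`. The Python sketch below builds a restrictive policy document along those lines. The function name and the bucket name `example-graylog-archive` are hypothetical, and you should verify the result against the AWS documentation and your own security requirements before using it.

```python
import json


def archive_bucket_policy(bucket: str) -> dict:
    """Build a least-privilege IAM policy document for one archive bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level operations: CreateBucket, HeadBucket, ListObjects.
                "Effect": "Allow",
                "Action": ["s3:CreateBucket", "s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # Object-level operations: uploads (including multipart),
                # reads, and deletes.
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:AbortMultipartUpload",
                    "s3:GetObject",
                    "s3:DeleteObject",
                ],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


print(json.dumps(archive_bucket_policy("example-graylog-archive"), indent=2))
```

If the bucket already exists, `s3:CreateBucket` can likely be left out to tighten the policy further.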
Activate the Backend
When you configure your bucket and select Save, you are routed back to the Edit archive backend configuration page. To activate a backend:
- Click the Configuration tab located in the top right corner.
- Select the backend you want to activate from the Backend drop-down menu.
- Change any configuration options as needed, or keep the provided defaults.
- Finally, click the green Update configuration button at the bottom of the screen.
This action navigates you back to the Archives page.
Enable Multithread Archiving
Multithread archiving allows for the use of more than one Java thread per search node. It is a faster option for creating archives because it expands write capacity by running a thread for each shard.
Multithread archiving is enabled by default on new installations. To enable this feature on existing instances, complete the following steps:
- Navigate to Enterprise > Archives.
- Select the Configuration tab.
- Select the Enable multithreading check box.
Select a Max Segment Size
When you archive an index, the archive job writes the data into segments. The Max Segment Size setting limits the size of each segment file, which lets you process the files with tools that have a file-size limit.
Once the size limit is reached, a new segment file is started. For example:
/path/to/archive/
  graylog_201/
    archive-metadata.json
    archive-segment-0.gz
    archive-segment-1.gz
    archive-segment-2.gz
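The rollover behavior can be sketched in a few lines of Python. This is an illustration under assumptions, not Graylog's implementation: the function name is hypothetical, the limit is applied to uncompressed record sizes, and a new segment is started just before the limit would be exceeded.

```python
def split_into_segments(records: list[bytes], max_segment_size: int) -> list[list[bytes]]:
    """Group records into segments whose total size stays within the limit."""
    segments: list[list[bytes]] = [[]]
    current_size = 0
    for record in records:
        # Start a new segment before the limit would be exceeded, but never
        # emit an empty segment (an oversized record gets a segment to itself).
        if segments[-1] and current_size + len(record) > max_segment_size:
            segments.append([])
            current_size = 0
        segments[-1].append(record)
        current_size += len(record)
    return segments if segments[-1] else []


records = [b"aaaa", b"bbbb", b"cccc", b"dddd", b"eeee"]
for number, segment in enumerate(split_into_segments(records, 10)):
    print(f"archive-segment-{number}.gz holds {len(segment)} record(s)")
```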
Select a Compression Type
Archives are compressed with gzip by default, but you can switch to a different compression type.
The selected compression type significantly impacts the time it takes to archive an index. For example, gzip is generally slower but produces smaller archives. Snappy and LZ4 are quicker, but the archives will be larger.
Here is a general comparison between the available compression algorithms with select test data. Results will vary based on your data!
Type | Index Size | Archive Size | Duration |
---|---|---|---|
gzip | 1 GB | 134 MB | 15 minutes, 23 seconds |
Zstandard | 1 GB | 225.2 MB | 5 minutes, 55 seconds |
Snappy | 1 GB | 291 MB | 2 minutes, 31 seconds |
LZ4 | 1 GB | 266 MB | 2 minutes, 25 seconds |
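To compare the trade-off more directly, the figures above can be converted into a relative archive size and an archiving throughput. The short Python sketch below does this arithmetic. It assumes 1 GB means 1000 MB, and the `summarize` helper is hypothetical.

```python
# Figures copied from the comparison table: (archive size in MB, duration in seconds).
INDEX_SIZE_MB = 1000.0  # assumption: 1 GB is treated as 1000 MB
RESULTS = {
    "gzip": (134.0, 15 * 60 + 23),
    "Zstandard": (225.2, 5 * 60 + 55),
    "Snappy": (291.0, 2 * 60 + 31),
    "LZ4": (266.0, 2 * 60 + 25),
}


def summarize(archive_mb: float, seconds: int) -> tuple[float, float]:
    """Return (archive size as % of index size, MB of index data per second)."""
    return (
        round(100 * archive_mb / INDEX_SIZE_MB, 1),
        round(INDEX_SIZE_MB / seconds, 2),
    )


for name, (archive_mb, seconds) in RESULTS.items():
    percent, throughput = summarize(archive_mb, seconds)
    print(f"{name:10s} {percent:5.1f}% of index size, {throughput:5.2f} MB/s")
```

With this test data, gzip produces an archive of roughly 13% of the index size, while LZ4 and Snappy process the index about six times faster than gzip.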
Select a Checksum Type
When Graylog writes archives, it computes a CRC32 checksum over the files by default. You can select a different checksum algorithm if needed.
To find the most appropriate type of checksum, consider that CRC32 and MD5 are quick to compute and are a reasonable choice for detecting damaged files, but neither is suitable for protection against malicious changes to the files. Graylog also supports SHA-1 and SHA-256 checksums. Because these are cryptographic hashes, they can be used to verify that the files were not modified.
When selecting a checksum type, we recommend that you determine:
- Whether the necessary system tools to compute them are installed (SHA-256 utility, for example).
- The speed of checksum calculation for larger files.
- Security considerations.
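If the system tools are not available, a checksum can also be computed with a short script. The following Python sketch computes CRC32 and SHA-256 for a file using only the standard library. The helper name `file_checksums` is hypothetical, and how the expected checksum is recorded in the archive metadata is not covered here, so comparing the values is left to you.

```python
import hashlib
import os
import tempfile
import zlib


def file_checksums(path: str, chunk_size: int = 1024 * 1024) -> dict[str, str]:
    """Compute CRC32 and SHA-256 of a file in a single pass."""
    crc = 0
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large segment files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
            sha256.update(chunk)
    return {"crc32": f"{crc:08x}", "sha256": sha256.hexdigest()}


# Demonstration with a throwaway file standing in for an archive segment.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example archive segment")
print(file_checksums(tmp.name))
os.remove(tmp.name)
```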
Set the Restore Index Batch Size
The Batch Size setting controls the batch size for re-indexing archive data into OpenSearch. When set to 1000, the restore job re-indexes the archived data in batches of 1000 documents.
Use this setting to control the speed of the restore process and the amount of load it generates on the OpenSearch cluster. The higher the batch size, the faster the restore progresses, and the more load is put on your OpenSearch cluster beyond the normal message processing.
Tune this setting carefully to avoid any negative impact on your message indexing throughput and search speed.
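The effect of the batch size can be illustrated with a small Python sketch that groups documents into fixed-size batches. It is a simplified model, not the restore job itself: the `in_batches` helper is hypothetical, and treating each batch as one indexing request to OpenSearch is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator


def in_batches(documents: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield consecutive batches of at most batch_size documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(documents)
    # islice returns an empty list once the iterator is exhausted, ending the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


documents = ({"message": f"event {n}"} for n in range(2500))
print([len(batch) for batch in in_batches(documents, 1000)])  # [1000, 1000, 500]
```

A larger batch size means fewer, heavier requests; a smaller one spreads the load over more requests and slows the restore.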
Configure Index Retention with Data Tiering
You can determine how long you want to retain the data in a new or existing index set. To do so:
- Navigate to the Rotation and Retention section on the index set configuration page.
- Toggle to the Data Tiering option.
- Select the minimum and maximum number of days you want to store your data.
Data is deleted once the maximum number of days is reached. If you wish to keep this data, you may select Archive before deletion.
If you have an Enterprise license, you may also select Enable Warm Tier. You can then enter the minimum number of days you want the data to stay in the hot tier before it moves to the warm tier. You can also designate a warm storage repository in which to store the index set. If none exists, you can create a new warm storage repository in this menu.
Configure Index Retention (Legacy)
Graylog uses configurable index retention strategies to delete old indices. By default, indices can be closed or deleted if they exceed the configured limit.
The Graylog archive offers a separate index retention strategy that you can configure to automatically archive an index before closing or deleting it. Select Archive index to enable this feature. (See Index Model for details on these strategies.)
As with the other index retention strategies, you can configure a maximum number of OpenSearch indices. When there are more indices than the configured limit, the oldest indices are archived in the backend and closed or deleted. You can also choose to do nothing after archiving an index by selecting NONE. In that case, no cleanup of old indices will happen, and you will be able to manage the archive yourself.
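The count-based selection described above can be modeled in a few lines of Python. This is a simplified sketch, not Graylog's retention code: the `plan_retention` helper is hypothetical, and the index names are assumed to be ordered from oldest to newest.

```python
def plan_retention(indices: list[str], max_indices: int) -> tuple[list[str], list[str]]:
    """Split indices (oldest first) into (to_archive, to_keep)."""
    if max_indices < 0:
        raise ValueError("max_indices must not be negative")
    # Everything beyond the configured limit is archived, oldest first.
    excess = max(0, len(indices) - max_indices)
    return indices[:excess], indices[excess:]


indices = ["graylog_0", "graylog_1", "graylog_2", "graylog_3", "graylog_4"]
to_archive, to_keep = plan_retention(indices, max_indices=3)
print("archive, then close, delete, or leave:", to_archive)
print("keep:", to_keep)
```

With a limit of three, the two oldest indices are archived first and then closed, deleted, or left alone, depending on the selected action.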
Select Streams To Archive
The Streams to Archive setting determines which streams are included in the archive. It allows you to archive only important data as determined by your streams, rather than everything that is brought into Graylog.
Next Steps
Once you have completed setup for Graylog archiving, you can archive data for a specified index set!