Create and Manage Data Lake Storage

The following article exclusively pertains to a Graylog Enterprise feature or functionality. To learn more about obtaining an Enterprise license, please contact the Graylog Sales team.

Before you can start routing Graylog data to a Data Lake, you must first set up your backend storage. You can use either Amazon S3 or local network file storage as backend storage for your Data Lakes. You can also establish a data retention policy to help you maintain legal or regulatory compliance.

Warning: S3-compatible storage from providers other than Amazon is not supported!

This article shows you how to set up a Data Lake storage backend in Graylog, including how to establish data retention policies.

Prerequisites

Before proceeding, ensure that the following prerequisites are met:

  • You must be a Graylog administrator to set up and manage a Data Lake.

  • To use Amazon S3, you must have an existing AWS S3 bucket and appropriate access credentials.

Warning: We strongly recommend that you use an Amazon S3 bucket as your backend storage for logs routed to a Data Lake. If you store logs on a local file store and reach your storage capacity, changing your storage backend requires that you delete all of the data housed in your current backend storage solution!
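
Before you enter the name of your existing bucket in Graylog, a quick local check can catch typos. The following is a minimal sketch, not part of Graylog: the function name looks_like_valid_bucket_name is hypothetical, and the check covers only a subset of the AWS bucket naming rules (length, allowed characters, no adjacent periods, no IP address format). It does not confirm that the bucket exists or that your credentials can access it.

```python
import re

# Subset of the AWS general purpose bucket naming rules (assumption: this is
# an illustrative check only, not a complete validator).
_ALLOWED = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
_IP_ADDRESS = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes a basic S3 bucket naming check."""
    if _ALLOWED.fullmatch(name) is None:
        return False  # wrong length, bad characters, or bad first/last character
    if ".." in name:
        return False  # adjacent periods are not allowed
    if _IP_ADDRESS.fullmatch(name) is not None:
        return False  # names formatted like an IP address are not allowed
    return True
```

For example, looks_like_valid_bucket_name("graylog-data-lake") passes, while a name containing uppercase letters or underscores does not.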

Create a Data Lake Storage Backend

To create a storage backend for your Data Lake:

  1. Navigate to Data Lake > Setup. If you have existing backends, select the Backend tab. Any existing storage backends are displayed here.

  2. Select Create Data Lake Backend.

  3. Select either S3 (preferred) or File system.

  4. Enter configuration details for your selected backend type.

    For Amazon S3, the following configuration options are available:

    Title

    A unique and descriptive name for the backend.

    Description

    Description of the backend.

    S3 Endpoint URL

    The URL that provides the location of the S3 server.

    AWS Authentication Type

    Choose either automatic authentication or key and secret authentication. For more information, see the AWS credential configuration documentation.

    AWS Assume Role (ARN) (optional)

    The Amazon Resource Name (ARN) of the role to assume, which must have the required cross-account permissions.

    S3 Bucket Name

    The name of the S3 bucket in which logs will be stored.

    AWS Region

    The physical location for your cluster data center.

    S3 Output Base Path

    The base path where the archives should be stored within the S3 bucket.

    Warning: This value can only be set on backend creation and cannot be changed at a later date!

    If you select a file system storage option, the following configuration options are available:

    Title

    A unique and descriptive name for the backend.

    Description

    Description of the backend.

    Disk Usage Threshold

    The percentage of disk utilization that should trigger a notification.

    Output Base Path

    The base path where the archives should be stored.

    Warning: This value can only be set on backend creation and cannot be changed at a later date!

  5. Click Create to complete configuration of the storage backend.

  6. Click Activate to make this the active storage backend. See the warning below about data loss if you are switching from an existing storage backend.

If you need to update settings for the Data Lake, such as changing access credentials, click Edit. You are presented with the same options as on initial creation. As noted, you cannot change the Output Base Path after your initial save, but you can update the other settings.
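
If you use a file system backend, it can help to know the current disk utilization of the output path before you choose a Disk Usage Threshold. The following is a minimal sketch using the Python standard library. The function names are hypothetical, and how Graylog itself measures utilization is not described here, so treat the result as an approximation.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the used share of the disk holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # avoid division by zero on unusual file systems
    return usage.used / usage.total * 100


def exceeds_threshold(path: str, threshold_percent: float) -> bool:
    """Return True if utilization is at or above the chosen threshold."""
    return disk_usage_percent(path) >= threshold_percent
```

Calling exceeds_threshold with your intended Output Base Path and a candidate threshold, such as 80, shows whether a notification would already be due under this assumption.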

Data Retention Policy

You can set a data retention policy for your Data Lake so that data is held in storage for only as long as you determine is required. This ability can be a key part of compliance with legal and regulatory requirements. You can set a global retention policy for the Data Lake and, where required, separate retention settings for individual streams.

Set Global Retention Policy

To set a global data retention policy for the Data Lake:

  1. Navigate to the Configuration tab of the Data Lake > Setup page.

  2. Under Retention settings, set the value for Maximum number of days in the Data Lake.

    Hint: Set this value using the ISO 8601 duration format. For example, in ISO 8601, P30D represents 30 days.

  3. Click Update configuration.
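
The retention value follows ISO 8601 duration conventions. The following is a minimal sketch of converting between a number of days and a day-based ISO 8601 duration string. The function names are hypothetical, and only the day designator (for example, P90D) is handled. Whether Graylog accepts other designators, such as weeks or months, is not stated here.

```python
import re

# Matches day-only ISO 8601 durations such as "P30D".
_DAY_DURATION = re.compile(r"P(\d+)D")


def days_to_iso8601(days: int) -> str:
    """Format a whole number of days as an ISO 8601 duration."""
    if days < 0:
        raise ValueError("days must not be negative")
    return f"P{days}D"


def iso8601_to_days(value: str) -> int:
    """Parse a day-only ISO 8601 duration and return the number of days."""
    match = _DAY_DURATION.fullmatch(value)
    if match is None:
        raise ValueError(f"not a day-only ISO 8601 duration: {value!r}")
    return int(match.group(1))
```

Under these assumptions, a 90-day retention period is written as P90D.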

Set Stream Override Retention Policy

Warning: If you set a retention policy on a stream, that policy overrides the global policy for that data only. The global policy applies to any stream that does not have an individual stream policy applied.

To set a data retention policy for an individual stream:

  1. Navigate to the Overview tab of the Data Lake > Setup page and locate the appropriate stream.

  2. Click Data Routing.

    Hint: You can also navigate to the Streams page, locate the stream, click Data Routing, then proceed to step 3, Destinations.

  3. In the Data Lake section, click Data Retention.

  4. In the dialog box, set the value for Maximum number of days in the Data Lake.

    Hint: Set this value using the ISO 8601 duration format. For example, in ISO 8601, P30D represents 30 days.

  5. Click Update.
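
The relationship between the global policy and stream overrides can be sketched as a simple lookup. This is an illustration of the rule described above, not Graylog's implementation. The function name, the stream names, and the dictionary of overrides are all hypothetical.

```python
from typing import Dict, Optional


def effective_retention(
    stream: str,
    global_policy: Optional[str],
    stream_overrides: Dict[str, str],
) -> Optional[str]:
    """Return the retention that applies to `stream`.

    A stream-level policy, if present, overrides the global policy for
    that stream only; all other streams fall back to the global policy.
    """
    return stream_overrides.get(stream, global_policy)


# Hypothetical example values: a global policy of 90 days and a single
# stream that must be kept for one year.
overrides = {"audit-logs": "P365D"}
audit_retention = effective_retention("audit-logs", "P90D", overrides)
firewall_retention = effective_retention("firewall", "P90D", overrides)
```

Here the audit-logs stream uses its own 365-day policy, while the firewall stream, which has no override, uses the global 90-day policy.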

Change Your Storage Backend

Warning: When you change your storage backend, you are required to delete all the data stored in your current backend. At this time, we recommend that you do NOT change your storage backend unless absolutely necessary because this data will be lost!

To change your active storage backend:

  1. Create a new storage backend or select one you have previously created.

  2. Click Activate.

    Graylog prompts you to confirm you want to change your storage backend. Graylog recommends you do not change your storage backend! All the data written to the previous storage backend must be deleted before you can switch.

    Warning: Deleting Data Lake data requires you to first stop routing data to the Data Lake. Note that if the affected streams are routing only to the Data Lake, you risk losing new data until you complete the process and start routing again with the new storage backend.

  3. Click Confirm to proceed.

The storage backend has now been switched. As new logs arrive, they are routed to the newly activated Data Lake storage backend.

Delete Backend Data

Before you can switch a storage backend, you must delete any data in the old storage backend. We recommend that you delete this data as follows:

  1. Navigate to the Overview tab of Data Lake > Setup.

  2. Disable the Data Lake for each stream that is routing data to this backend. Click Data Routing, then toggle the Data Lake to Disabled.

  3. Delete the data from each stream.

    1. Select More > Delete.

    2. Select the Full Delete check box.

    3. Click Delete.

  4. Verify that the message count for all streams reaches 0.
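
If you track many streams, the final check can be sketched as a small helper that lists which streams still prevent the switch. This is a hypothetical illustration: the StreamStatus fields are assumptions, and the values would be copied by hand from the Overview tab rather than read from a Graylog API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamStatus:
    # Hypothetical record of what the Overview tab shows for one stream.
    title: str
    data_lake_enabled: bool
    message_count: int


def streams_blocking_switch(streams: List[StreamStatus]) -> List[str]:
    """Return titles of streams still routing to, or holding data in, the backend."""
    return [
        s.title
        for s in streams
        if s.data_lake_enabled or s.message_count > 0
    ]


# Hypothetical example values.
statuses = [
    StreamStatus("firewall", data_lake_enabled=False, message_count=0),
    StreamStatus("audit-logs", data_lake_enabled=False, message_count=1200),
]
blocking = streams_blocking_switch(statuses)
```

An empty result means every listed stream has the Data Lake disabled and a message count of 0.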

Further Reading

Explore the following additional resources and recommended readings to expand your knowledge on related topics: