Create a Data Lake Backend on GCS

The following article exclusively pertains to a Graylog Enterprise feature or functionality. To learn more about obtaining an Enterprise license, please contact the Graylog Sales team.

Before you can start routing Graylog data to a Data Lake, you must first set up your backend storage. This article shows you how to set up a Data Lake storage backend in Graylog using Google Cloud Storage (GCS).

You can also use Amazon Web Services (AWS) or a local file system to set up your backend.

Prerequisites

Before proceeding, ensure that the following prerequisites are met:

Warning: We strongly recommend that you use either an Amazon S3 or GCS bucket as the backend storage for logs routed to a Data Lake. If you store logs on a local file system and reach your storage capacity, you must switch your backend to gain more capacity, and changing your storage backend requires that you delete all of the data housed in your current backend storage solution!

  • You must be a Graylog administrator to set up and manage a Data Lake.

  • To use GCS, you must have an existing GCS bucket and appropriate credentials. GCS requires additional setup as described in the next section.

Google Cloud Prerequisite Setup

Before you can establish Google Cloud Storage (GCS) as a backend, you must complete setup on your Google Cloud account.

  1. Create a GCS bucket. Follow Google's documentation on buckets to complete this process. Note the following:

    • To create a bucket, you must have the Storage Admin IAM role assigned for the project.

    • The bucket name must be globally unique, and you cannot change this name after the bucket is created. Make sure to note your bucket name as you need to provide it in the backend setup process in Graylog.

    • The default Standard storage class is recommended. However, depending on your use case, you might determine that a different class is a better fit. Make sure that you understand the cost implications of each storage class.

    • When setting access control and data protection and retention, be sure to follow your company guidelines and security best practices. Also, be aware that your choices can have cost implications from Google.
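If you prefer the CLI over the Google Cloud console, the bucket can also be created with gcloud. This is a sketch only: the bucket name, location, and flag choices below are assumptions to adapt to your environment, and the command is skipped when the gcloud CLI is not installed.

```shell
# Illustrative bucket name; must be globally unique in your environment.
BUCKET=gs://my-graylog-datalake

# Create the bucket with the Standard storage class (skipped if gcloud
# is not installed or not authenticated on this machine).
if command -v gcloud >/dev/null 2>&1; then
  gcloud storage buckets create "$BUCKET" \
    --location=us-central1 \
    --default-storage-class=STANDARD \
    --uniform-bucket-level-access
else
  echo "gcloud CLI not installed; create the bucket in the console instead"
fi
```

Note the bucket name you choose here; you will enter it later in the Graylog backend setup form.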

  2. Create a Google Cloud service account. Follow Google's documentation on service accounts to complete this process. Set permissions for this account such that it can read, write, and delete from the bucket.
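The service account step can likewise be sketched with gcloud. The account name, project ID, and bucket below are assumptions; `roles/storage.objectAdmin` is one role that covers object read, write, and delete on a bucket, but follow your own IAM policy when choosing roles.

```shell
# Illustrative names; substitute your own.
SA_NAME=graylog-datalake
PROJECT_ID=my-gcp-project
BUCKET=gs://my-graylog-datalake

if command -v gcloud >/dev/null 2>&1; then
  # Create the service account.
  gcloud iam service-accounts create "$SA_NAME" \
    --project="$PROJECT_ID" \
    --display-name="Graylog Data Lake"

  # Grant object read/write/delete on the bucket.
  gcloud storage buckets add-iam-policy-binding "$BUCKET" \
    --member="serviceAccount:${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"
fi
```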

  3. In Google Cloud, set up Application Default Credentials (ADC). Follow Google's documentation on ADC to complete this process. Depending on your environment, the steps might be as follows:

    1. Download your service account key file from the Google Cloud console.

    2. On every Graylog node, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the key file with a command like the following:

      export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/key.json

      Be sure to replace the path in the command above with the actual location of your service account key file.
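      Note that a variable exported in an interactive shell is not visible to the Graylog service process. On package-based installs, the variable is typically set in the service's environment file instead, for example /etc/default/graylog-server on DEB systems (this path is an assumption; it may differ on your deployment):

```
GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/key.json
```

      Restart the Graylog service after changing its environment file so the new variable takes effect.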

    3. To configure ADC with your Google account, run the following command in the Google Cloud CLI:

      gcloud auth application-default login

    Hint: You must complete ADC setup on all Graylog nodes in your environment!
    Google provides instructions for setting up ADC on multiple environment types, including development, on-premises, cloud, and containerized. Use the instructions that match your Graylog deployment.
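To sanity-check the ADC setup on each node, you can ask gcloud for a token and list the bucket. This is a sketch; the bucket name is an assumption, and the checks are skipped when the gcloud CLI is not installed.

```shell
# Verify that ADC can mint an access token and that the bucket is
# reachable with the configured credentials (illustrative bucket name).
if command -v gcloud >/dev/null 2>&1; then
  gcloud auth application-default print-access-token >/dev/null \
    && echo "ADC credentials OK"
  gcloud storage ls gs://my-graylog-datalake >/dev/null \
    && echo "bucket reachable"
fi
ADC_CHECK=attempted
```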

Create a GCS Storage Backend

To create a GCS storage backend for your Data Lake:

  1. Navigate to Data Lake > Setup. If you have existing backends, select the Backend tab. Any existing storage backends are displayed here.

  2. Select Create Data Lake Backend.

  3. Select GCS from the dropdown as the Backend Type.

  4. Enter configuration details for your GCS backend:

    Title

    Enter a unique and descriptive name for the backend.

    Description

    Enter a description of the backend.

    GCS Bucket

    Enter the name of the GCS bucket in which logs will be stored.

    Project ID (optional)

    Enter your Google Cloud project’s unique identifier.

    The Google Cloud Project ID is a user-selected unique name that you can use to reference your Google Cloud project from Graylog. However, additional configuration is required, such as enabling the Google Cloud API, granting appropriate IAM roles, and additional authentication and authorization steps. Consult the Google Cloud documentation for complete information.

    Endpoint URI (optional)

    Enter a custom endpoint for accessing the storage. Leave this field blank to use the default value.

    GCS Output base path

    Enter the base path where the archives should be stored within the GCS bucket.

    You can use a single bucket for multiple purposes; for instance, the same bucket can serve both a Data Lake backend and a warm tier snapshot backend. If you do, use a different subfolder for each purpose. The base path you set here determines the subfolder structure for this backend.
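    For example, a single bucket shared between two backends might use a layout like the following (bucket and subfolder names are illustrative):

```
my-graylog-bucket/
├── datalake/      <- output base path for this Data Lake backend
└── warm-tier/     <- separate base path for a warm tier snapshot backend
```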

    Warning: This value can only be set on backend creation and cannot be changed at a later date!

  5. Click Create to complete configuration of the storage backend.

  6. Click Activate to make this the active storage backend. You must activate the backend before it can be used for storage. You can have multiple backends defined, but only one can be active. See the warning below about data loss if you are switching from an existing storage backend.

If you need to update settings for the Data Lake, such as changing access credentials, click Edit. You are presented with the same options as on initial creation. As noted, you cannot change the output base path after your initial save, but you can update the other settings.

Change Your Storage Backend

Warning: When you change your storage backend, you are required to delete all the data stored in your current backend. At this time, we recommend that you do NOT change your storage backend unless absolutely necessary because this data will be lost!

To change your active storage backend:

  1. Create a new storage backend or select one you have previously created.

  2. Click Activate.

    Graylog prompts you to confirm you want to change your storage backend. Graylog recommends you do not change your storage backend! All the data written to the previous storage backend must be deleted before you can switch.

    Warning: Deleting Data Lake data requires you to first stop routing data to the Data Lake. Note that if the affected streams are routing only to the Data Lake, you risk losing new data until you complete the process and start routing again with the new storage backend.

  3. Click Confirm to proceed.

The storage backend has now been switched. As new logs arrive, they are routed to the newly activated Data Lake storage backend.

Delete Backend Data

Before you can switch storage backends, you must delete any data in the old backend. To delete this data:

  1. Navigate to the Overview tab of Data Lake > Setup.

  2. Disable the Data Lake for each stream that is routing data to this backend. Click Data Routing, then toggle the Data Lake to Disabled.

  3. Delete the data from each stream.

    1. Select More > Delete.

    2. Select the Full Delete check box.

    3. Click Delete.

  4. Verify that the message count for all streams reaches 0.

Further Reading

Explore the following additional resources and recommended readings to expand your knowledge on related topics: