Data Lake

The following article exclusively pertains to a Graylog Enterprise feature or functionality. To learn more about obtaining an Enterprise license, please contact the Graylog Sales team.

A Data Lake is a repository for log data that allows you to store large amounts of data that are not immediately required for search and analysis in Graylog but that you still want to retain. Logs can be routed to a Data Lake as a part of the Data Routing function in Graylog. You can utilize Amazon S3, Google Cloud Storage (GCS), or local network file storage as backend storage for your Data Lakes.

Warning: S3-compatible storage from providers other than Amazon is not supported!

Routing logs to a Data Lake is enabled on a per-stream basis, and all logs routed from that stream to the Data Lake are written there immediately after processing. Log data routed to a Data Lake can also be previewed and retrieved at a later date for search and analysis, event and alert monitoring, building dashboards and reports, and much more.

In this section of the documentation, we review how to set up and manage a Data Lake, how to preview logs within a Data Lake, and how to retrieve logs from your Data Lake.

Hint: If your license expires, you can still write data to the Data Lake, but these logs cannot be previewed or retrieved until the license is renewed.

Prerequisites

Before proceeding, ensure that the following prerequisites are met:

You must be a Graylog administrator to set up and manage a Data Lake.

Highlights

The following highlights provide a summary of the key takeaways from this article:

A Data Lake provides long-term storage for log data that can be previewed and retrieved when necessary. It is generally a lower-cost option than archives.
Data Lake preview provides a high-level view of stored data that you can use to help determine whether to start a data retrieval operation.
You can retrieve log data from the Data Lake when you need it for search, analysis, visualization, reporting, etc.
Data routed to the Data Lake, and not to your search backend, does not count against your license usage until it is retrieved.

Data Lake vs. Archive

Currently, you can utilize both a Data Lake and archives to preserve your log data long term. Both features perform similar functions, but utilizing a Data Lake includes some benefits for less immediately valuable data. Retrieving logs from a Data Lake is generally a faster process because log retrieval is granular. Additionally, the data in a Data Lake is compressed, so it is often a lower cost option for data storage.

Route Your Logs to Your Data Lake

You use a Data Lake primarily for long-term storage of log data. First, you must configure your backend storage solution. You can also establish a data retention policy that determines how long your data is retained in the Data Lake.

To route log data to a configured backend storage solution, you need to enable Data Routing on the stream containing the data you plan to store and select Data Lake as one of your log destinations. Additionally, you can create filter rules for your selected stream that determine which logs are sent to the Data Lake and which logs should be sent to other destinations, such as an index set or an output.
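The routing idea described above can be pictured with a small, purely illustrative Python model. It does not use Graylog's API or configuration format. The `destinations_for` function, the rule dictionary, and the destination names `data_lake` and `index_set` are hypothetical and only mirror the notion that filter rules on a stream decide which logs reach which destination.

```python
from typing import Callable

# A log message is modeled as a flat dictionary of string fields (assumption).
Message = dict[str, str]
Rule = Callable[[Message], bool]


def destinations_for(message: Message, rules: dict[str, Rule]) -> list[str]:
    """Return every destination whose filter rule accepts the message."""
    return [name for name, accepts in rules.items() if accepts(message)]


# Hypothetical rules: debug logs go to the Data Lake only,
# everything else goes to the index set only.
rules: dict[str, Rule] = {
    "data_lake": lambda m: m.get("level") == "debug",
    "index_set": lambda m: m.get("level") != "debug",
}

debug_targets = destinations_for({"level": "debug", "message": "cache miss"}, rules)
error_targets = destinations_for({"level": "error", "message": "disk full"}, rules)
```

In this sketch a message can match more than one destination if the rules overlap, which reflects that the Data Lake is selected as one of possibly several destinations on a stream.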

Preview Data in Your Data Lake

The Data Lake Preview provides a high-level view of the data stored in your Data Lake. The Preview page lets you apply filters to help you target specific data, which you can then inspect before deciding whether to retrieve it into Graylog for search and analysis and other functions.

Retrieve Logs from Your Data Lake

When you need to search and analyze your logs from a Data Lake, you first must retrieve the data so that it can be written to your search backend. You retrieve log messages based on the streams used to route logs to your Data Lake, and logs are restored to the index set you specified upon initially creating the stream.

You can perform a selective retrieval by applying filters and by setting the time range from which to pull the data. To ensure you retrieve the data you need, you can use preview first to determine if matching logs appear in the Data Lake.
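As a rough illustration of selective retrieval, the following Python sketch filters an in-memory list of stored records by time range and by field values, and checks for matches before "retrieving" them. It is a conceptual model only, not Graylog's retrieval mechanism. The `select_logs` function, the record layout, and the sample data are all assumptions made for the example.

```python
from datetime import datetime, timezone


def select_logs(records, start, end, **field_filters):
    """Return records with start <= timestamp < end whose fields equal all filters."""
    return [
        r for r in records
        if start <= r["timestamp"] < end
        and all(r.get(key) == value for key, value in field_filters.items())
    ]


utc = timezone.utc
# Hypothetical contents of a Data Lake for one stream.
stored = [
    {"timestamp": datetime(2024, 1, 1, 12, 0, tzinfo=utc), "source": "web-1", "message": "login failed"},
    {"timestamp": datetime(2024, 1, 2, 12, 0, tzinfo=utc), "source": "web-1", "message": "login ok"},
    {"timestamp": datetime(2024, 1, 1, 13, 0, tzinfo=utc), "source": "db-1", "message": "slow query"},
]

start = datetime(2024, 1, 1, tzinfo=utc)
end = datetime(2024, 1, 2, tzinfo=utc)

# "Preview" step: check whether anything matches before retrieving.
matches = select_logs(stored, start, end, source="web-1")

# "Retrieve" step: only the matching records would be written to the search backend.
retrieved = matches if matches else []
```

Narrowing the time range and filters before retrieval keeps the retrieved set small, which matters because retrieved data counts against license usage.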

Warning: Logs that are routed to a Data Lake and not sent to your search backend do not count against license usage while they remain in the Data Lake. Once retrieved, they count against license usage!

Create Data Lake Backend Storage

Before you can start routing data to a Data Lake, you must first set up your backend storage. You can utilize Amazon S3, Google Cloud Storage (GCS), or local network file storage as backend storage for your Data Lake.

Prerequisites and setup steps are different for each type of backend. Check the appropriate topic for your choice of backend (Amazon S3, Google Cloud Storage, or local network file storage).

Manage Your Data Lake

After you set up a Data Lake, you can monitor and manage it from the Overview tab at Data Lake > Setup. The Data Lake Jobs section lists current and recent jobs running against the Data Lake with their status. Use this information to troubleshoot any issues that occur as well as to plan data retrieval operations.

This page also lists all the streams that are routing logs into the Data Lake, with details about each. From the streams list, you can preview or retrieve logs for a stream. Click Data Routing to review or update the data routing definition for the stream.