Data Lake
The following article exclusively pertains to a Graylog Enterprise feature or functionality. To learn more about obtaining an Enterprise license, please contact the Graylog Sales team.
A Data Lake is a repository for log data that lets you store large amounts of data that are not immediately required for search and analysis in Graylog but that you still want to retain. Logs can be routed to a Data Lake as part of the Data Routing function in Graylog. You can use either Amazon S3 or local network file storage as backend storage for your Data Lakes.
Routing logs to a Data Lake is enabled on an individual stream basis, and all logs that are filtered from that stream to the Data Lake are written there immediately after processing. Log data routed to a Data Lake can also be previewed and retrieved at a later date for search and analysis, event and alert monitoring, building dashboards and reports, and much more.
In this section of the documentation, we review how to set up and manage a Data Lake, how to preview logs within a Data Lake, and how to retrieve logs from your Data Lake.
Prerequisites
Before proceeding, ensure that the following prerequisites are met:
- You must be a Graylog administrator to set up and manage a Data Lake.
Highlights
The following highlights provide a summary of the key takeaways from this article:
- A Data Lake provides long-term storage for log data that can be previewed and retrieved when necessary and is generally a lower cost option than archives.
- Data Lake preview provides a high-level view of stored data that you can use to help determine whether to start a data retrieval operation.
- You can retrieve log data from the Data Lake when you need it for search, analysis, visualization, reporting, etc.
- Data routed to the Data Lake and not your search backend does not count against your license usage until it is retrieved.
Data Lake vs. Archive
Currently, you can use both a Data Lake and archives to preserve your log data long term. Both features perform similar functions, but a Data Lake offers some advantages for less immediately valuable data. Retrieving logs from a Data Lake is generally faster because retrieval is granular. Additionally, data in a Data Lake is compressed, so it is often a lower cost option for data storage.
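As a rough illustration of why compression lowers storage cost, the sketch below compares the monthly cost of raw versus compressed storage. The compression ratio and per-GB price are placeholder assumptions for the example, not published Graylog or cloud-provider figures.

```python
# Illustrative only: the compression ratio and storage price below are
# assumptions for this example, not Graylog or cloud-provider figures.

def monthly_storage_cost(gb_raw: float, price_per_gb: float,
                         compression_ratio: float = 1.0) -> float:
    """Cost of storing gb_raw of raw log data for one month.

    compression_ratio is raw size / stored size (1.0 = uncompressed).
    """
    return (gb_raw / compression_ratio) * price_per_gb

raw_gb = 1000     # 1 TB of raw logs per month (assumed)
price = 0.023     # $/GB-month, placeholder object-storage price

uncompressed = monthly_storage_cost(raw_gb, price)
compressed = monthly_storage_cost(raw_gb, price, compression_ratio=5.0)

print(f"uncompressed: ${uncompressed:.2f}/month")  # $23.00
print(f"compressed:   ${compressed:.2f}/month")    # $4.60
```

At an assumed 5:1 ratio, the same raw volume costs a fifth as much to keep, which is the effect the comparison above describes.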
Route Your Logs to Your Data Lake
You use a Data Lake primarily for long-term storage of log data. First, you must configure your backend storage solution. You can also establish a data retention policy that determines how long your data is retained in the Data Lake.
To route log data to a configured backend storage solution, you need to enable Data Routing on the stream containing the data you plan to store and select Data Lake as one of your log destinations. Additionally, you can create filter rules for your selected stream that determine which logs are sent to the Data Lake and which logs should be sent to other destinations, such as an index set or an output.
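In Graylog, filter rules are configured per stream in the Data Routing UI. Purely as a conceptual sketch of the decision logic those rules express, the Python below routes each message to the first destination whose filter matches, falling back to a default. The field names, destinations, and rule shape are hypothetical, for illustration only.

```python
# Conceptual sketch of Data Routing filter logic. In Graylog these rules
# are configured in the UI per stream; the field names and destinations
# here are hypothetical.

from typing import Callable

Destination = str  # e.g. "data_lake" or "index_set"

def make_router(rules: list[tuple[Callable[[dict], bool], Destination]],
                default: Destination) -> Callable[[dict], Destination]:
    """Return a function that routes a log message to the destination of
    the first matching filter rule, or to the default when none match."""
    def route(message: dict) -> Destination:
        for matches, destination in rules:
            if matches(message):
                return destination
        return default
    return route

# Example: send verbose debug logs straight to the Data Lake and keep
# everything else searchable in the stream's index set.
route = make_router(
    rules=[(lambda m: m.get("level") == "DEBUG", "data_lake")],
    default="index_set",
)

print(route({"level": "DEBUG", "message": "cache miss"}))  # data_lake
print(route({"level": "ERROR", "message": "disk full"}))   # index_set
```

The point of the sketch is that routing is per message: within one stream, some logs can go to low-cost Data Lake storage while the rest stay searchable.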
Preview Data in Your Data Lake
The Data Lake Preview provides a high-level view of the data stored in your Data Lake. The Preview page lets you apply filters to help you target specific data, which you can then inspect before deciding whether to retrieve it into Graylog for search and analysis and other functions.
Retrieve Logs from Your Data Lake
When you need to search and analyze your logs from a Data Lake, you must first retrieve the data so that it can be written to your search backend. You retrieve log messages based on the streams used to route logs to your Data Lake, and logs are restored to the index set you specified when you initially created the stream.
You can perform a selective retrieval by applying filters and by setting the time range from which to pull the data. To ensure you retrieve the data you need, use the Data Lake preview first to confirm that matching logs appear in the Data Lake.
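The selection semantics of a retrieval, a time range plus optional filters, can be sketched as below. This is a conceptual model only; in Graylog you set the range and filters in the retrieval UI, and the field names used here are hypothetical.

```python
# Sketch of selective-retrieval semantics: yield only stored messages that
# fall inside a time range and match optional field filters. Field names
# are hypothetical; Graylog configures this in the retrieval UI.

from datetime import datetime, timezone

def select_for_retrieval(messages, start, end, filters=None):
    """Yield messages whose timestamp falls in [start, end) and whose
    fields match every key/value pair in filters."""
    filters = filters or {}
    for msg in messages:
        in_range = start <= msg["timestamp"] < end
        matches = all(msg.get(k) == v for k, v in filters.items())
        if in_range and matches:
            yield msg

lake = [
    {"timestamp": datetime(2024, 5, 1, tzinfo=timezone.utc), "source": "web-1"},
    {"timestamp": datetime(2024, 5, 2, tzinfo=timezone.utc), "source": "db-1"},
    {"timestamp": datetime(2024, 6, 9, tzinfo=timezone.utc), "source": "web-1"},
]

picked = list(select_for_retrieval(
    lake,
    start=datetime(2024, 5, 1, tzinfo=timezone.utc),
    end=datetime(2024, 6, 1, tzinfo=timezone.utc),
    filters={"source": "web-1"},
))
print(len(picked))  # 1
```

Narrowing both the range and the filters keeps the retrieval small, which matters because retrieved data counts against license usage.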
Manage Your Data Lake
After you set up a Data Lake, you can monitor and manage it from the Overview tab at Data Lake > Setup. The Data Lake Jobs section lists current and recent jobs running against the Data Lake with their status. Use this information to troubleshoot any issues that occur as well as to plan data retrieval operations.
This page also lists all the streams that are routing logs into the Data Lake, with details about each. From the streams list, you can start a preview or retrieval action for a stream. Click Data Routing to review or update the stream's data routing configuration.
Further Reading
Explore the following additional resources and recommended readings to expand your knowledge on related topics: