Data Lake
Graylog's Data Lake is a repository for storing large volumes of log data that is not immediately needed for search and analysis but is still important to retain. Graylog supports two types of data lake:
- Internal data lake: For this type, you create backend storage through Graylog, then you use data routing on ingested logs to store the logs in the data lake.
- External data lake: For this type, you connect Graylog to an existing third-party data lake.
For both types, you can preview log data in the data lake through Graylog. You can also retrieve targeted data for search and analysis, event and alert monitoring, building dashboards and reports, and more.
In this article, we review how to set up and manage internal and external data lakes, how to preview logs within a data lake, and how to retrieve logs from your data lake.
Prerequisites
Before proceeding, ensure that the following prerequisites are met:
- You must be a Graylog administrator to set up and manage a Data Lake.
- For an external data lake, you must have an existing third-party data lake to connect to.
Hint: Currently, only Amazon Security Lake is supported for external data lakes.
Highlights
The following highlights provide a summary of the key takeaways from this article:
- A data lake provides long-term storage for log data that can be previewed and retrieved when necessary and is generally a lower cost option than archives.
- Data Lake Preview provides a high-level view of stored data that you can use to help determine whether to start a data retrieval operation.
- You can retrieve log data from a data lake when you need it for search, analysis, visualization, reporting, etc.
- Data routed to an internal data lake and not your search backend does not count against your license usage until it is retrieved.
Data Lake vs. Archive
Currently, you can use both Data Lake and archives to preserve your log data long term. Both features perform similar functions, but an internal data lake offers some benefits for data that is less immediately valuable. Retrieving logs from a data lake is generally faster because retrieval is selective, so you pull only the data you need. Additionally, the data in a data lake is compressed, so it is often a lower-cost option for data storage.
Route Logs to an Internal Data Lake
To use an internal data lake in Graylog, you must first configure your backend storage solution. You can set up Amazon S3, Google Cloud Storage (GCS), or local network file storage as a backend storage option. You can also establish a data retention policy that determines how long your data is retained in the data lake.
To route log data to a configured backend storage solution, you must enable Data Routing on each individual stream containing data you plan to store and select Data Lake as one of your log destinations. All logs that are routed from those streams are written to the data lake immediately after processing.
Log data routed to a data lake can be previewed and retrieved at a later date for search and analysis, event and alert monitoring, building dashboards and reports, and more. Additionally, you can create filter rules for your selected streams that determine which logs are sent to the data lake and which logs should be sent to other destinations, such as an index set or an output.
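The effect of filter rules on routing can be pictured with a small model. The following Python sketch is purely illustrative and does not use any Graylog API: the `FilterRule` class, the `route` function, and the destination names are hypothetical. It also simplifies by assigning each message to a single destination and by assuming that a message matching no rule goes to the data lake, which is a choice made for this example rather than documented Graylog behavior.

```python
from dataclasses import dataclass
from typing import Callable

Message = dict[str, str]


@dataclass
class FilterRule:
    """Hypothetical rule: send messages that match `predicate` to `destination`."""
    predicate: Callable[[Message], bool]
    destination: str


def route(message: Message, rules: list[FilterRule], default: str = "data_lake") -> str:
    """Return the destination of the first matching rule, or the default if none match."""
    for rule in rules:
        if rule.predicate(message):
            return rule.destination
    return default


# Assumed example policy: errors go to the index set for immediate search,
# everything else is kept in the data lake.
rules = [
    FilterRule(lambda m: m.get("level") == "ERROR", "index_set"),
]

messages = [
    {"source": "web-01", "level": "ERROR"},
    {"source": "web-01", "level": "DEBUG"},
]
destinations = [route(m, rules) for m in messages]
```

In this model, the first message is sent to the index set and the second falls through to the data lake. In Graylog itself, you define filter rules in the stream's Data Routing configuration rather than in code.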
Prerequisites and setup steps differ for each type of storage backend. See the setup topic for your chosen backend: Amazon S3, Google Cloud Storage (GCS), or local network file storage.
Connect to an External Data Lake
The external data lake feature allows you to connect to an existing third-party data lake through Graylog. This connection provides the same preview and retrieval functions you get with an internal data lake.
To use an external data lake, you must create a connection on the Data Lake > External Lake Connectors page, which requires connection details and authentication to the third-party source.
Each connector you create must be associated with a system-managed stream, which stores the data retrieved through that connector. You cannot add stream rules, pipeline rules, routing destinations, or filter rules to a stream associated with a connector.
After you create a connector, you can preview or retrieve log data in the same way as with an internal data lake.
For complete information, see Create an External Data Lake Connector.
Preview Data in Your Data Lake
The Data Lake Preview provides a high-level view of the data stored in your data lake. You can use this feature with both internal and external data lakes.
The Preview page lets you apply filters to help you target specific data, which you can then inspect before deciding whether to retrieve it into Graylog for search and analysis and other functions.
Retrieve Logs from Your Data Lake
When you need to search and analyze your logs from a data lake, you must retrieve the data so that it can be written to your search backend. For an internal data lake, you retrieve log messages based on the streams used to route logs to the data lake, and logs are restored to the index set you specified when you initially created the stream. For external data lakes, you retrieve log messages based on the tables defined by the third-party solution's storage schema, and logs are restored to their associated system-managed stream.
You can perform a selective retrieval by applying filters and by setting the time range from which to pull the data. To ensure you retrieve the data you need, use Data Lake Preview first to check whether matching logs exist in the data lake.
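The preview-then-retrieve workflow can be sketched as a simple model. The Python below is an illustration only, not Graylog's API or storage format: the `select` function, the record layout, and the field names are all hypothetical. It shows the idea of narrowing by time range and field filters, checking the match count first, and retrieving only if something matched.

```python
from datetime import datetime, timezone


def select(records, start, end, **field_filters):
    """Return records with start <= timestamp < end whose fields equal every filter value."""
    return [
        r for r in records
        if start <= r["timestamp"] < end
        and all(r.get(key) == value for key, value in field_filters.items())
    ]


# Hypothetical stored records; real data lake contents are not structured like this.
stored = [
    {"timestamp": datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc), "source": "fw-01", "action": "deny"},
    {"timestamp": datetime(2024, 5, 1, 11, 0, tzinfo=timezone.utc), "source": "fw-01", "action": "allow"},
    {"timestamp": datetime(2024, 5, 2, 10, 0, tzinfo=timezone.utc), "source": "fw-01", "action": "deny"},
]

start = datetime(2024, 5, 1, tzinfo=timezone.utc)
end = datetime(2024, 5, 2, tzinfo=timezone.utc)

# "Preview": count the matches before committing to a retrieval.
preview_count = len(select(stored, start, end, action="deny"))

# "Retrieve": pull the matching records only if the preview found any.
retrieved = select(stored, start, end, action="deny") if preview_count > 0 else []
```

Narrowing the time range and filters before retrieving keeps the operation small, which matters because retrieved data is written to the search backend.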
Manage Your Data Lake
After you set up data lakes, you can monitor and manage them from the Data Lake section in Graylog.
For internal data lakes, navigate to Data Lake > Internal Lake Setup. The Data Lake Jobs section lists current and recent jobs running against the Data Lake with their status. Use this information to troubleshoot any issues that occur as well as to plan data retrieval operations.
This page also lists all the streams that are routing logs into the internal Data Lake with details about each. From the streams list, you can start a preview logs or retrieve logs action for a stream. Click Data Routing to review or update the data routing definition for the stream.
For external data lakes, navigate to Data Lake > External Lake Connectors. This page lists any external data lake connectors you have defined. From the connectors list, you can start a preview logs or retrieve logs action for the connector. You can navigate to More > Create Input to create and launch an input tied to a connector, which can continuously ingest appropriate logs from the source.
Navigate to Data Lake > Retrievals for a complete list of data retrievals that have been performed. This list includes retrievals from both internal and external data lakes. Select Show messages to view the data for a specific retrieval operation.