Graylog Architecture

These days we deal with an abundance of data. This data comes from various sources like devices, applications, and operating systems. A centralized Log Management System (LMS) like Graylog provides a means to aggregate, organize, and make sense of all this data.

Log files are essentially text files. They contain an abundance of information--application name, IP address, time stamp, and source destination. All applications and even operating systems themselves create these logs containing massive amounts of data, which needs to be parsed if we want to make any sense of it.

An LMS must also be efficient in collecting and parsing petabytes of data. Once it has been parsed, log data can provide extremely useful information for forensic investigations, threat hunting, and business analytics in general. Whatever the use case, Graylog can help businesses look further into their data and save time and human resources.

Graylog Architecture

Graylog has been successful in providing an LMS because it was built for log management from the beginning. Software that stores and analyzes log data must have a specific architecture to do it efficiently. It is more than just a database or a full text search engine because it deals with both text data and metrics data on a time axis. Searches are always bound to a time frame (relative or absolute), and only going back into the past because future log data has not been written yet. A general purpose database or full text search engine that could also store and index the private messages of your online platform for search will never be able to effectively manage your log data. Adding a specialized frontend on top might make it look like it could do the job, but it is basically just putting lipstick on the wrong stack.

An LMS has to be constructed of several services that take care of processing, indexing, and data access. You need to scale parts of your LMS horizontally along with your changing use cases, and the different parts of the system typically have different hardware requirements. All services must be tightly integrated to allow efficient management and configuration of the system as a whole. A data ingestion or forwarder tool is tedious to manage if the configuration has to be stored on the client machines and is not possible via REST APIs, for example, controlled by a simple interface. A system administrator needs to be able to log into the web interface of a log management product and select log files of a remote host (that has a forwarder running) for ingestion into the tool.

You also want to be able to see the health and configuration of all forwarders, data processors, and indexers in a central place because the whole log management stack can easily involve thousands of machines if you include the log emitting clients into this calculation. You need to be able to see which clients are forwarding log data and which are not to make sure that you are not missing any important data.

Hint: Please note that Graylog cannot modify saved log data. It provides the core centralized log management functionality you need to aggregate, organize, and interpret data so that you can derive meaningful insights while maintaining data integrity.

Graylog Core Features

There are many features that enhance Graylog’s usefulness as a flexible tool.

Streams operate as a form of tagging for incoming messages. Streams route messages into categories in real time, and team rules instruct Graylog to route messages into the appropriate stream.
Streams are used to route data for storage into an index. They are also used to control access to data, and route messages for parsing, enrichment, and other modification. Streams then determine which messages to archive.
The Graylog Search page is the interface used to search logs directly. Graylog uses a simplified syntax, very similar to Lucene. Relative or absolute time ranges are configurable from drop down menus. Searches may be saved or visualized as dashboard widgets that may be added directly to dashboards from within the search screen.
Users may configure their own views and may choose to see either a summary or complete data from event messages.
Graylog Dashboards are visualizations or summaries of information contained in log events. Each dashboard is populated by one or more widgets. Widgets visualize or summarize event log data with data derived from field values such as counts, averages, or totals. Users can create indicators, charts, graphs, and maps to visualize the data.
Dashboard widgets and dashboard layouts are configurable. Graylog's role-based access controls dashboard access. Users can import and export dashboards via content packs.
Alerts are created using Event Definitions that consist of Conditions. When a given condition is met it will be stored as an Event and can be used to trigger a notification.
Content packs accelerate the set-up process for a specific data source. A content pack can include inputs/extractors, streams, dashboards, alerts, and pipeline processors.For example, users can create custom inputs, streams, dashboards, and alerts to support a security use case. Users can then export the content pack and import it on a newly installed Graylog instance to save configuration time and effort.
Users may download content packs which are created, shared and supported by other users via the Graylog Marketplace.
An Index is the basic unit of storage for data in OpenSearch and Elasticsearch. Index sets provide configuration for retention, sharding, and replication of the stored data.Values, like retention and rotation strategy, are set on a per-index basis, so different data may be subjected to different handling rules.
Graylog Sidecar is an agent to manage fleets of log shippers, like Beats or NXLog. These log shippers are used to collect OS logs from Linux and Windows servers. Log shippers read logs written locally to a flat file, and then send them to a centralized log management solution. Graylog supports management of any log shipper as a backend.
Graylog’s Processing Pipelines enable the user to run a rule, or a series of rules, against a specific type of event. Tied to streams, pipelines allow routing, denylisting, modification, and enrichment of messages as they flow through Graylog.

The Graylog Extended Log Format (GELF)

Graylog also introduces the Graylog Extended Log Format (GELF), which improves on the limitations of standard syslog, such as:

Limited to length of 1024 bytes. Inadequate space for payloads like backtraces.
No data types in structured syslog. Numbers and strings are indistinguishable.
The RFCs are strict enough, but there are too many syslog dialects to parse them all.
No compression.

Improvements on these issues make GELF a better choice for logging from within applications. The Graylog Marketplace offers libraries and appenders for easily implementing GELF in many programming languages and logging frameworks. Also, GELF can be sent via UDP so it cannot break your application from within your logging class.

For more information please refer to Parsing Log Files in Graylog.

The Basis: Graylog + Data Node + MongoDB

Log data is stored in a data node, either Elasticsearch or OpenSeach. Both are open source search engines. Your data node allows you to index and search through all the messages in your Graylog message database.

This combination grants users the ability to conduct an in depth analysis of petabytes of data. Because your data is accessible to your data node you are not dependent on a third party search tool. This also means that you have great flexibility in customizing your own search queries. You also have access to the many open-source projects and tools provided by Elasticsearch or OpenSearch.

The Graylog Server component sits in the middle and works around your data node which is a full text search engine and not a log management system. It also builds an abstraction layer on top of it to make data access as easy as possible without having to select indices and write tedious time range selection filters, etc. Just submit the search query and Graylog will take care of the rest for you.

Graylog uses MongoDB to store data. Only metadata such as user information and stream configurations are stored here, not log data. So MongoDB will not have a huge system impact. It runs alongside the Graylog server processes and takes up minimal space.

Community and Resources

The Graylog Community is 8,000 members strong. These Graylog users make valuable contributions to the product by asking questions, offering input for better performance, and providing support for fellow community members.

The Graylog Marketplace is where you can find plugins, extensions, content packs, a GELF library, and other solutions contributed by developers and community members.