Logging at scale using Graylog - Billion+ messages, 100K req/sec
What is Graylog?
- Open source log management that actually works.
- Search, analysis, alerting, and a lot more.
Ola Infrastructure Overview
- Hundreds of micro-services
- > 100k messages per sec
- Billion+ logs per day
When did we start using it?
- ELK clusters are maintenance-intensive
- Infra v2 revamp with centralized logging as a basic requirement
Great UI, best for viewing logs
Easy management of Elasticsearch indices
Real-time log analysis and alerts
Problems & learnings, the Ola story!
#1: Initial Pipeline
#1 Huge lag for application logs in Graylog UI
#2 Docker service crash due to Fluentd log driver
#3 Exceptions in Graylog server due to 3 MB log messages
#3 Truncate log messages before sending to Kafka
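
A minimal sketch of the truncation step, assuming it runs in the shipper or application just before the message is produced to Kafka. The function name, the 32 KiB default limit, and the marker text are hypothetical; the slides do not state the limit actually used.

```python
def truncate_message(message: str,
                     max_bytes: int = 32 * 1024,   # assumed limit, not from the talk
                     marker: str = "...[truncated]") -> str:
    """Return message cut down to at most max_bytes of UTF-8."""
    encoded = message.encode("utf-8")
    if len(encoded) <= max_bytes:
        return message

    marker_bytes = marker.encode("utf-8")
    if len(marker_bytes) >= max_bytes:
        # Limit too small to fit the marker: cut hard without it.
        return encoded[:max_bytes].decode("utf-8", errors="ignore")

    keep = max_bytes - len(marker_bytes)
    # errors="ignore" drops a multi-byte character split at the cut point,
    # so the result is always valid UTF-8.
    head = encoded[:keep].decode("utf-8", errors="ignore")
    return head + marker
```

Measuring in encoded bytes rather than characters matters because downstream size limits apply to the bytes on the wire, not to the character count.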
#4 Inconsistent schema problem
#4 Convert everything to string at source
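
A rough sketch of converting every field to a string at the source, assuming log records are flat-ish dictionaries. The presumed motivation is that Elasticsearch fixes a field's type from the first document it indexes, so a field that arrives as a number from one service and as text from another causes indexing failures. The function name and the sample record are hypothetical.

```python
import json


def stringify_fields(record: dict) -> dict:
    """Return a copy of record in which every value is a string."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            out[key] = value
        else:
            # JSON-encode numbers, booleans, None, and nested structures so
            # the same field always reaches the index as text.
            # default=str covers values JSON cannot encode (e.g. datetimes).
            out[key] = json.dumps(value, sort_keys=True, default=str)
    return out


record = {"status": 200, "ok": True, "user": {"id": 7}, "msg": "hi"}
print(stringify_fields(record))
```

The trade-off is that numeric range queries on these fields are no longer possible, which is the price paid here for a schema that never conflicts.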
#5 Journal Utilisation too high, uncommitted messages deleted from journal
#5 Final setting - disable buffers and journal
#6 Missing logs due to slow Fluentd Kafka plugin
#6 Heka is superfast, 10x less CPU, 5x less memory
Who all loved it?
- Developers
- DevOps
- Management