Logging: Migrating from ELK to PLG

Why replace ELK

I'd been using ELK (plus Kafka and Filebeat) for log collection and analysis for more than three years. It works well and is very powerful, though the learning curve is steep — I spent a long time wrestling with grok back in the day. Its bigger drawback is resource consumption: in practice, it simply isn't viable in resource-constrained environments, which limits where it can be used.

Recently I heard that Grafana had released Loki, which together with Promtail forms the PLG logging stack, so I decided to give it a try. Initial testing shows that PLG really does use far fewer resources. The main reason is that in ELK, both ElasticSearch and Logstash are written in Java (Logstash additionally uses JRuby), and they are memory hogs:

To get ELK running at all, the baseline is roughly:

  • ElasticSearch: 2 GB RAM
  • Logstash: 1 GB RAM
  • Kibana: 100-200 MB (Node.js is far kinder than the JVM)
  • Filebeat: a few tens of MB (Go is kinder still)
  • Kafka: optional, so not counted

PLG is far less demanding: all three components together run in about 100-200 MB.

The PLG components map to ELK roughly as follows:

  • Grafana is the web front end, corresponding to Kibana
  • Loki is the storage and query engine, corresponding to ElasticSearch
  • Promtail is the log collection and parsing agent, corresponding to Logstash + Filebeat

Installing Loki and Grafana

Docker is by far the easiest way; see the following docker-compose.yml:

version: '2'
services:
  loki:
    image: grafana/loki:master
    container_name: loki
    restart: always
    ports:
      - 127.0.0.1:3100:3100
    volumes:
      - /var/lib/loki:/loki
      - /etc/loki:/etc/loki
  grafana:
    image: grafana/grafana:master
    container_name: grafana
    restart: always
    depends_on:
      - loki
    ports:
      - 127.0.0.1:3000:3000
    volumes:
      - /var/lib/grafana:/var/lib/grafana

Note that the following directories need to be created first:

/etc/loki: owner/group root
/var/lib/loki: owner/group 10001 (the UID of the loki user inside the container)
/var/lib/grafana: owner/group 472 (the UID of the grafana user inside the container)
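The directory setup can be scripted as follows (the UIDs 10001 and 472 are the loki and grafana users inside the respective containers; this must be run as root, since chown to another UID requires it):

```shell
# Create the host directories with the ownership the containers expect (run as root)
mkdir -p /etc/loki /var/lib/loki /var/lib/grafana
chown root:root   /etc/loki          # config dir stays root-owned
chown 10001:10001 /var/lib/loki      # UID of the loki user inside the container
chown 472:472     /var/lib/grafana   # UID of the grafana user inside the container
```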

Loki's configuration file, /etc/loki/local-config.yaml:

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m       # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 1h           # All chunks will be flushed when they hit this age, default is 1h
  #  chunk_target_size: 1048576  # Loki will attempt to build chunks up to 1MB, flushing first if chunk_idle_period or max_chunk_age is reached first
  chunk_retain_period: 30s    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0     # Chunk transfers disabled

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
      chunks:
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks

compactor:
  working_directory: /loki/boltdb-shipper-compactor
  shared_store: filesystem

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h  # 7days

chunk_store_config:
  max_look_back_period: 2160h  # 90days

table_manager:
  retention_deletes_enabled: false
  retention_period: 2160h  # 90days

ruler:
  storage:
    type: local
    local:
      directory: /loki/rules
  rule_path: /loki/rules-temp
  alertmanager_url: http://localhost:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true

This is the default configuration with a few retention/timing values changed; see the official documentation for the full parameter reference.

Now start it up:

docker-compose up -d

Once it's up, log in to Grafana and add Loki as a data source. The default Grafana credentials are admin/admin. Note: the Loki address is the container name (http://loki:3100), not localhost.
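Instead of clicking through the UI, the data source can also be provisioned from a file — a sketch, assuming the file is mounted into the Grafana container under /etc/grafana/provisioning/datasources/:

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100   # container name, not localhost
    isDefault: true
```

Grafana picks this up on startup, so the data source survives container rebuilds.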

Installing and configuring Promtail

Download the Promtail build for your platform from the Loki releases page; unpack it and it runs as a standalone binary.

Taking Nginx logs as an example, configure promtail-config.yaml as follows:

# Promtail Server Config
server:
  http_listen_port: 9080
  grpc_listen_port: 0

# Positions
positions:
  filename: /tmp/positions.yaml

# Address of the Loki server
clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: nginx
    static_configs:
      - targets:
          - localhost
        labels:
          job: 'nginx-access'
          app: 'nginx-access'
          host: 'your_hostname'
          __path__: /var/log/nginx/*.access.log
    pipeline_stages:
      - match:
          selector: '{job="nginx-access"}'
          stages:
            - regex:
                expression: '^(?P<client_ip>[\w\.]+) - (?P<auth>[^ ]*) \[(?P<timestamp>.*)\] "(?P<verb>[^ ]*) ?(?P<request>[^ ]*)? ?(?P<protocol>[^ ]*)?" (?P<status>[\d]+) (?P<response>[\d]+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
            - labels:
                client_ip:
                auth:
                timestamp:
                verb:
                request:
                response:
                referer:
                agent:
                status:
            - timestamp:
                source: timestamp
                format: "02/Jan/2006:15:04:05 -0700"

Personally, while the grok patterns Logstash uses are powerful, they are also hard to master; Promtail's plain-regex approach is much simpler and covers most needs.
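A nice property of the plain-regex approach is that the expression can be sanity-checked outside Promtail. GNU grep's -P (PCRE) mode accepts the same (?P<name>...) named-group syntax, so you can test the pipeline regex against a sample log line (hypothetical values below) before deploying:

```shell
# A sample access-log line in the combined format the regex expects (made-up values)
line='203.0.113.7 - - [24/Oct/2020:13:55:36 +0800] "GET /index.html HTTP/1.1" 200 612 "-" "curl/7.68.0"'

# Exit status 0 means the pipeline regex matches the line
echo "$line" | grep -qP '^(?P<client_ip>[\w\.]+) - (?P<auth>[^ ]*) \[(?P<timestamp>.*)\] "(?P<verb>[^ ]*) ?(?P<request>[^ ]*)? ?(?P<protocol>[^ ]*)?" (?P<status>[\d]+) (?P<response>[\d]+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"' \
  && echo "regex matches"
```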

Run it under supervisor:

promtail -config.file=/path_to/promtail-config.yaml
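A minimal supervisord program section for this might look like the sketch below (the program name, log paths, and /path_to placeholders are all adjustable):

```ini
[program:promtail]
command=/path_to/promtail -config.file=/path_to/promtail-config.yaml
autostart=true
autorestart=true
stdout_logfile=/var/log/supervisor/promtail.out.log
stderr_logfile=/var/log/supervisor/promtail.err.log
```

With autorestart=true, supervisor brings Promtail back up if it crashes, and Promtail resumes from the offsets recorded in its positions file.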
