Deploying and Using the OnCall Platform

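Grafana OnCall is an open-source incident response and on-call scheduling tool from Grafana Labs. It helps teams manage and track incident handling, makes SRE teams more efficient, and speeds up resolution. It can automatically route alerts to the designated on-call team and ChatOps channels, and processes them according to predefined escalation policies, schedules, and notification preferences.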

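Most on-call platforms are paid products, and open-source options are few. Grafana OnCall is open source, but online material is scarce and the official documentation is not very detailed. I spent several days working it out and hit quite a few pitfalls along the way.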

Deployment

Deploying with docker-compose

docker-compose.yaml:

x-environment: &oncall-environment
  DATABASE_TYPE: sqlite3
  BROKER_TYPE: redis
  BASE_URL: $DOMAIN
  SECRET_KEY: $SECRET_KEY
  FEATURE_PROMETHEUS_EXPORTER_ENABLED: ${FEATURE_PROMETHEUS_EXPORTER_ENABLED:-false}
  PROMETHEUS_EXPORTER_SECRET: ${PROMETHEUS_EXPORTER_SECRET:-}
  REDIS_URI: redis://redis:6379/0
  DJANGO_SETTINGS_MODULE: settings.hobby
  CELERY_WORKER_QUEUE: "default,critical,long,slack,telegram,webhook,retry,celery,grafana"
  CELERY_WORKER_CONCURRENCY: "1"
  CELERY_WORKER_MAX_TASKS_PER_CHILD: "100"
  CELERY_WORKER_SHUTDOWN_INTERVAL: "65m"
  CELERY_WORKER_BEAT_ENABLED: "True"
  GRAFANA_API_URL: http://grafana:3000
  TZ: Asia/Shanghai

services:
  engine:
    image: docker-0.unsee.tech/grafana/oncall
    restart: always
    ports:
      - "8080:8080"
    command: sh -c "uwsgi --ini uwsgi.ini"
    environment: *oncall-environment
    volumes:
      - oncall_data:/var/lib/oncall
    depends_on:
      oncall_db_migration:
        condition: service_completed_successfully
      redis:
        condition: service_healthy

  celery:
    image: docker-0.unsee.tech/grafana/oncall
    restart: always
    command: sh -c "./celery_with_exporter.sh"
    environment: *oncall-environment
    volumes:
      - oncall_data:/var/lib/oncall
    depends_on:
      oncall_db_migration:
        condition: service_completed_successfully
      redis:
        condition: service_healthy

  oncall_db_migration:
    image: docker-0.unsee.tech/grafana/oncall
    command: python manage.py migrate --noinput
    environment: *oncall-environment
    volumes:
      - oncall_data:/var/lib/oncall
    depends_on:
      redis:
        condition: service_healthy

  redis:
    image: docker-0.unsee.tech/redis:5.0
    restart: always
    expose:
      - 6379
    volumes:
      - redis_data:/data
    deploy:
      resources:
        limits:
          memory: 500m
          cpus: "0.5"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      timeout: 5s
      interval: 5s
      retries: 10

#  prometheus:
#    image: prom/prometheus
#    hostname: prometheus
#    restart: always
#    ports:
#      - "9090:9090"
#    volumes:
#      - ./prometheus.yml:/etc/prometheus/prometheus.yml
#      - prometheus_data:/prometheus
#    profiles:
#      - with_prometheus

  grafana:
    image: "docker-0.unsee.tech/grafana/${GRAFANA_IMAGE:-grafana:latest}"
    restart: always
    ports:
      - "3000:3000"
    environment:
      GF_FEATURE_TOGGLES_ENABLE: externalServiceAccounts
      GF_SECURITY_ADMIN_USER: ${GRAFANA_USER:-admin}
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS: grafana-oncall-app
      GF_INSTALL_PLUGINS: grafana-oncall-app
      GF_AUTH_MANAGED_SERVICE_ACCOUNTS_ENABLED: true
      TZ: Asia/Shanghai
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana.ini:/etc/grafana/grafana.ini
    deploy:
      resources:
        limits:
          memory: 500m
          cpus: "0.5"
    profiles:
      - with_grafana
#    configs:
#      - source: grafana.ini
#        target: /etc/grafana/grafana.ini

volumes:
  grafana_data:
  prometheus_data:
  oncall_data:
  redis_data:

#configs:
#  grafana.ini:
#    content: |
#      [feature_toggles]
#      accessControlOnCall = false

Configure the environment variables in the .env file:

DOMAIN=http://10.168.2.236:8080
# If you want to use an existing Grafana, remove "with_grafana" below
# Add "with_prometheus" below to optionally enable a local Prometheus for OnCall metrics
# e.g. COMPOSE_PROFILES=with_grafana,with_prometheus
COMPOSE_PROFILES=with_grafana
# Set an authentication token for the Prometheus exporter metrics:
PROMETHEUS_EXPORTER_SECRET=my_random_prometheus_secret
# Make sure the /metrics endpoint is enabled:
FEATURE_PROMETHEUS_EXPORTER_ENABLED=True
SECRET_KEY=my_random_secret_must_be_more_than_32_characters_long
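The .env file above uses placeholder secrets, and the SECRET_KEY placeholder itself notes that the value must be longer than 32 characters. A minimal sketch for generating suitable random values follows; the helper name `generate_secret_key` is hypothetical and only the standard library is used.

```python
import secrets


def generate_secret_key(num_bytes: int = 48) -> str:
    """Return a URL-safe random string.

    48 random bytes encode to 64 characters, comfortably above the
    32-character minimum mentioned in the .env placeholder.
    """
    return secrets.token_urlsafe(num_bytes)


# Print lines that can be pasted into .env in place of the placeholders.
print(f"SECRET_KEY={generate_secret_key()}")
print(f"PROMETHEUS_EXPORTER_SECRET={generate_secret_key(24)}")
```

Paste the printed values into .env before the first `docker-compose up`, since the secret key should stay stable once the instance holds data.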

Grafana configuration file, grafana.ini:

[feature_toggles]
accessControlOnCall = false
 
[smtp]
enabled = true
host = smtp.exmail.qq.com:465
user = chenmingchang@keyfil.com
password = *************
from_address = chenmingchang@keyfil.com
from_name = chenmingchang

Start:

docker-compose pull && docker-compose up -d

After startup, enable and install the Grafana OnCall plugin:

curl -X POST 'http://admin:admin@localhost:3000/api/plugins/grafana-oncall-app/settings' -H "Content-Type: application/json" -d '{"enabled":true, "jsonData":{"stackId":5, "orgId":100, "onCallApiUrl":"http://engine:8080", "grafanaUrl":"http://grafana:3000"}}'
curl -X POST 'http://admin:admin@localhost:3000/api/plugins/grafana-oncall-app/resources/plugin/install'
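The first curl command above can also be expressed in Python, which makes the two URLs in the payload easier to see: `onCallApiUrl` is how Grafana reaches the engine, and `grafanaUrl` is how OnCall refers back to Grafana. This is only a sketch: the function name `build_settings_request` is hypothetical, the `stackId` and `orgId` values are copied from the command above, and the request is built but not sent.

```python
import base64
import json
import urllib.request


def build_settings_request(grafana_base, oncall_api_url, grafana_url,
                           user="admin", password="admin"):
    """Build (but do not send) the POST that enables the OnCall plugin."""
    payload = {
        "enabled": True,
        "jsonData": {
            "stackId": 5,    # value copied from the curl command
            "orgId": 100,    # value copied from the curl command
            "onCallApiUrl": oncall_api_url,
            "grafanaUrl": grafana_url,
        },
    }
    # Same credentials as the admin:admin@ part of the curl URL.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_base}/api/plugins/grafana-oncall-app/settings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


req = build_settings_request(
    "http://localhost:3000", "http://engine:8080", "http://grafana:3000"
)
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would send it; that step is left out so the sketch can be run without a live Grafana.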

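Configuring integrations:

SkyWalking alerts use the Webhook integration type; for Alertmanager, use the Alertmanager integration type directly:

Open an integration to see its Endpoint, then add it to the SkyWalking and Alertmanager configuration files:

SkyWalking:

Append the following to the end of alarm-settings.yml:

webhooks:
  - http://10.168.2.236:8080/integrations/v1/webhook/0DIaZCp09dNem7g4P30PSR4KO/

Alertmanager:

alertmanager.yaml: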

global:
  resolve_timeout: "5m"
templates:
   - '/etc/alertmanager/config/*.tmpl'
inhibit_rules:
- equal:
  - "namespace"
  - "alertname"
  source_match:
    severity: "critical"
  target_match_re:
    severity: "warning|info"
- equal:
  - "namespace"
  - "alertname"
  source_match:
    severity: "warning"
  target_match_re:
    severity: "info"
receivers:
- name: "default"
  webhook_configs:
  - send_resolved: true
    url: "http://10.168.2.236:3000"
- name: "oncall"
  webhook_configs:
  - send_resolved: true
    url: "http://10.168.2.236:8080/integrations/v1/alertmanager/tDZlvyPGfD2Uk2HvtlX0mS8XR/"
route:
  group_by:
  - "namespace"
  - "alertname"
  - "env"
  - "instance"
  - "type"
  - "group"
  - "job"
  - "cluster"
  - "app"
  group_interval: "10m"
  group_wait: "30s"
  receiver: "default"
  repeat_interval: "10m"
  routes:
  - match_re:
      severity: "warning|critical"
    receiver: "oncall"
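To make the routing in the configuration above concrete, here is a small Python model of how the single child route selects a receiver. It is a simplified sketch, not Alertmanager's implementation: it assumes regex matchers are fully anchored and that a missing label is treated as an empty string, and the names `ROUTES` and `pick_receiver` are hypothetical.

```python
import re

# Child routes from the config above: (label -> regex, receiver name).
ROUTES = [
    ({"severity": "warning|critical"}, "oncall"),
]
DEFAULT_RECEIVER = "default"


def pick_receiver(labels: dict) -> str:
    """Return the receiver of the first child route whose matchers all match."""
    for matchers, receiver in ROUTES:
        if all(
            # fullmatch models the anchored regex; a missing label is "".
            re.fullmatch(pattern, labels.get(name, ""))
            for name, pattern in matchers.items()
        ):
            return receiver
    # No child route matched, so the top-level receiver applies.
    return DEFAULT_RECEIVER


print(pick_receiver({"severity": "critical"}))
print(pick_receiver({"severity": "info"}))
```

With this config, only alerts labelled `severity: warning` or `severity: critical` reach the OnCall endpoint; everything else goes to the `default` receiver.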

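After configuring the integrations, configure the escalation chains:

Steps executed after an alert fires:

Different escalation chains can be referenced from the integration details:

Configuring outgoing webhooks:

Send to the corresponding template on the PrometheusAlert platform, which then forwards to WeCom:

Template content:

SkyWalking to Grafana OnCall to WeCom:

Configuring the on-call schedule:

Configuring the sending mailbox:

When an alert fires, the steps configured in the escalation chain are executed:

Add the metrics export to the Prometheus configuration file: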

- job_name: prometheus
  metrics_path: /metrics/
  authorization:
    credentials: my_random_prometheus_secret
  static_configs:
    - targets: ["10.168.2.236:8080"]
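Before relying on the scrape job, it can help to request the metrics endpoint by hand. With the `authorization.credentials` setting above, Prometheus sends the secret as a Bearer token, so the sketch below builds the same header. The function names are hypothetical, and `fetch_metrics` is defined but not called here because it needs a running OnCall engine.

```python
import urllib.request


def build_metrics_request(base_url: str, secret: str) -> urllib.request.Request:
    """Build a GET request for the OnCall /metrics/ endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/metrics/",
        headers={"Authorization": f"Bearer {secret}"},
    )


def fetch_metrics(base_url: str, secret: str, timeout: float = 10) -> str:
    """Send the request and return the exposition text (needs a live engine)."""
    request = build_metrics_request(base_url, secret)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


req = build_metrics_request("http://10.168.2.236:8080", "my_random_prometheus_secret")
print(req.full_url)
```

If `fetch_metrics` fails with an HTTP error, the usual suspects are a secret that does not match PROMETHEUS_EXPORTER_SECRET or FEATURE_PROMETHEUS_EXPORTER_ENABLED not being set in .env.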

Deploying on Kubernetes

Helm must be installed first.

Add the repository:

helm repo add grafana https://grafana.github.io/helm-charts

Install:

helm install \
    --wait \
    --set base_url=grafana-oncall.keyfil.com \
    --set grafana."grafana\.ini".server.domain=grafana-oncall.keyfil.com \
    release-oncall \
    grafana/oncall -n oncall -f values.yaml

Custom values.yaml, adjust as needed:

# Values for configuring the deployment of Grafana OnCall
 
# Set the domain name Grafana OnCall will be installed on.
# If you want to install grafana as a part of this release make sure to configure grafana.grafana.ini.server.domain too
base_url: grafana-oncall.qifu.com
base_url_protocol: http
 
## Optionally specify an array of imagePullSecrets.
## Secrets must be manually created in the namespace.
## ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
## e.g:
## imagePullSecrets:
##   - name: myRegistryKeySecretName
imagePullSecrets: []
 
image:
  # Grafana OnCall docker image repository
  repository: docker-0.unsee.tech/grafana/oncall
  tag:
  pullPolicy: IfNotPresent
 
# Whether to create additional service for external connections
# ClusterIP service is always created
service:
  enabled: false
  type: LoadBalancer
  port: 8080
  annotations: {}
 
# Engine pods configuration
engine:
  replicaCount: 1
  resources:
    {}
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
    # requests:
    #   cpu: 100m
    #   memory: 128Mi
 
  # Labels for engine pods
  podLabels: {}
 
  ## Deployment update strategy
  ## ref: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
  updateStrategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
    type: RollingUpdate
 
  ## Affinity for pod assignment
  ## ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
  affinity: {}
 
  ## Node labels for pod assignment
  ## ref: https://kubernetes.io/docs/user-guide/node-selection/
  nodeSelector: {}
 
  ## Tolerations for pod assignment
  ## ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
  tolerations: []
 
  ## Topology spread constraints for pod assignment
  ## ref: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/
  topologySpreadConstraints: []
 
  ## Priority class for the pods
  ## ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/
  priorityClassName: ""
 
  # Extra containers which runs as sidecar
  extraContainers: ""
  # extraContainers: |
  # - name: cloud-sql-proxy
  #   image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.1.2
  #   args:
  #     - --private-ip
  #     - --port=5432
  #     - example:europe-west3:grafana-oncall-db
 
  # Extra volume mounts for the main app container
  extraVolumeMounts: []
  # - mountPath: /mnt/postgres-tls
  #   name: postgres-tls
  # - mountPath: /mnt/redis-tls
  #   name: redis-tls
 
  # Extra volumes for the pod
  extraVolumes: []
  # - name: postgres-tls
  #   configMap:
  #     name: my-postgres-tls
  #     defaultMode: 0640
  # - name: redis-tls
  #   configMap:
  #     name: my-redis-tls
  #     defaultMode: 0640
 
detached_integrations_service:
  enabled: false
  type: LoadBalancer
  port: 8080
  annotations: {}
 
# Integrations pods configuration
detached_integrations:
  enabled: false
  replicaCount: 1
  resources:
    {}
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
    # requests:
    #   cpu: 100m
    #   memory: 128Mi
 
  ## Deployment update strategy
  ## ref: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
  updateStrategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
    type: RollingUpdate
 
  ## Affinity for pod assignment
  ## ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
  affinity: {}
 
  ## Node labels for pod assignment
  ## ref: https://kubernetes.io/docs/user-guide/node-selection/
  nodeSelector: {}
 
  ## Tolerations for pod assignment
  ## ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
  tolerations: []
 
  ## Topology spread constraints for pod assignment
  ## ref: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/
  topologySpreadConstraints: []
 
  ## Priority class for the pods
  ## ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/
  priorityClassName: ""
 
  # Extra containers which runs as sidecar
  extraContainers: ""
  # extraContainers: |
  # - name: cloud-sql-proxy
  #   image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.1.2
  #   args:
  #     - --private-ip
  #     - --port=5432
  #     - example:europe-west3:grafana-oncall-db
 
  # Extra volume mounts for the container
  extraVolumeMounts: []
  # - mountPath: /mnt/postgres-tls
  #   name: postgres-tls
  # - mountPath: /mnt/redis-tls
  #   name: redis-tls
 
  # Extra volumes for the pod
  extraVolumes: []
  # - name: postgres-tls
  #   configMap:
  #     name: my-postgres-tls
  #     defaultMode: 0640
  # - name: redis-tls
  #   configMap:
  #     name: my-redis-tls
  #     defaultMode: 0640
 
# Celery workers pods configuration
celery:
  replicaCount: 1
  worker_queue: "default,critical,long,slack,telegram,webhook,celery,grafana,retry"
  worker_concurrency: "1"
  worker_max_tasks_per_child: "100"
  worker_beat_enabled: "True"
  ## Restart of the celery workers once in a given interval as an additional precaution to the probes
  ## If this setting is enabled TERM signal will be sent to celery workers
  ## It will lead to warm shutdown (waiting for the tasks to complete) and restart the container
  ## If this setting is set numbers of pod restarts will increase
  ## Comment this line out if you want to remove restarts
  worker_shutdown_interval: "65m"
  livenessProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 300
    timeoutSeconds: 10
  resources:
    {}
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
    # requests:
    #   cpu: 100m
    #   memory: 128Mi
 
  # Labels for celery pods
  podLabels: {}
 
  ## Affinity for pod assignment
  ## ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
  affinity: {}
 
  ## Node labels for pod assignment
  ## ref: https://kubernetes.io/docs/user-guide/node-selection/
  nodeSelector: {}
 
  ## Tolerations for pod assignment
  ## ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
  tolerations: []
 
  ## Topology spread constraints for pod assignment
  ## ref: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/
  topologySpreadConstraints: []
 
  ## Priority class for the pods
  ## ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/
  priorityClassName: ""
 
  # Extra containers which runs as sidecar
  extraContainers: ""
  # extraContainers: |
  # - name: cloud-sql-proxy
  #   image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.1.2
  #   args:
  #     - --private-ip
  #     - --port=5432
  #     - example:europe-west3:grafana-oncall-db
 
  # Extra volume mounts for the main container
  extraVolumeMounts: []
  # - mountPath: /mnt/postgres-tls
  #   name: postgres-tls
  # - mountPath: /mnt/redis-tls
  #   name: redis-tls
 
  # Extra volumes for the pod
  extraVolumes: []
  # - name: postgres-tls
  #   configMap:
  #     name: my-postgres-tls
  #     defaultMode: 0640
  # - name: redis-tls
  #   configMap:
  #     name: my-redis-tls
  #     defaultMode: 0640
 
# Telegram polling pod configuration
telegramPolling:
  enabled: false
  resources:
    {}
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
    # requests:
    #   cpu: 100m
    #   memory: 128Mi
 
  # Labels for telegram-polling pods
  podLabels: {}
 
  # Extra volume mounts for the main container
  extraVolumeMounts: []
  # - mountPath: /mnt/postgres-tls
  #   name: postgres-tls
  # - mountPath: /mnt/redis-tls
  #   name: redis-tls
 
  # Extra volumes for the pod
  extraVolumes: []
  # - name: postgres-tls
  #   configMap:
  #     name: my-postgres-tls
  #     defaultMode: 0640
  # - name: redis-tls
  #   configMap:
  #     name: my-redis-tls
  #     defaultMode: 0640
 
oncall:
  # this is intended to be used for local development. In short, it will mount the ./engine dir into
  # any backend related containers, to allow hot-reloading + also run the containers with slightly modified
  # startup commands (which configures the hot-reloading)
  devMode: false
 
  # Override default MIRAGE_CIPHER_IV (must be 16 bytes long)
  # For existing installation, this should not be changed.
  # mirageCipherIV: 1234567890abcdef
  # oncall secrets
  secrets:
    # Use existing secret. (secretKey and mirageSecretKey is required)
    existingSecret: ""
    # The key in the secret containing secret key
    secretKey: ""
    # The key in the secret containing mirage secret key
    mirageSecretKey: ""
  # Slack configures the Grafana Oncall Slack ChatOps integration.
  slack:
    # Enable the Slack ChatOps integration for the Oncall Engine.
    enabled: false
    # clientId configures the Slack app OAuth2 client ID.
    # api.slack.com/apps/<yourApp> -> Basic Information -> App Credentials -> Client ID
    clientId: ~
    # clientSecret configures the Slack app OAuth2 client secret.
    # api.slack.com/apps/<yourApp> -> Basic Information -> App Credentials -> Client Secret
    clientSecret: ~
    # signingSecret - configures the Slack app signature secret used to sign
    # requests coming from Slack.
    # api.slack.com/apps/<yourApp> -> Basic Information -> App Credentials -> Signing Secret
    signingSecret: ~
    # Use existing secret for clientId, clientSecret and signingSecret.
    # clientIdKey, clientSecretKey and signingSecretKey are required
    existingSecret: ""
    # The key in the secret containing OAuth2 client ID
    clientIdKey: ""
    # The key in the secret containing OAuth2 client secret
    clientSecretKey: ""
    # The key in the secret containing the Slack app signature secret
    signingSecretKey: ""
    # OnCall external URL
    redirectHost: ~
  telegram:
    enabled: false
    token: ~
    webhookUrl: ~
    # Use existing secret. (tokenKey is required)
    existingSecret: ""
    # The key in the secret containing Telegram token
    tokenKey: ""
  smtp:
    enabled: true
    host: ~
    port: ~
    username: ~
    password: ~
    tls: ~
    ssl: ~
    fromEmail: ~
  exporter:
    enabled: false
    authToken: ~
  twilio:
    # Twilio account SID/username to allow OnCall to send SMSes and make phone calls
    accountSid: ""
    # Twilio password to allow OnCall to send SMSes and make calls
    authToken: ""
    # Number from which you will receive calls and SMS
    # (NOTE: must be quoted, otherwise would be rendered as float value)
    phoneNumber: ""
    # SID of Twilio service for number verification. You can create a service in Twilio web interface.
    # twilio.com -> verify -> create new service
    verifySid: ""
    # Twilio API key SID/username to allow OnCall to send SMSes and make phone calls
    apiKeySid: ""
    # Twilio API key secret/password to allow OnCall to send SMSes and make phone calls
    apiKeySecret: ""
    # Use existing secret for authToken, phoneNumber, verifySid, apiKeySid and apiKeySecret.
    existingSecret: ""
    # Twilio password to allow OnCall to send SMSes and make calls
    # The key in the secret containing the auth token
    authTokenKey: ""
    # The key in the secret containing the phone number
    phoneNumberKey: ""
    # The key in the secret containing verify service sid
    verifySidKey: ""
    # The key in the secret containing api key sid
    apiKeySidKey: ""
    # The key in the secret containing the api key secret
    apiKeySecretKey: ""
    # Phone notifications limit (the only non-secret value).
    # TODO: rename to phoneNotificationLimit
    limitPhone:
 
# Whether to run django database migrations automatically
migrate:
  enabled: true
  # TTL can be unset by setting ttlSecondsAfterFinished: ""
  ttlSecondsAfterFinished: 20
  # use a helm hook to manage the migration job
  useHook: false
  annotations: {}
 
  ## Affinity for pod assignment
  ## ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
  affinity: {}
 
  ## Node labels for pod assignment
  ## ref: https://kubernetes.io/docs/user-guide/node-selection/
  nodeSelector: {}
 
  ## Tolerations for pod assignment
  ## ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
  tolerations: []
 
  # Extra containers which runs as sidecar
  extraContainers: ""
  # extraContainers: |
  # - name: cloud-sql-proxy
  #   image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.1.2
  #   args:
  #     - --private-ip
  #     - --port=5432
  #     - example:europe-west3:grafana-oncall-db
  resources:
    {}
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
    # requests:
    #   cpu: 100m
    #   memory: 128Mi
 
  # Extra volume mounts for the main container
  extraVolumeMounts: []
  # - mountPath: /mnt/postgres-tls
  #   name: postgres-tls
  # - mountPath: /mnt/redis-tls
  #   name: redis-tls
 
  # Extra volumes for the pod
  extraVolumes: []
  # - name: postgres-tls
  #   configMap:
  #     name: my-postgres-tls
  #     defaultMode: 0640
  # - name: redis-tls
  #   configMap:
  #     name: my-redis-tls
  #     defaultMode: 0640
 
# Sets environment variables with name capitalized and prefixed with UWSGI_,
# and dashes are substituted with underscores.
# see more: https://uwsgi-docs.readthedocs.io/en/latest/Configuration.html#environment-variables
# Set null to disable all UWSGI environment variables
uwsgi:
  listen: 128
 
# Additional env variables to add to deployments
env: {}
 
# Enable ingress object for external access to the resources
ingress:
  enabled: true
  #  className: ""
  annotations:
    kubernetes.io/ingress.class: "nginx"
  #  cert-manager.io/issuer: "letsencrypt-prod"
  tls:
    - hosts:
        - "{{ .Values.base_url }}"
      secretName: certificate-tls
  # Extra paths to prepend to the host configuration. If using something
  # like an ALB ingress controller, you may want to configure SSL redirects
  extraPaths: []
  # - path: /*
  #   backend:
  #     serviceName: ssl-redirect
  #     servicePort: use-annotation
  ## Or for k8s > 1.19
  # - path: /*
  #   pathType: Prefix
  #   backend:
  #     service:
  #       name: ssl-redirect
  #       port:
  #         name: use-annotation
 
# Whether to install ingress controller
ingress-nginx:
  enabled: false
 
# Install cert-manager as a part of the release
cert-manager:
  enabled: false
  # Install CRD resources
  installCRDs: true
  webhook:
    timeoutSeconds: 30
    # cert-manager tries to use the already used port, changing to another one
    # https://github.com/cert-manager/cert-manager/issues/3237
    # https://cert-manager.io/docs/installation/compatibility/
    securePort: 10260
  # Fix self-checks https://github.com/jetstack/cert-manager/issues/4286
  podDnsPolicy: None
  podDnsConfig:
    nameservers:
      - 8.8.8.8
      - 1.1.1.1
 
database:
  # can be either mysql or postgresql
  type: mysql
 
# MySQL is included into this release for the convenience.
# It is recommended to host it separately from this release
# Set mariadb.enabled = false and configure externalMysql
mariadb:
  enabled: true
  image:
    repository: docker-0.unsee.tech/bitnami/mariadb
    tag: 10.11.4-debian-11-r0
  auth:
    database: oncall
    existingSecret:
  primary:
    extraEnvVars:
      - name: MARIADB_COLLATE
        value: utf8mb4_unicode_ci
      - name: MARIADB_CHARACTER_SET
        value: utf8mb4
  secondary:
    extraEnvVars:
      - name: MARIADB_COLLATE
        value: utf8mb4_unicode_ci
      - name: MARIADB_CHARACTER_SET
        value: utf8mb4
 
# Make sure to create the database with the following parameters:
# CREATE DATABASE oncall CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
externalMysql:
  host:
  port:
  db_name:
  user:
  password:
  # Use an existing secret for the mysql password.
  existingSecret:
  # The key in the secret containing the mysql username
  usernameKey:
  # The key in the secret containing the mysql password
  passwordKey:
  # Extra options (see example below)
  # Reference: https://pymysql.readthedocs.io/en/latest/modules/connections.html
  options:
  # options: >-
  #   ssl_verify_cert=true
  #   ssl_verify_identity=true
  #   ssl_ca=/mnt/mysql-tls/ca.crt
  #   ssl_cert=/mnt/mysql-tls/client.crt
  #   ssl_key=/mnt/mysql-tls/client.key
 
# PostgreSQL is included into this release for the convenience.
# It is recommended to host it separately from this release
# Set postgresql.enabled = false and configure externalPostgresql
postgresql:
  enabled: false
  auth:
    database: oncall
    existingSecret:
 
# Make sure to create the database with the following parameters:
# CREATE DATABASE oncall WITH ENCODING UTF8;
externalPostgresql:
  host:
  port:
  db_name:
  user:
  password:
  # Use an existing secret for the database password
  existingSecret:
  # The key in the secret containing the database password
  passwordKey:
  # Extra options (see example below)
  # Reference: https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PARAMKEYWORDS
  options:
  # options: >-
  #   sslmode=verify-full
  #   sslrootcert=/mnt/postgres-tls/ca.crt
  #   sslcert=/mnt/postgres-tls/client.crt
  #   sslkey=/mnt/postgres-tls/client.key
 
# RabbitMQ is included into this release for the convenience.
# It is recommended to host it separately from this release
# Set rabbitmq.enabled = false and configure externalRabbitmq
rabbitmq:
  enabled: true
  image:
    repository: docker-0.unsee.tech/bitnami/rabbitmq
    tag: 3.12.0-debian-11-r0
  auth:
    existingPasswordSecret:
 
broker:
  type: rabbitmq
 
externalRabbitmq:
  host:
  port:
  user:
  password:
  protocol:
  vhost:
  # Use an existing secret for the rabbitmq password
  existingSecret:
  # The key in the secret containing the rabbitmq password
  passwordKey: ""
  # The key in the secret containing the rabbitmq username
  usernameKey: username
 
# Redis is included into this release for the convenience.
# It is recommended to host it separately from this release
redis:
  enabled: true
  image:
    repository: docker-0.unsee.tech/bitnami/redis
    tag: 6.2.7-debian-11-r11
  auth:
    existingSecret:
 
externalRedis:
  protocol:
  host:
  port:
  database:
  username:
  password:
  # Use an existing secret for the redis password
  existingSecret:
  # The key in the secret containing the redis password
  passwordKey:
 
  # SSL options
  ssl_options:
    enabled: false
    # CA certificate
    ca_certs:
    # Client SSL certs
    certfile:
    keyfile:
    # SSL verification mode: "cert_none" | "cert_optional" | "cert_required"
    cert_reqs:
 
# Grafana is included into this release for the convenience.
# It is recommended to host it separately from this release
grafana:
  enabled: true
  grafana.ini:
    server:
      domain: grafana-oncall.qifu.com
      root_url: "%(protocol)s://%(domain)s/grafana/"
      serve_from_sub_path: true
    feature_toggles:
      enable: externalServiceAccounts
      accessControlOnCall: false
  env:
    GF_AUTH_MANAGED_SERVICE_ACCOUNTS_ENABLED: true
  persistence:
    enabled: true
  # Disable psp as PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
  rbac:
    pspEnabled: false
  plugins:
    - grafana-oncall-app
  extraVolumes:
    - name: provisioning
      configMap:
        name: helm-testing-grafana-plugin-provisioning
  extraVolumeMounts:
    - name: provisioning
      mountPath: /etc/grafana/provisioning/plugins/grafana-oncall-app-provisioning.yaml
      subPath: grafana-oncall-app-provisioning.yaml
 
externalGrafana:
  # Example: https://grafana.mydomain.com
  url:
 
nameOverride: ""
fullnameOverride: ""
 
serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""
 
podAnnotations: {}
 
podSecurityContext:
  {}
  # fsGroup: 2000
 
securityContext:
  {}
  # capabilities:
  #   drop:
  #   - ALL
  # readOnlyRootFilesystem: true
  # runAsNonRoot: true
  # runAsGroup: 2000
  # runAsUser: 1000
 
init:
  securityContext:
    {}
    # allowPrivilegeEscalation: false
    # capabilities:
    #   drop:
    #   - ALL
    # privileged: false
    # readOnlyRootFilesystem: true
    # runAsGroup: 2000
    # runAsNonRoot: true
    # runAsUser: 1000
  resources:
    {}
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
    # requests:
    #   cpu: 100m
    #   memory: 128Mi
 
ui:
  # this is intended to be used for local development. In short, it will spin up an additional container
  # running the plugin frontend, such that hot reloading can be enabled
  enabled: false
  image:
    repository: oncall/ui
    tag: dev
  # Additional env vars for the ui container
  env: {}
 
prometheus:
  enabled: false
  # extraScrapeConfigs: |
  #   - job_name: 'oncall-exporter'
  #     metrics_path: /metrics/
  #     static_configs:
  #       - targets:
  #         - oncall-dev-engine.default.svc.cluster.local:8080

Problems encountered

OnCall template fails to parse the payload

After configuration, SkyWalking sends a payload like this:

[
    {
        "scopeId": 1,
        "scope": "SERVICE",
        "name": "qifu-saas-gateway-test",
        "id0": "cWlmdS1zYWFzLWdhdGV3YXktdGVzdA==.1",
        "id1": "",
        "ruleName": "service_resp_time_percentile_rule",
        "alarmMessage": "Response time percentile of service qifu-saas-gateway-test exceeded 1 second in the last 3 minutes",
        "tags": [],
        "startTime": 1744275620189
    }
]

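This causes the template to fail with a parsing error:

So I wrote a small relay program in Python that reformats the raw SkyWalking alert payload and converts the alert timestamp to CST. SkyWalking sends alerts to the relay, which reformats them and forwards them to Grafana OnCall.

skywalking-tra.py: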

# -*- coding: utf-8 -*-
from flask import Flask, request
import requests
from datetime import datetime, timezone, timedelta

app = Flask(__name__)

# Endpoint of the Grafana OnCall integration:
TARGET_URL = "http://10.168.2.236:8080/integrations/v1/webhook/0DIaZCp09dNem7g4P30PSR4KO/"

def convert_timestamp_to_cst(timestamp_ms):
    """Convert a millisecond timestamp to a China Standard Time (UTC+8) string."""
    try:
        # Convert to seconds (keeping the fractional part)
        timestamp_sec = timestamp_ms / 1000.0
        # Build the UTC+8 time zone
        cst_timezone = timezone(timedelta(hours=8))
        # Create the datetime in UTC, then convert it to UTC+8
        dt = datetime.fromtimestamp(timestamp_sec, tz=timezone.utc).astimezone(cst_timezone)
        # Format as a string (e.g. 2024-01-01 12:34:56)
        return dt.strftime("%Y-%m-%d %H:%M:%S")
    except Exception as e:
        print(f"Timestamp conversion failed: {str(e)}")
        return None

@app.route('/alert', methods=['POST'])
def handle_alert():
    # SkyWalking posts a JSON array of alert objects
    original_data = request.json

    # Walk through every alert and add the converted time
    for alert in original_data:
        timestamp_ms = alert.get("startTime")
        if timestamp_ms:
            formatted_time = convert_timestamp_to_cst(timestamp_ms)
            if formatted_time:
                # Add a new field (the original timestamp is kept)
                alert["startTimeUTC8"] = formatted_time

    # Wrap the array in an object; the OnCall templates reference this "alters" key
    converted_data = {"alters": original_data}

    try:
        response = requests.post(
            TARGET_URL,
            json=converted_data,
            headers={'Content-Type': 'application/json'},
            timeout=10
        )
        response.raise_for_status()
        return "Forward success", 200
    except Exception as e:
        print(f"Forwarding failed: {str(e)}")
        return "Forward failed", 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
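The transformation performed by the relay can be checked without starting Flask or contacting OnCall. The sketch below reproduces only the reshaping step as pure functions; the names `to_cst_string` and `wrap_alerts` are hypothetical and not part of the script above. The underlying issue appears to be that SkyWalking posts a top-level JSON array, so the relay wraps it in an object under the `alters` key.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset, the same offset the relay script uses.
CST = timezone(timedelta(hours=8))


def to_cst_string(timestamp_ms: int) -> str:
    """Format a millisecond epoch timestamp as a UTC+8 wall-clock string."""
    dt = datetime.fromtimestamp(timestamp_ms / 1000.0, tz=CST)
    return dt.strftime("%Y-%m-%d %H:%M:%S")


def wrap_alerts(raw_alerts: list) -> dict:
    """Copy each alert, add startTimeUTC8, and wrap the list in an object."""
    enriched = []
    for alert in raw_alerts:
        item = dict(alert)  # copy so the caller's data is not modified
        if item.get("startTime"):
            item["startTimeUTC8"] = to_cst_string(item["startTime"])
        enriched.append(item)
    return {"alters": enriched}


sample = [{"name": "qifu-saas-gateway-test", "startTime": 1744275620189}]
print(wrap_alerts(sample))
```

For the sample payload shown earlier, the added `startTimeUTC8` field comes out as `2025-04-10 17:00:20`.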

Install the dependencies (datetime is part of the Python standard library and does not need to be installed):

pip3 install flask requests

Run:

python3 skywalking-tra.py

Change the webhook in the SkyWalking configuration file to point at the relay program:

alarm-settings.yml:

webhooks:
  - http://10.168.2.126:5000/alert

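Restart SkyWalking for the change to take effect.

Grafana OnCall can now parse the payload:

Because the payload format changed, the corresponding PrometheusAlert template must be updated as well; otherwise its fields will not be found and sending will fail.

SkyWalking to relay to Grafana OnCall to WeCom:

Wrong URL

The links in alert emails used the URL configured on the server, which cannot be opened from a local machine:

Once Grafana OnCall has an externally reachable address, change the URL to that external address: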

curl -X POST 'http://admin:admin@localhost:3000/api/plugins/grafana-oncall-app/settings' -H "Content-Type: application/json" -d '{"enabled":true, "jsonData":{"stackId":5, "orgId":100, "onCallApiUrl":"http://engine:8080", "grafanaUrl":"https://oncall.example.com"}}'
 
curl -X POST 'http://admin:admin@localhost:3000/api/plugins/grafana-oncall-app/resources/plugin/install'

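Customization

Simplifying and tidying alert content

We send alerts to on-call staff by email, and the content looks like this:

Some of the fields are not needed, so we can use a template to trim and tidy the alert content:

The template is configured as follows:

The result:

Routing alerts by service name

You can now create separate teams, create an escalation chain for each team, and then configure routes on the integration so that each alert goes to the matching escalation chain:

For example, with my template, any alert whose name field under payload.alters contains qifu-saas-cbl-application, qifu-saas-gateway, or qifu-saas-tms is sent to the ops escalation chain: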
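The routing rule described above is configured as a template in the OnCall UI, and the template itself is not reproduced in the text. As a rough illustration of the logic only, the same condition can be written as a Python predicate; the names `OPS_SERVICES` and `routes_to_ops` are hypothetical.

```python
# Service-name fragments that should go to the ops escalation chain.
OPS_SERVICES = (
    "qifu-saas-cbl-application",
    "qifu-saas-gateway",
    "qifu-saas-tms",
)


def routes_to_ops(payload: dict) -> bool:
    """Return True if any alert name contains one of the ops service names."""
    return any(
        service in alert.get("name", "")
        for alert in payload.get("alters", [])
        for service in OPS_SERVICES
    )


print(routes_to_ops({"alters": [{"name": "qifu-saas-gateway-test"}]}))
print(routes_to_ops({"alters": [{"name": "qifu-saas-bc"}]}))
```

Note that this is a substring match, so `qifu-saas-gateway-test` also matches the `qifu-saas-gateway` entry.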

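Result after configuration:

Splitting alert groups

By default, SkyWalking bundles several alert entries into a single alert message:

Normally this is fine, but here we need to distinguish by service name so that each person receives only their own alerts. If the entries are merged into one message, only one of them may be my responsibility, yet I see all three. So the webhook relay is changed to split the message so that each alert carries exactly one entry:

skywalking-tra.py: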

# -*- coding: utf-8 -*-
from flask import Flask, request
import requests
from datetime import datetime, timezone, timedelta

app = Flask(__name__)

TARGET_URL = "http://172.28.81.143:8080/integrations/v1/webhook/crzkWCbF2KNGLQmEsLd8X47Pp/"

def convert_timestamp_to_cst(timestamp_ms):
    """Convert a millisecond timestamp to a China Standard Time (UTC+8) string."""
    try:
        timestamp_sec = timestamp_ms / 1000.0
        cst_timezone = timezone(timedelta(hours=8))
        dt = datetime.fromtimestamp(timestamp_sec, tz=timezone.utc).astimezone(cst_timezone)
        return dt.strftime("%Y-%m-%d %H:%M:%S")
    except Exception as e:
        print(f"Timestamp conversion failed: {str(e)}")
        return None

@app.route('/alert', methods=['POST'])
def handle_alert():
    # Read the raw alert data (note: SkyWalking sends a JSON array, [...])
    original_data = request.json
    # The array of alerts
    alerts = original_data  # key change: work on the array directly

    # Walk through the alerts and forward them one by one
    for alert in alerts:
        # Add the converted time
        timestamp_ms = alert.get("startTime")
        if timestamp_ms:
            formatted_time = convert_timestamp_to_cst(timestamp_ms)
            if formatted_time:
                alert["startTimeUTC8"] = formatted_time  # modify the single alert object in place

        # Build the request body for one alert (same structure, but alters holds only this alert)
        single_alert_payload = {"alters": [alert]}  # key change: wrap the single alert in an array

        try:
            # Forward the single alert
            response = requests.post(
                TARGET_URL,
                json=single_alert_payload,  # send one alert per request
                headers={'Content-Type': 'application/json'},
                timeout=10
            )
            response.raise_for_status()
            print(f"Alert forwarded: {alert.get('name')}")
        except Exception as e:
            print(f"Alert forwarding failed ({alert.get('name')}): {str(e)}")
            # Keep processing the remaining alerts (no return, continue the loop)

    return "All alerts processed", 200  # always report success, even if some alerts failed

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Test whether the alerts are sent separately:

curl -X POST http://localhost:5000/alert -H "Content-Type: application/json" -d '[
  {"alarmMessage": "test alert 3", "startTime": 1744869502595},
  {"alarmMessage": "test alert 4", "startTime": 1744869502595}
]'
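The splitting step can also be checked without a running server. A minimal sketch, assuming the step is pulled out into a helper; split_alerts is a hypothetical name and not part of the script above:

```python
def split_alerts(batch):
    """Wrap each alarm of a Skywalking batch in its own {"alters": [...]} payload."""
    return [{"alters": [alert]} for alert in batch]

batch = [
    {"alarmMessage": "test alert 3", "startTime": 1744869502595},
    {"alarmMessage": "test alert 4", "startTime": 1744869502595},
]
payloads = split_alerts(batch)
print(len(payloads))  # 2
```

Each resulting payload keeps the alters key the OnCall templates already expect, so only the number of alarms per request changes.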

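Improving endpoint-type alerts

The OnCall platform routes each alert to the responsible person based on the value of the name field in the Skywalking alarm payload. For ordinary alerts this works: the routing template we wrote matches them to the right escalation chain:

An endpoint-type alert, however, contains two service names:

In this case the qifu-saas-gateway service called an endpoint of the qifu-saas-bc service and the response took too long. The alert should go to whoever is responsible for qifu-saas-bc, so the webhook needs another change:

The idea: add a new routeName field. If name does not contain " to " (with the surrounding spaces), routeName equals name. If it does, split name on " in " and take the last part:

skywalking-tra.py: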

# -*- coding: utf-8 -*-
from flask import Flask, request
import requests
from datetime import datetime, timezone, timedelta

app = Flask(__name__)

TARGET_URL = "http://172.28.81.143:8080/integrations/v1/webhook/crzkWCbF2KNGLQmEsLd8X47Pp/"

def convert_timestamp_to_cst(timestamp_ms):
    """Convert a millisecond timestamp to a China Standard Time (UTC+8) string."""
    try:
        timestamp_sec = timestamp_ms / 1000.0
        cst_timezone = timezone(timedelta(hours=8))
        dt = datetime.fromtimestamp(timestamp_sec, tz=timezone.utc).astimezone(cst_timezone)
        return dt.strftime("%Y-%m-%d %H:%M:%S")
    except Exception as e:
        print(f"Time conversion failed: {str(e)}")
        return None

@app.route('/alert', methods=['POST'])
def handle_alert():
    # Read the original alarm data (note: Skywalking sends its alarms as an array, [...])
    original_data = request.json
    # The request body itself is the array of alarms
    alerts = original_data

    # Iterate over the alarms and forward them one by one
    for alert in alerts:
        # Add the converted time
        timestamp_ms = alert.get("startTime")
        if timestamp_ms:
            formatted_time = convert_timestamp_to_cst(timestamp_ms)
            if formatted_time:
                alert["startTimeUTC8"] = formatted_time  # modify the single alarm object in place

        # ========================= New logic: add the routeName field =========================
        name = alert.get("name", "")
        route_name = name  # default
        if " to " in name:  # contains " to " (note the surrounding spaces)
            # Split the string (example: A in B to C in D -> ["A", "B to C", "D"])
            parts = name.split(" in ")
            route_name = parts[-1].strip()  # take the last part (e.g. "D")
        # Add the field to the alarm data
        alert["routeName"] = route_name
        # ========================= End of new logic =========================

        # Build the request body for one alarm (same structure as before, but alters holds only this alarm)
        single_alert_payload = {"alters": [alert]}  # key change: wrap the single alarm in an array

        try:
            # Forward the single alarm
            response = requests.post(
                TARGET_URL,
                json=single_alert_payload,
                headers={'Content-Type': 'application/json'},
                timeout=10
            )
            response.raise_for_status()
            print(f"Alert forwarded: {alert.get('name')}")
        except Exception as e:
            print(f"Alert forwarding failed ({alert.get('name')}): {str(e)}")

    return "All alerts processed", 200  # always report success, even if some forwards failed

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Then update the routing template to route on routeName:
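As a rough Python illustration of what routing on routeName amounts to, the sketch below derives the field with the same rule as the relay and looks it up in a table. The sample alert name, the chain names, and the helper names are all hypothetical; the real routing is done by the OnCall template.

```python
def derive_route_name(name):
    """Mirror the relay's rule: for "X in A to Y in B", keep the part after the last " in "."""
    if " to " in name:
        return name.split(" in ")[-1].strip()
    return name

# Hypothetical mapping from service name to escalation chain
ESCALATION_CHAINS = {"qifu-saas-bc": "bc-team", "qifu-saas-gateway": "ops"}

def pick_chain(alert, default="default"):
    """Look up the escalation chain for an alert that already carries routeName."""
    return ESCALATION_CHAINS.get(alert.get("routeName", ""), default)

# Hypothetical endpoint-type alert name with two service names in it
sample = "/api/order in qifu-saas-gateway to /api/stock in qifu-saas-bc"
alert = {"name": sample, "routeName": derive_route_name(sample)}
print(alert["routeName"])  # qifu-saas-bc
print(pick_chain(alert))   # bc-team
```

Since routeName holds a single clean service name, an exact lookup is enough here, whereas matching on name needed a substring test.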

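Suppressing Graylog log alerts

Graylog fires an alert whenever an ERROR log appears, but the same alert is often triggered over and over, producing a flood of notifications. To deal with this, a relay webhook builds an alert ID from the app and namespace fields plus the class name and line number found in the message field. If the same alert ID appears again within ten minutes, the alert is not sent again: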

# -*- coding: utf-8 -*-
from flask import Flask, request, jsonify
import re
import threading
import requests
from expiringdict import ExpiringDict

app = Flask(__name__)

# Target Grafana address
GRAFANA_WEBHOOK_URL = "http://172.28.81.143:8080/integrations/v1/webhook/766yLECYaHhZj9kO9bfR6GlVV/"

# Cache settings: entries expire after 10 minutes, at most 1000 keys (to bound memory use)
ALERT_CACHE = ExpiringDict(max_len=1000, max_age_seconds=600)

# Lock object
alert_lock = threading.Lock()

def extract_class_name(message):
    """
    Extract the Java class name plus line number from a log message (e.g. com.xxx.xxx:line)
    Matched format: [com.xxx.xxx:line]
    """
    match = re.search(r'\[([a-zA-Z0-9_.]+):(\d+)\]', message)
    if match:
        return f"{match.group(1)}:{match.group(2)}"  # format: class:line
    else:
        return "unknown:0"  # fallback value

def generate_alert_id(payload):
    # Defaults are set before the try block so the fallback below can always use them
    app_name = "unknown_app"
    namespace = "unknown_namespace"
    try:
        backlog = payload.get("backlog") or [{}]
        fields = backlog[0].get("fields", {})
        app_name = fields.get("app", "unknown_app")
        namespace = fields.get("namespace", "unknown_namespace")
        message = backlog[0].get("message", "")
        class_name = extract_class_name(message)
        return f"{app_name}_{namespace}_{class_name}"
    except Exception:
        # Guard against unexpected crashes: return a fallback identifier
        return f"{app_name}_{namespace}_error"

@app.route('/webhook', methods=['POST'])
def handle_webhook():
    try:
        payload = request.json
        alert_id = generate_alert_id(payload)

        # Hold the lock for the whole check-and-forward sequence
        with alert_lock:
            # Check whether the ID is already cached and not yet expired
            if alert_id in ALERT_CACHE:
                print(f"Alert suppressed: {alert_id}")
                return jsonify({"status": "suppressed"}), 200

            # Forward to Grafana
            response = requests.post(GRAFANA_WEBHOOK_URL, json=payload, timeout=10)
            if response.status_code == 200:
                ALERT_CACHE[alert_id] = True  # record in the cache
                print(f"Alert forwarded: {alert_id}")
                return jsonify({"status": "forwarded"}), 200
            else:
                return jsonify({"error": "forward failed"}), 500

    except Exception as e:
        print(f"Handling error: {str(e)}")
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001, debug=False)
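The suppression window itself is easy to reason about in isolation. The following is a minimal standard-library sketch of the same idea; it swaps the expiringdict cache for a plain dict with timestamps, and the class name SuppressionWindow, its parameters, and the sample alert ID are assumptions made for illustration. Unlike the script above, it records an ID as soon as it is seen rather than after a successful forward, and it is not thread-safe on its own.

```python
import time

class SuppressionWindow:
    """Remember alert IDs for ttl_seconds and report whether an ID is a repeat."""

    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock   # injectable so the window can be tested without sleeping
        self._seen = {}      # alert_id -> time it was first let through

    def should_forward(self, alert_id):
        now = self.clock()
        # Drop entries that have aged out of the window
        self._seen = {k: t for k, t in self._seen.items() if now - t < self.ttl_seconds}
        if alert_id in self._seen:
            return False
        self._seen[alert_id] = now
        return True

# Drive the window with a fake clock instead of waiting ten minutes
fake_now = [0.0]
window = SuppressionWindow(ttl_seconds=600, clock=lambda: fake_now[0])
print(window.should_forward("app_ns_com.example.Foo:42"))  # True
fake_now[0] = 300
print(window.should_forward("app_ns_com.example.Foo:42"))  # False
fake_now[0] = 601
print(window.should_forward("app_ns_com.example.Foo:42"))  # True
```

The window is measured from the first occurrence, so a repeat inside the ten minutes does not extend the suppression.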

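Note:

Some ERROR logs are very long. That is fine for email alerts, but WeCom alerts have a length limit (4096 bytes), and overly long messages are hard to read anyway. So the relay webhook can also truncate the message: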

# -*- coding: utf-8 -*-
from flask import Flask, request, jsonify
import re
import threading
import requests
from expiringdict import ExpiringDict

app = Flask(__name__)

# Target Grafana address
GRAFANA_WEBHOOK_URL = "http://172.28.81.143:8080/integrations/v1/webhook/766yLECYaHhZj9kO9bfR6GlVV/"

# Cache settings: entries expire after 10 minutes, at most 1000 keys (to bound memory use)
ALERT_CACHE = ExpiringDict(max_len=1000, max_age_seconds=600)

# Lock object
alert_lock = threading.Lock()

def extract_class_name(message):
    """
    Extract the Java class name plus line number from a log message (e.g. com.xxx.xxx:line)
    Matched format: [com.xxx.xxx:line]
    """
    match = re.search(r'\[([a-zA-Z0-9_.]+):(\d+)\]', message)
    if match:
        return f"{match.group(1)}:{match.group(2)}"  # format: class:line
    else:
        return "unknown:0"  # fallback value

def generate_alert_id(payload):
    # Defaults are set before the try block so the fallback below can always use them
    app_name = "unknown_app"
    namespace = "unknown_namespace"
    try:
        backlog = payload.get("backlog") or [{}]
        fields = backlog[0].get("fields", {})
        app_name = fields.get("app", "unknown_app")
        namespace = fields.get("namespace", "unknown_namespace")
        message = backlog[0].get("message", "")
        class_name = extract_class_name(message)
        return f"{app_name}_{namespace}_{class_name}"
    except Exception:
        # Guard against unexpected crashes: return a fallback identifier
        return f"{app_name}_{namespace}_error"

def truncate_message(message, max_length=300):
    """
    Truncate the message so that it does not exceed the maximum length
    """
    if len(message) > max_length:
        return message[:max_length] + f"...\n[message truncated, total length:{len(message)} > limit:{max_length}]"
    return message


@app.route('/webhook', methods=['POST'])
def handle_webhook():
    try:
        payload = request.json
        alert_id = generate_alert_id(payload)

        # Truncate the message field of the first backlog entry
        backlog = payload.get("backlog")
        if backlog:
            backlog[0]['message'] = truncate_message(backlog[0].get("message", ""))

        # Hold the lock for the whole check-and-forward sequence
        with alert_lock:
            # Check whether the ID is already cached and not yet expired
            if alert_id in ALERT_CACHE:
                print(f"Alert suppressed: {alert_id}")
                return jsonify({"status": "suppressed"}), 200

            # Forward to Grafana
            response = requests.post(GRAFANA_WEBHOOK_URL, json=payload, timeout=10)
            if response.status_code == 200:
                ALERT_CACHE[alert_id] = True  # record in the cache
                print(f"Alert forwarded: {alert_id}")
                return jsonify({"status": "forwarded"}), 200
            else:
                return jsonify({"error": "forward failed"}), 500

    except Exception as e:
        print(f"Handling error: {str(e)}")
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001, debug=False)
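The script above truncates by character count, while the limit quoted earlier is in bytes, and a Chinese character takes three bytes in UTF-8. If the cut ever needs to be made against a byte budget instead, a sketch along these lines could be used. The function name truncate_utf8 and its parameters are hypothetical, and the 4096 default is simply the figure quoted above; other fields in the final message also count toward the real limit.

```python
def truncate_utf8(message, max_bytes=4096, suffix="..."):
    """Cut message so that its UTF-8 encoding, suffix included, fits in max_bytes.

    Assumes max_bytes is larger than the encoded suffix.
    """
    encoded = message.encode("utf-8")
    if len(encoded) <= max_bytes:
        return message
    budget = max_bytes - len(suffix.encode("utf-8"))
    # errors="ignore" drops a multi-byte character that the cut split in half
    return encoded[:budget].decode("utf-8", errors="ignore") + suffix

print(truncate_utf8("错" * 10, max_bytes=10))  # 错错...
```

Slicing the encoded bytes and decoding with errors="ignore" avoids emitting half of a multi-byte character at the cut point.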

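Improving the time range of the jump link

Previously, clicking the link in a Graylog alert opened a search covering the five minutes before the current time. If you saw the alert more than five minutes late, the link no longer found the matching log and you had to adjust the time range by hand. So the webhook is reworked again to generate two new fields, set to one minute before and one minute after the alert time: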

# -*- coding: utf-8 -*-
from flask import Flask, request, jsonify
import re
import threading
import requests
from datetime import datetime, timedelta
from expiringdict import ExpiringDict

app = Flask(__name__)

# Target Grafana address
GRAFANA_WEBHOOK_URL = "http://172.28.81.143:8080/integrations/v1/webhook/766yLECYaHhZj9kO9bfR6GlVV/"

# Cache settings: entries expire after 10 minutes, at most 1000 keys (to bound memory use)
ALERT_CACHE = ExpiringDict(max_len=1000, max_age_seconds=600)

# Lock object
alert_lock = threading.Lock()

def extract_class_name(message):
    """
    Extract the Java class name plus line number from a log message (e.g. com.xxx.xxx:line)
    Matched format: [com.xxx.xxx:line]
    """
    match = re.search(r'\[([a-zA-Z0-9_.]+):(\d+)\]', message)
    if match:
        return f"{match.group(1)}:{match.group(2)}"  # format: class:line
    else:
        return "unknown:0"  # fallback value

def generate_alert_id(payload):
    # Defaults are set before the try block so the fallback below can always use them
    app_name = "unknown_app"
    namespace = "unknown_namespace"
    try:
        backlog = payload.get("backlog") or [{}]
        fields = backlog[0].get("fields", {})
        app_name = fields.get("app", "unknown_app")
        namespace = fields.get("namespace", "unknown_namespace")
        message = backlog[0].get("message", "")
        class_name = extract_class_name(message)
        return f"{app_name}_{namespace}_{class_name}"
    except Exception:
        # Guard against unexpected crashes: return a fallback identifier
        return f"{app_name}_{namespace}_error"


def process_timestamp(timestamp_str):
    """Build a range of one minute before and after the given timestamp."""
    try:
        dt = datetime.strptime(timestamp_str, "%Y-%m-%dT%H:%M:%S.%fZ")

        # Build the time range
        before = dt - timedelta(minutes=1)
        after = dt + timedelta(minutes=1)

        # URL-encode the time format
        def encode_time(t):
            return t.strftime("%Y-%m-%dT%H:%M:%S.%fZ").replace(":", "%3A")

        return {
            "before": encode_time(before)[:-4] + "Z",  # keep 3 digits of milliseconds
            "after": encode_time(after)[:-4] + "Z"
        }
    except Exception as e:
        print(f"Time processing failed: {str(e)}")
        return {"before": "", "after": ""}


def truncate_message(message, max_length=500):
    """
    Truncate the message so that it does not exceed the maximum length
    """
    if len(message) > max_length:
        return message[:max_length] + f"...\n[message truncated, total length:{len(message)} > limit:{max_length}]"
    return message


def extract_description(message):
    """Extract the description field."""
    try:
        match = re.search(r'\[[a-zA-Z0-9_.]+:\d+\]\s*-\s*(.*)', message)
        if match:
            return match.group(1)[:180]  # keep the first 180 characters
        else:
            return "Could not extract a description"
    except Exception as e:
        return f"Failed to extract description: {e}"


@app.route('/webhook', methods=['POST'])
def handle_webhook():
    try:
        payload = request.json
        # Process every log entry
        if 'backlog' in payload:
            for log_entry in payload['backlog']:
                # Make sure fields exists before the steps below write to it
                if not log_entry.get('fields'):
                    log_entry['fields'] = {}

                # 1. Truncate the message
                if 'message' in log_entry:
                    log_entry['message'] = truncate_message(log_entry['message'])

                # 2. Add the time-range fields
                if 'timestamp' in log_entry:
                    time_range = process_timestamp(log_entry['timestamp'])
                    log_entry['fields'].update({
                        "before_timestamp": time_range['before'],
                        "after_timestamp": time_range['after']
                    })

                # 3. Add the description field
                if 'message' in log_entry:
                    log_entry['fields']['description'] = extract_description(log_entry['message'])

        alert_id = generate_alert_id(payload)

        # Hold the lock for the whole check-and-forward sequence
        with alert_lock:
            # Check whether the ID is already cached and not yet expired
            if alert_id in ALERT_CACHE:
                print(f"Alert suppressed: {alert_id}")
                return jsonify({"status": "suppressed"}), 200

            # Forward to Grafana
            response = requests.post(GRAFANA_WEBHOOK_URL, json=payload, timeout=10)
            if response.status_code == 200:
                ALERT_CACHE[alert_id] = True  # record in the cache
                print(f"Alert forwarded: {alert_id}")
                return jsonify({"status": "forwarded"}), 200
            else:
                return jsonify({"error": "forward failed"}), 500

    except Exception as e:
        print(f"Handling error: {str(e)}")
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001, debug=False)
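The time-window step can be tried on its own. A minimal sketch, assuming the same timestamp format the script parses; the function name time_window and the sample timestamp are hypothetical, and urllib.parse.quote is used here in place of the manual ":" replacement.

```python
from datetime import datetime, timedelta
from urllib.parse import quote

def time_window(timestamp_str, minutes=1):
    """Return URL-encoded (before, after) strings around a UTC timestamp."""
    dt = datetime.strptime(timestamp_str, "%Y-%m-%dT%H:%M:%S.%fZ")

    def encode(t):
        # Trim to millisecond precision, then percent-encode for use in a query string
        return quote(t.strftime("%Y-%m-%dT%H:%M:%S.%f")[:23] + "Z", safe="")

    delta = timedelta(minutes=minutes)
    return encode(dt - delta), encode(dt + delta)

before, after = time_window("2025-04-17T05:58:22.595Z")
print(before)  # 2025-04-17T05%3A57%3A22.595Z
print(after)   # 2025-04-17T05%3A59%3A22.595Z
```

Working on datetime objects rather than on strings means a window that crosses midnight or a month boundary is handled without extra code.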

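MySQL slow-log alerts from Graylog

Alerts were added for the MySQL slow logs collected by Graylog. Sent directly, the payload has the same problems: the message is too long, and after some time the alert link no longer finds the matching log. So a relay webhook also reshapes this payload into the format we need before sending it to the OnCall platform. It adds the fields for one minute before and after the alert time, truncates long messages, and handles the ### sequences in slow logs that Markdown would otherwise render as headings: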

# -*- coding: utf-8 -*-
from flask import Flask, request, jsonify
import requests
from datetime import datetime, timedelta
import html  # for HTML escaping (optional)

app = Flask(__name__)
TARGET_URL = "http://172.28.81.143:8080/integrations/v1/webhook/766yLECYaHhZj9kO9bfR6GlVV/"

def process_timestamps(payload):
    """Add time-range fields to every backlog entry in the payload."""
    for entry in payload.get('backlog', []):
        fields = entry.setdefault('fields', {})
        if 'timestamp' in entry:
            try:
                original_time = datetime.strptime(entry['timestamp'], "%Y-%m-%dT%H:%M:%S.%fZ")
                before_time = original_time - timedelta(minutes=1)
                after_time = original_time + timedelta(minutes=1)
                def format_graylog_time(t):
                    # Trim to 3-digit milliseconds first, then URL-encode the colons
                    return (t.strftime("%Y-%m-%dT%H:%M:%S.%f")[:23] + "Z").replace(":", "%3A")
                fields.update({
                    "before_timestamp": format_graylog_time(before_time),
                    "after_timestamp": format_graylog_time(after_time)
                })
            except Exception as e:
                print(f"Time processing failed: {str(e)}")
                fields.update({
                    "before_timestamp": "invalid_time",
                    "after_timestamp": "invalid_time"
                })

@app.route('/graylog-webhook', methods=['POST'])
def graylog_handler():
    try:
        payload = request.json
        process_timestamps(payload)

        # Process the message field: escape special characters and truncate
        for entry in payload.get('backlog', []):
            if 'message' in entry:
                # Escape # so that it is not parsed as Markdown
                message = entry['message'].replace('#', '\\#')
                # Optional: HTML escaping (depends on what the alerting system supports)
                # message = html.escape(message)
                # Cut off everything beyond 500 characters
                if len(message) > 500:
                    message = message[:500] + "..."
                entry['message'] = message

        resp = requests.post(TARGET_URL, json=payload, timeout=5)
        return jsonify({"status": "forwarded", "code": resp.status_code}), resp.status_code

    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5002)
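The script escapes every # in the message. A narrower variant escapes only the runs of # that start a line, which is where Markdown heading syntax applies, and leaves other # characters (for example inside SQL text) untouched. This is a sketch only: the function name escape_markdown_headings is hypothetical, and whether it is sufficient depends on how the receiving channel renders Markdown.

```python
import re

def escape_markdown_headings(message):
    """Put a backslash before a run of '#' at the start of a line."""
    return re.sub(
        r"(?m)^(\s*)(#+)",
        lambda m: m.group(1) + "\\" + m.group(2),
        message,
    )

text = "### slow query\nSELECT 1 # comment"
print(escape_markdown_headings(text))
```

In this example only the leading ### gains a backslash; the # inside the second line is left as it is.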