Verified commit 8771aa66, authored by anarcat

donate: change alerting thresholds for card testing

We have had a couple of incidents recently where card testing occurred
but these alerts never fired.

I believe the thresholds were not set correctly. First, they were too
tolerant. Second, combining `for` with a short-window `increase()`
query meant that we never saw load sustained for long enough to
trigger the alert.

So, instead, we sample over a longer period (one hour) and remove the
`for` threshold entirely.
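
To illustrate the reasoning above, here is a minimal sketch in Python,
not part of this commit. It assumes a hypothetical burst of 6 failed
transactions per minute lasting 10 minutes, one evaluation per minute,
and a simplified `increase()` that ignores Prometheus extrapolation and
counter resets. The thresholds are the old and new vendor rules from
the diff.

```python
from itertools import accumulate


def increase(counter, t, window):
    """Simplified increase(): counter growth over the last `window` minutes."""
    start = max(t - window, 0)
    return counter[t] - counter[start]


def first_firing_minute(counter, window, threshold, for_minutes=0):
    """Return the first minute at which the alert fires, or None.

    The alert fires once the condition has held continuously for
    `for_minutes`, which mimics the `for` clause of an alerting rule.
    """
    held_since = None
    for t in range(len(counter)):
        if increase(counter, t, window) > threshold:
            if held_since is None:
                held_since = t
            if t - held_since >= for_minutes:
                return t
        else:
            held_since = None  # condition broke, pending state resets
    return None


# Hypothetical burst: 6 failures per minute during minutes 60 to 69.
increments = [6 if 60 <= t < 70 else 0 for t in range(180)]
counter = list(accumulate(increments))

# Old vendor rule: increase(...[5m]) > 10 with `for: 15m`.
old_rule = first_firing_minute(counter, window=5, threshold=10, for_minutes=15)
# New vendor rule: increase(...[1h]) > 20 with no `for`.
new_rule = first_firing_minute(counter, window=60, threshold=20)

print(old_rule, new_rule)
```

In this sketch the old rule stays pending for about 11 minutes and then
resets, so it never fires, while the new rule fires a few minutes into
the burst.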

For vendors, we still check only failed transactions: counting all
vendor transactions together would frequently cross that threshold,
and that's normal.
parent a60d307c
+19 −16
@@ -19,35 +19,38 @@ groups:
        minutes, of type {{ $labels.type }}.
      playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/donate#errors-and-exceptions"

  - alert: DonateHighFailureRate
    expr: sum(increase(donate_transaction_count_total{namespace="prod",status="Failed"}[10m])) > 10
    for: 1h
  - alert: DonateHighTransactionRate
    expr: sum(increase(donate_transaction_count_total{namespace="prod"}[1h])) > 50
    labels:
      severity: warning
    annotations:
      summary: "Unusually high failure rate on donate.torproject.org"
      summary: "Unusual transaction rate on donate.torproject.org"
      description: |
        More than 1 failed transaction per minute for the last hour
        has been detected on the production donate.torproject.org
        site. Last increase per 10 minutes is {{ $value }} and Stripe
        charges us 0.02$ per failed transaction.
        More than 50 transactions per hour have been detected on the
        production donate.torproject.org site. Last increase per hour
        is {{ $value | humanize }} and Stripe charges us 0.02$ per transaction.

        Those are transactions that have successfully passed the
        payment vendor and have been confirmed in CiviCRM.

        This might be card testing and should be investigated
        promptly.
      playbook: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/donate#stripe-card-testing

  - alert: DonateHighVendorFailureRate
    expr: sum(increase(donate_vendor_transaction_count_total{namespace="prod",status="failure"}[5m])) > 10
    for: 15m
  - alert: DonateHighVendorRate
    expr: sum(increase(donate_vendor_transaction_count_total{namespace="prod",status="failure"}[1h])) > 20
    labels:
      severity: warning
    annotations:
      summary: "Unusually high vendor failure rate on donate.torproject.org"
      summary: "Unusually high vendor transaction rate on donate.torproject.org"
      description: |
        More than 1 failed transaction per minute for the last hour
        has been detected on the production donate.torproject.org
        site. Last increase per 10 minutes is {{ $value }} and Stripe
        charges us 0.02$ per failed transaction.
        More than 20 failed transactions per hour have been detected on
        the production donate.torproject.org site. Last increase per hour
        is {{ $value | humanize }} and Stripe charges us 0.02$ per failed
        transaction.

        This is at the stage where we validate the transaction with
        the vendor (Stripe, Paypal, etc).
        
        This might be card testing and should be investigated
        promptly.
+12 −10
@@ -37,26 +37,28 @@ tests:
    input_series:
      # problems should be notified
      - series: 'donate_transaction_count_total{alias="donate-01.torproject.org",app="donate-neo",instance="donate.torproject.org:443",job="donate_neo",namespace="prod",status="Failed",team="TPA",type="recurring"}'
        values: '0+10x61'
        values: '0+50x61'
      # error rate not sustained for 1h
      - series: 'donate_transaction_count_total{alias="donate-01.torproject.org",app="donate-nonexistent",instance="donate.torproject.org:443",job="donate_neo",namespace="prod",status="Failed",team="TPA",type="recurring"}'
        values: '0+2x10 21x10 22+2x41'
        values: '0+2x10 50x10 22+2x41'
      # would trigger, but not in namespace prod
      - series: 'donate_transaction_count_total{alias="donate-01.torproject.org",app="donate-neo-staging",instance="donate.torproject.org:443",job="donate_neo",namespace="staging",status="Failed",team="TPA",type="recurring"}'
        values: '0+2x61'
        values: '0+50x61'
    alert_rule_test:
      - eval_time: 1h1m
        alertname: DonateHighFailureRate
        alertname: DonateHighTransactionRate
        exp_alerts:
          - exp_labels:
              severity: warning
            exp_annotations:
              summary: "Unusually high failure rate on donate.torproject.org"
              summary: "Unusual transaction rate on donate.torproject.org"
              description: |
                More than 1 failed transaction per minute for the last hour
                has been detected on the production donate.torproject.org
                site. Last increase per 10 minutes is 120 and Stripe
                charges us 0.02$ per failed transaction.
                More than 50 transactions per hour have been detected on the
                production donate.torproject.org site. Last increase per hour
                is 3.148k and Stripe charges us 0.02$ per transaction.
        
                Those are transactions that have successfully passed the
                payment vendor and have been confirmed in CiviCRM.
        
                This might be card testing and should be investigated
                promptly.
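
The `values` strings in the test input series use promtool's expanding
notation, where `a+bxn` means "start at `a`, then add `b` for `n` more
samples" and `axn` repeats `a` for `n+1` samples. As a reading aid, here
is a small Python sketch of that expansion. The `expand_series` helper
is hypothetical and not part of promtool, and it only handles the forms
used in this test file (no negative steps, `_` or `stale` markers).

```python
def expand_series(notation):
    """Expand a promtool-style series string into a list of sample values."""
    values = []
    for token in notation.split():
        if "x" not in token:
            # A plain number is a single sample.
            values.append(float(token))
            continue
        head, repeats = token.split("x")
        start, _, step = head.partition("+")
        start = float(start)
        step = float(step) if step else 0.0  # 'axn' is shorthand for 'a+0xn'
        # 'a+bxn' yields n+1 samples: a, a+b, ..., a+n*b.
        values.extend(start + i * step for i in range(int(repeats) + 1))
    return values


steady = expand_series("0+50x61")
bursty = expand_series("0+2x10 50x10 22+2x41")
print(len(steady), steady[-1])
print(len(bursty), bursty[10], bursty[11], bursty[22])
```

With one sample per evaluation interval, `0+50x61` is a counter that
grows by 50 at every sample, which is why the first series is well above
the new threshold of 50 per hour at `eval_time: 1h1m`.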