6-Month Report — October 2025 to March 2026
| Endpoint | Total Calls | Errors | Success Rate | Avg Response (s) | Monthly Trend (Oct–Mar) |
|---|---|---|---|---|---|

| # | Club | State | Errors | First Error | Last Error |
|---|---|---|---|---|---|
240 time validation errors from a single entity (Clifton Beach Lifeguard). Anomalous spike caused by misconfigured patrol time submissions that triggered validation failures repeatedly.
Resolved by January 2026
Unindexed database queries caused server overload during peak patrol period (816 + 827 concurrent patrols). Cascading 502/503 errors across multiple endpoints until emergency indexing was applied.
Resolved in v2.13.0

| Priority | Action Item | Expected Impact |
|---|---|---|
| HIGH | Continue database query optimization and indexing strategy. Monitor slow query log and add indexes proactively for new features. | Prevent repeat of Jan 17-18 outage pattern |
| HIGH | Implement rate limiting and circuit breakers for /sso/auth endpoint (59.2% success rate is critically low). | Reduce auth cascade failures, improve user experience |
| HIGH | Add automated alerting for error rate spikes above 3% on any endpoint within a 15-minute window. | Faster incident detection and response times |
| MED | Review /patrol-log/create validation logic. Success rate declining from 91.5% to 80.6% over 6 months. | Arrest declining patrol log creation reliability |
| MED | Implement client-side validation for "signed on elsewhere" and "duplicate patrol log" errors to reduce unnecessary API calls. | Reduce 40% of position-related errors at source |
| MED | Add structured error tracking per club to enable proactive outreach for clubs with persistent issues. | Reduce support burden, improve club satisfaction |
| LOW | Investigate 502/524 timeout errors — consider increasing upstream timeouts or adding request queuing for heavy operations. | Reduce intermittent gateway errors during peak load |
| LOW | Implement API versioning strategy to support backward-compatible changes without breaking existing integrations. | Safer deployments, reduced regression risk |
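
The alerting item in the table above (error rate above 3% on any endpoint within a 15-minute window) can be sketched as a sliding-window counter. This is a minimal illustration, not the production monitor: the class name `ErrorRateAlert` and the `min_calls` floor are assumptions added so that a single failure on a quiet endpoint does not trigger an alert.

```python
from collections import deque


class ErrorRateAlert:
    """Sliding-window error-rate check for one endpoint (hypothetical helper)."""

    def __init__(self, threshold=0.03, window_s=15 * 60, min_calls=20):
        self.threshold = threshold  # 3% from the action item
        self.window_s = window_s    # 15-minute window from the action item
        self.min_calls = min_calls  # assumed floor, not from the report
        self.events = deque()       # (timestamp_s, is_error) pairs

    def record(self, ts, is_error):
        """Record one call; return True if the window's error rate exceeds the threshold."""
        self.events.append((ts, is_error))
        # Drop calls that have fallen out of the window.
        while self.events and self.events[0][0] <= ts - self.window_s:
            self.events.popleft()
        total = len(self.events)
        errors = sum(1 for _, failed in self.events if failed)
        return total >= self.min_calls and errors / total > self.threshold
```

One instance per endpoint keeps a noisy endpoint such as /sso/auth from masking a spike elsewhere.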
Error rate per hour across all 6 months. Note: times are server time (AEST/Australia Sydney).
Total API errors per day of week across all 6 months.
When do sign-on and sign-off errors occur? This reveals the operational rhythm.
api:patrols, api:entities, api:members all run between 7–9pm. The /service-status-details endpoint alone accounts for 4,125 of the 4,857 errors at 7pm, all returning "No available logClubs found" (404) — expected for patrols that haven't been signed on yet.
Using user_id presence to distinguish user-triggered vs system/cron errors.
| Source | Endpoint | Code | Errors | User Impact |
|---|---|---|---|---|
| User-Triggered: users see these errors | | | | |
| User | /patrol-log-members | 400 | 886 | Position limit / signed elsewhere / time error |
| User | /patrol-log/create | 403 | 637 | "Patrol log already exists" — confusing for user |
| User | /sign-on-lifeguard | 400 | 292 | "Primary contact required" — form incomplete |
| User | /incident | 400 | 235 | Field too long / undefined injury — form error |
| User | /patrol-log-members | 404 | 223 | "Member not found" — stale data shown to user |
| User | /patrol-team | 400 | 132 | Time conflict / duplicate patrol time |
| User | /patrol-log/create | 524 | 100 | Timeout — user sees spinner then generic error |
| User | /sign-on-club | 400 | 85 | "Already signed on" — double-click |
| Cron/System: users never see these | | | | |
| Cron | /sso/auth | 404 | 4,602 | Silent — member sync login attempts |
| Cron | /service-status-details | 404 | 1,328 | Silent — checking unsigned patrols |
| Cron | /sso/auth | 403 | 150 | Silent — password change needed |
| Cron | /sign-off-lifeguard | 400 | 114 | Silent — auto sign-off attempts |
| Cron | /patrol-log/check-exists | 403 | 47 | Silent — patrol log validation |
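
The split in the table above relies on whether an error record carries a `user_id`. A minimal sketch of that classification follows; the record layout (`endpoint` and `status` keys) is an assumption for illustration, and only the `user_id` test comes from the report.

```python
from collections import Counter


def error_source(log_entry):
    """Return "User" when a user_id is attached to the error, otherwise "Cron"."""
    return "User" if log_entry.get("user_id") is not None else "Cron"


def count_by_source(log_entries):
    """Count errors per (source, endpoint, status), mirroring the table's rows."""
    return Counter(
        (error_source(entry), entry["endpoint"], entry["status"])
        for entry in log_entries
    )
```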
When each error pattern was first and last observed. Span shows how long the issue has persisted. Green = resolved, Red = still active.
| Error Pattern | Endpoint | Code | Count | First Seen | Last Seen | Span | Status |
|---|---|---|---|---|---|---|---|
| Position limit: 3 | /patrol-log-members | 400 | 669 | 2025-10-03 | 2026-02-28 | 148d | ACTIVE |
| Primary contact validation | /sign-on-lifeguard | 400 | 548 | 2025-09-30 | 2026-02-26 | 149d | ACTIVE |
| Signed on elsewhere | /patrol-log-members | 400 | 519 | 2025-09-30 | 2026-02-28 | 151d | ACTIVE |
| Already signed on (club) | /sign-on-club | 400 | 402 | 2025-10-03 | 2026-02-28 | 148d | ACTIVE |
| Time validation | /patrol-log-members | 400 | 401 | 2025-10-04 | 2026-01-21 | 109d | FIXED |
| Already signed off (LG) | /sign-off-lifeguard | 400 | 287 | 2025-10-02 | 2026-02-28 | 149d | ACTIVE |
| Invalid unitId | /unit-status-details | 400 | 235 | 2025-11-01 | 2026-02-27 | 118d | ACTIVE |
| Location too long | /incident | 400 | 218 | 2025-11-04 | 2026-02-16 | 104d | ACTIVE |
| Invalid token (SSO verify) | /sso/verify | 404 | 186 | 2025-10-01 | 2025-11-18 | 48d | GONE |
| 524 Cloudflare Timeout | /patrol-log/create | 524 | 168 | 2026-01-10 | 2026-02-21 | 42d | NEW |
| Duplicate patrol time | /patrol-team | 400 | 165 | 2025-12-19 | 2026-02-22 | 65d | NEW |
| Undefined injury nature | /incident | 400 | 162 | 2025-11-01 | 2026-02-26 | 117d | BUG |
| Invalid beach combo | /sign-on-club | 400 | 38 | 2025-12-06 | 2025-12-14 | 8d | GONE |
| 524 Cloudflare on patrol-team | /patrol-team | 524 | 16 | 2026-01-10 | 2026-01-10 | 1d | 1-DAY |
| TDS Protocol Error | /sso/auth | 500 | 22 | 2025-11-26 | 2025-11-26 | 1d | 1-DAY |
| DB Unavailable (SLS-HUB-PROD) | /sso/auth | 500 | 1 | 2025-12-07 | 2025-12-07 | 1d | 1-DAY |
| Server Shutdown | /radio-logs | 500 | 1 | 2025-11-10 | 2025-11-10 | 1d | 1-DAY |
Client sent invalid data. Most are preventable with frontend validation.
| Endpoint | Actual API Response Body | Count | Fix |
|---|---|---|---|
| /patrol-log-members | {"statusCode":400,"message":"Maximum patrol position has reached its limit: 3"} | 669 | Enforce limit in UI |
| /sign-on-lifeguard | {"statusCode":400,"message":["Primary contact must be one of: Radio, Mobile, Landline, SMR, or None.","Primary contact is required."]} | 548 | Fix dropdown |
| /patrol-log-members | {"statusCode":400,"message":"This member is currently signed on to another patrol (XXXXX)"} | 519 | Check before add |
| /sign-on-club | {"statusCode":400,"message":"Club is already signed on. Please sign off."} | 402 | Check state |
| /patrol-log-members | {"statusCode":400,"message":["FinishTime is either missing or not after StartTime","StartTime is either missing or not before FinishTime"]} | 401 | Validate times |
| /sign-off-lifeguard | {"statusCode":400,"message":"ServiceId is already Signed Off."} | 287 | Disable re-click |
| /sign-off-club | {"statusCode":400,"message":"ServiceId is already Signed Off."} | 239 | Disable re-click |
| /unit-status-details | {"statusCode":400,"message":"Invalid unitId not type Unit"} | 235 | Validate unit type |
| /incident | {"statusCode":400,"message":["location up to 50 characters only"]} | 218 | maxlength=50 |
| /incident | {"statusCode":400,"message":"undefined is invalid injury nature id"} | 162 | JS BUG |
| /sign-on-support | {"statusCode":400,"message":["Odometer must be a valid integer."]} | 114 | Validate number |
| /patrol-log-members | {"statusCode":400,"message":"Member ID XXXXX does not exist for substitution."} | 111 | Validate member |
| /sign-on-lifeguard | {"statusCode":400,"message":"Lifeguard is already signed on. Please sign off."} | 77 | Check state |
| /incident | {"statusCode":400,"message":["firstSightedBy up to 60 characters only"]} | 65 | maxlength=60 |
| /update-support | {"statusCode":400,"message":"Arrival time must be after the current datetime"} | 65 | Validate datetime |
| /radio-logs | {"statusCode":400,"message":["orgIds must be a number, comma-separated integers"]} | 59 | Fix param format |
| /patrol-log-members | {"statusCode":400,"message":"Maximum patrol position has reached its limit: 1"} | 54 | Enforce in UI |
| /rescues/add | {"statusCode":400,"message":["postalPostCode must be an integer","must be a positive number"]} | 52 | Validate postcode |
| /sign-off-support | {"statusCode":400,"message":["Current location is a mandatory field."]} | 40 | Require field |
| /incident | {"statusCode":400,"message":["rescuer up to 20 characters only"]} | 39 | maxlength=20 |
| /sign-on-club | {"statusCode":400,"message":"Invalid Service Club Beach combination"} | 38 | Validate beach |
| /incident | {"statusCode":400,"message":["position up to 30 characters only"]} | 38 | maxlength=30 |
| /incident | {"statusCode":400,"message":["vicAge must be a positive number"]} | 37 | Validate number |
| /sign-off-unit | {"statusCode":400,"message":"Unit is already Sign Off"} | 36 | Disable re-click |
| /patrol-log-members | {"statusCode":400,"message":"Maximum patrol position has reached its limit: 4"} | 32 | Enforce in UI |
| /sign-on-unit | {"statusCode":400,"message":"Unit is already signed on. Please sign off."} | 30 | Check state |
| /patrol-team | {"statusCode":400,"message":["FinishTime is either missing or not after StartTime"]} | 26 | Validate times |
| /patrol-team | {"statusCode":400,"message":"There is already a Patrol Group date with the time you selected."} | 21 | Check conflicts |
| /sso/auth | {"statusCode":403,"message":"The password you entered is too long"} | 20 | Add maxlength |
| /update-lifeguard | {"statusCode":400,"message":"ServiceId is already Signed Off. Cannot Update"} | 17 | Check state |

| Endpoint | Actual API Response Body | Count |
|---|---|---|
| /patrol-log/create | {"statusCode":403,"message":"Patrol Log already exists for this Patrol Group Date. (Patrol Log ID: XXXXX)"} | 1,587 |
| /sso/auth | {"statusCode":403,"message":"Please login to the SLS Hub and change your password."} | 586 |
| /patrol-log/check-exists | {"statusCode":403,"message":"Cannot find PatrolGroupDateID: XXXXX"} | 282 |
| /patrol-log/create | {"statusCode":403,"message":"Cannot find PatrolGroupDateID: XXXXX"} | 148 |
| /sso/auth | {"statusCode":403,"message":"The password you entered is too long"} | 20 |

| Endpoint | Actual API Response Body | Count | Expected? |
|---|---|---|---|
| /sso/auth | {"statusCode":404,"message":"We couldn't find an account with the Username and Password provided"} | 15,969 | Yes (wrong password) |
| /service-status-details | {"statusCode":404,"message":"No available logClubs found"} | 4,488 | Yes (not signed on) |
| /patrol-log-members | {"statusCode":404,"message":"Cannot find MemberID: XXXXX"} | 275 | No (stale data) |
| /patrol-log-members | {"statusCode":404,"message":"Member ID XXXXX was not found in the patrol log with ID XXXXX."} | 261 | No (sync issue) |
| /patrol-log-members | {"statusCode":404,"message":"Member ID XXXXX was not found in patrol log XXXXX under position ID XXXXX."} | 222 | No (sync issue) |
| /sso/verify | {"statusCode":404,"message":"Client not found or invalid access token"} | 186 | Partial (token expiry) |
| /service-status-details | {"statusCode":404,"message":"No available log lifeguards found"} | 56 | Yes (not signed on) |

| Code | Endpoint | Actual Response Body | Count |
|---|---|---|---|
| 500 Internal Server Error: SLSA code bugs & DB issues (66 total) | | | |
| 500 | /sso/auth | {"message":"Internal server error3 - Error: The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 3 (\"@0\"): Data type 0xE7 has an invalid data length"} | 22 |
| 500 | /incident/all | {"message":"Internal server error3 - ER_CON_COUNT_ERROR: Too many connections"} | 17 |
| 500 | /sign-off-lifeguard | {"message":"Internal server error3 - connect ETIMEDOUT"} | 4 |
| 500 | /sign-off-club | {"message":"Internal server error3 - Cannot read properties of undefined (reading 'PatrolDate')"} | 2 |
| 500 | /update-lifeguard | {"message":"Internal server error3 - Cannot read properties of null (reading 'beach_name')"} | 2 |
| 500 | /sso/auth | {"message":"Internal server error3 - Database 'SLS-HUB-PROD' on server 'sls-p-apidb-01' is not currently available."} | 1 |
| 500 | /radio-logs | {"message":"Internal server error3 - ER_SERVER_SHUTDOWN: Server shutdown in progress"} | 1 |
| 502 Bad Gateway: upstream not responding (741 total, 17 endpoints) | | | |
| 502 | /patrol-log-members | "error code: 502" | 331 |
| 502 | /patrol-log/create | "error code: 502" | 100 |
| 502 | /radio-logs | "error code: 502" | 68 |
| 502 | /patrol-log-stats/update | "error code: 502" | 41 |
| 502 | /gear/entity-assets | "error code: 502" | 38 |
| 502 | /sso/auth | "error code: 502" | 30 |
| 502 | + 11 more endpoints (incident/all 20, units-for-entity 18, season-dates 16, sign-on-club 14, update-club 13, sign-off-club 9, ...) | | 113 |
| 503 App Crash: Azure returns HTML error page (322 total, 20 endpoints) | | | |
| 503 | | `<h1 style="color: 747474">:( Application Error</h1><p>If you are the application administrator, you can access the <a href="https://slsaapi.scm.azurewebsites.net/detectors">diagnostic resources</a>.</p>` | 322 |
| Top affected: /patrol-log-members (104), /radio-logs (41), /patrol-log/create (39), /sign-on-club (30), /service-status-details (30), /sso/auth (18), /patrol-log/check-exists (15), /patrol-log-stats (14), /unit-status-details (7), + 11 more | | | |
| 524 Cloudflare Timeout (226 total) | | | |
| 524 | /patrol-log/create | "error code: 524" | 168 |
| 524 | /patrol-log-members | "error code: 524" | 33 |
| 524 | /patrol-team | "error code: 524" | 16 |
| 524 | + /sso/auth (4), /patrols/roster (3), /incident/all (1), /radio-logs (1) | | 9 |

503 responses are Azure "Application Error" HTML pages (diagnostics link points to slsaapi.scm.azurewebsites.net).
"Too many connections" (17x) = MySQL pool exhaustion.
TDS protocol error (22x) = SQL Server param type mismatch.
"Server shutdown in progress" (1x) = DB restart during query.
"SLS-HUB-PROD not available" (1x) = Production DB outage.
Null reference bugs: reading 'PatrolDate' and reading 'beach_name' = JS null pointer in SLSA code.
404 Invalid credentials (15,969): Users entered a wrong username or password. This is an expected user-input error, not a system bug. The rate is stable at ~40% of auth attempts.
403 Password change required (586): SLSA Hub requires a password update. Users are redirected to hub.sls.com.au.
404 Token expired (186): SSO verify fails when the token expires between the auth and verify calls.
Impact: Users see "login failed" — no data loss. But high volume (12K/month) suggests UX could guide users better.
Recommendation: Add "Forgot password?" link, rate limit after 5 failures, show SLS Hub redirect for 403 errors.
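A minimal sketch of how the login screen could act on this recommendation, mapping each /sso/auth outcome to a UI action. The action names and the function `login_feedback` are hypothetical; the message text checked for the 403 case is taken from the response body logged in this report.

```python
MAX_FAILED_ATTEMPTS = 5  # "rate limit after 5 failures" from the recommendation


def login_feedback(status_code, message, failed_attempts):
    """Map an /sso/auth response to a UI action (action names are hypothetical).

    failed_attempts counts consecutive failed logins, including this one.
    """
    if 200 <= status_code < 300:
        return "proceed"
    if status_code == 403 and "change your password" in message:
        return "redirect_to_sls_hub"
    if status_code == 404:
        if failed_attempts >= MAX_FAILED_ATTEMPTS:
            return "rate_limited"
        return "show_forgot_password"
    return "show_generic_error"
```

Checking the message as well as the status code matters here because /sso/auth also returns 403 for "The password you entered is too long", which should not redirect the user to the Hub.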
400 — Position limit reached (755): Attempting to add more members than the position allows (e.g., max 3 IRB Drivers). Frontend doesn't enforce limits.
400 — Time validation (401): FinishTime missing or before StartTime. Mostly Clifton Beach Dec 2025 (resolved).
502/503/524 — Server errors (468): SLSA infrastructure failures. 502 Bad Gateway (331), 503 Application Error (104), 524 Cloudflare timeout (33).
Impact: Member not synced to SLSA central. Local roster shows the member but central doesn't know.
Recommendation:
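A minimal sketch of pre-submit checks for the position-limit and time-validation errors described above, assuming the client already knows each position's limit and current member count. The function name, parameters, and user-facing strings are hypothetical.

```python
from datetime import datetime


def validate_patrol_member(start, finish, position_count, position_limit):
    """Return a list of problems; an empty list means the request can be sent."""
    problems = []
    if start is None or finish is None:
        problems.append("Start and finish times are required.")
    elif finish <= start:
        # Mirrors the API rule "FinishTime is either missing or not after StartTime".
        problems.append("Finish time must be after start time.")
    if position_count >= position_limit:
        # Mirrors "Maximum patrol position has reached its limit: N".
        problems.append(f"Position is full ({position_count}/{position_limit}).")
    return problems
```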
403 — Duplicate log / invalid ID (~900): "Patrol Log already exists" — expected duplicate detection. Also "Cannot find PatrolGroupDateID" when patrol doesn't exist in central system.
502/524 — Server timeouts (~268): SLSA server can't process the request in time. Avg response 1.99s, max 126s.
503 — Application crash (39): Full SLSA app error page returned instead of JSON.
Trend: 91.5% (Nov) → 84.3% (Feb) → 80.6% (Mar)
This is the most concerning endpoint. Success rate has dropped 11 percentage points over 5 months.
Recommendation:
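One way to handle the duplicate-log 403 gracefully is to read the existing Patrol Log ID out of the error message and open that log instead of showing an error. The sketch below assumes the message format logged in this report stays stable; the function name is hypothetical.

```python
import re

# Matches the 403 body logged for /patrol-log/create in this report.
_ALREADY_EXISTS = re.compile(r"Patrol Log already exists.*\(Patrol Log ID: (\w+)\)")


def existing_patrol_log_id(status_code, body):
    """Return the existing Patrol Log ID from a duplicate 403, or None otherwise."""
    if status_code != 403:
        return None
    match = _ALREADY_EXISTS.search(str(body.get("message", "")))
    return match.group(1) if match else None
```

A `None` result for a 403 (for example "Cannot find PatrolGroupDateID") still needs its own handling.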
Lifeguard sign-on (635 errors, 88.2%): Primary contact validation (548), already signed on (77).
Club sign-on (484, 95.0%): Already signed on (402), invalid beach combo (38).
Sign-off club/lifeguard (560): "Already Signed Off" — double sign-off attempts.
Root cause: The app doesn't check the current sign-on/off state before submitting to the API. Users can click sign-on twice, or try to sign off a patrol that's already signed off.
Recommendation:
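A minimal sketch of a client-side guard that addresses both causes named above: it refuses an action that does not match the known state, and it blocks a second click while a request is in flight. The class `SignOnGuard` and its method names are hypothetical.

```python
class SignOnGuard:
    """Tracks sign-on state for one service and blocks redundant requests."""

    def __init__(self, signed_on=False):
        self.signed_on = signed_on
        self.in_flight = False

    def try_begin(self, action):
        """Return True if the request for "sign_on" or "sign_off" should be sent."""
        if self.in_flight:
            return False  # double-click while a request is pending
        if action == "sign_on" and self.signed_on:
            return False  # would yield "already signed on"
        if action == "sign_off" and not self.signed_on:
            return False  # would yield "already Signed Off"
        self.in_flight = True
        return True

    def finish(self, action, ok):
        """Call when the response arrives; updates state only on success."""
        self.in_flight = False
        if ok:
            self.signed_on = action == "sign_on"
```

The local state can still go stale if another device signs the patrol on or off, so the API's 400 response remains the final check.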
Field length violations:
Root cause: Frontend text inputs don't enforce API field length limits. The "undefined injury nature id" error (162 occurrences) is likely a JavaScript bug sending undefined values.
Recommendation:
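A minimal sketch of incident-form validation using the field limits that appear in the API's 400 responses in this report (location 50, firstSightedBy 60, rescuer 20, position 30). The form key `injuryNatureId` is an assumed name for the field behind the "undefined is invalid injury nature id" error.

```python
# Limits taken from the 400 response messages logged for /incident.
INCIDENT_MAX_LENGTHS = {
    "location": 50,
    "firstSightedBy": 60,
    "rescuer": 20,
    "position": 30,
}


def validate_incident(form):
    """Return a list of validation errors for an incident form dict."""
    errors = []
    for field, limit in INCIDENT_MAX_LENGTHS.items():
        value = form.get(field) or ""
        if len(value) > limit:
            errors.append(f"{field} must be {limit} characters or fewer")
    # Assumed key name; guards against sending an undefined injury nature.
    if form.get("injuryNatureId") is None:
        errors.append("injury nature must be selected")
    return errors
```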
502 Bad Gateway (331 on /patrol-log-members, 100 on /patrol-log/create, 68 on /radio-logs, others): SLSA upstream server not responding. Proxy returns generic 502.
524 Cloudflare Timeout (168 on /patrol-log/create, 33 on /patrol-log-members): Request exceeded Cloudflare's timeout threshold.
503 Application Error (104 on /patrol-log-members, 41 on /radio-logs, 39 on /patrol-log/create): Full HTML error page returned instead of JSON — SLSA app crash.
Impact: These are SLSA server-side issues, not Operations App bugs. However, the app should handle them gracefully.
Monthly 5xx trend:
Recommendation: Implement retry with backoff for 5xx. Parse HTML error pages to detect app crashes vs gateway errors. Report 5xx rates to SLSA API team.
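A minimal sketch of this recommendation: classify the 5xx response by status and body, and retry with exponential backoff. The function names, attempt count, and delays are assumptions. Retrying a create request that timed out may hit a request that actually succeeded upstream, in which case the retry can come back as "Patrol Log already exists", so callers need to treat that response as success.

```python
import time


def classify_5xx(status_code, body_text):
    """Label a server error using the response shapes logged in this report."""
    if status_code == 503 and "Application Error" in body_text:
        return "app_crash"      # Azure HTML error page
    if status_code == 502:
        return "bad_gateway"
    if status_code == 524:
        return "timeout"        # Cloudflare timeout
    if 500 <= status_code < 600:
        return "server_error"
    return None


def call_with_retry(send, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call send() until it returns a non-5xx status or attempts run out.

    send is any callable returning (status_code, body_text).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status < 500 or attempt == max_attempts - 1:
            return status, body
        sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```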
| Root Cause | 6-Month Total | % of Errors | Actionable? | Who Fixes? |
|---|---|---|---|---|
| Failed logins (wrong credentials) | 15,969 | 44.0% | Expected | User education / UX |
| Unsigned patrol status check (404) | 4,544 | 12.5% | Expected | Handle 404 as "not signed on" |
| SLSA server errors (502/503/524) | 1,089 | 3.0% | Infra | SLSA API team |
| Duplicate state (already signed on/off) | 1,005 | 2.8% | Check First | Nano — pre-check state |
| Patrol log duplicates/not found | 900 | 2.5% | Handle | Nano — handle 403 gracefully |
| Position limits exceeded | 755 | 2.1% | Prevent | Nano — frontend validation |
| Password change required | 586 | 1.6% | Expected | User redirect to SLS Hub |
| Validation failures (field length etc) | 559 | 1.5% | Prevent | Nano — input validation |
| Lifeguard primary contact validation | 548 | 1.5% | Prevent | Nano — fix form field |
| Time validation errors | 401 | 1.1% | Fixed | Resolved (Clifton Beach) |
| TOTAL ERRORS | 36,281 | 100% | 56.5% expected/user-error • 43.5% actionable | |
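
The second row of the table suggests treating the /service-status-details 404 as a normal "not signed on" result. A minimal sketch, assuming a 200 response means the patrol is signed on; the function name and return labels are hypothetical.

```python
def patrol_sign_on_state(status_code, body):
    """Interpret a /service-status-details response without raising on the expected 404."""
    message = str(body.get("message", "")) if isinstance(body, dict) else ""
    if status_code == 200:
        return "signed_on"  # assumption: a 200 means a sign-on record exists
    # Covers "No available logClubs found" and "No available log lifeguards found".
    if status_code == 404 and message.startswith("No available log"):
        return "not_signed_on"
    return "error"
```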
Between 9–11pm on Jan 10, the SLSA API buckled — 63 Cloudflare timeouts on /patrol-log/create, then 502s cascaded to /patrol-log-members and /radio-logs. 135 unique users affected. This was a precursor to the larger Jan 17-18 incident.
Recommendation: Circuit breaker — if 5xx rate exceeds 5% in 5 minutes, queue requests and retry with backoff.
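A minimal sketch of the circuit breaker described above: it opens when the 5xx rate passes 5% within a 5-minute window and lets traffic through again after a cooldown. The class name, the `cooldown_s` value, and the `min_calls` floor are assumptions; queueing of the blocked requests is left to the caller.

```python
from collections import deque


class CircuitBreaker:
    """Opens when the 5xx rate in the window exceeds the threshold."""

    def __init__(self, threshold=0.05, window_s=300, cooldown_s=60, min_calls=20):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s  # assumed value, not from the report
        self.min_calls = min_calls    # assumed floor to avoid tripping on tiny samples
        self.calls = deque()          # (timestamp_s, is_5xx)
        self.opened_at = None

    def allow(self, now):
        """Return True if a request may be sent; False means queue it for later."""
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_s:
            self.opened_at = None  # cooldown over: close and start a fresh window
            self.calls.clear()
            return True
        return False

    def record(self, now, status_code):
        """Record a response and open the breaker if the 5xx rate is too high."""
        self.calls.append((now, status_code >= 500))
        while self.calls and self.calls[0][0] <= now - self.window_s:
            self.calls.popleft()
        total = len(self.calls)
        errors = sum(1 for _, failed in self.calls if failed)
        if total >= self.min_calls and errors / total > self.threshold:
            self.opened_at = now
```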
The top 15 users account for 885 errors (28% of user-triggered errors). These are patrol captains retrying the same failed action — the app doesn't explain what went wrong, so they click again hoping it'll work.
Recommendation: After 3 consecutive errors, show contextual help. Track error-per-user rates for proactive club outreach.
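A minimal sketch of the consecutive-error trigger, assuming errors are tracked per user in memory. The class and method names are hypothetical.

```python
from collections import defaultdict


class ConsecutiveErrorTracker:
    """Signals when a user hits the limit of back-to-back errors."""

    def __init__(self, limit=3):
        self.limit = limit
        self.streaks = defaultdict(int)  # user_id -> current error streak

    def record(self, user_id, ok):
        """Record one request outcome; return True if contextual help should be shown."""
        if ok:
            self.streaks[user_id] = 0
            return False
        self.streaks[user_id] += 1
        return self.streaks[user_id] >= self.limit
```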
/patrol-log/create, /patrol-team, /patrol-log-members
Recommendation: Client-side timeout at 10s with "Still working..." message. Validate before sending to catch 400s at 0ms.

| Strategy | Errors Prevented | % | Effort | Priority |
|---|---|---|---|---|
| Handle duplicates gracefully | 1,587 | 4.4% | Small | HIGH |
| Retry 5xx with backoff | 1,312 | 3.6% | Medium | HIGH |
| Check state before API call | 1,119 | 3.1% | Small | HIGH |
| Frontend limit enforcement | 755 | 2.1% | Small | MED |
| Input validation (maxlength) | 721 | 2.0% | Small | MED |
| Fix primary contact dropdown | 548 | 1.5% | Tiny | MED |
| Fix undefined injury JS bug | 162 | 0.4% | Tiny | HIGH |
| TOTAL PREVENTABLE | 6,204 | 17.1% | Rest: expected 56.5% + SLSA infra 3.6% | |
Double-clicks, position limits, timeouts. Fix: Disable button after click, show position counts, loading timeout message.
"Already signed off" double-clicks, missing location. Fix: Disable after click, auto-populate location from GPS.
1,089 server errors (3%) from SLSA's Azure API. Key issues: 503 app crashes return HTML (detect and handle), 502 bad gateway (retry twice), 524 Cloudflare timeouts (client-side timeout at 15s), "Too many connections" (17×, MySQL pool), "SLS-HUB-PROD not available" (1×, full DB outage).
Recommendation: Monthly infra report to SLSA API team. Advocate for: higher connection pool, Azure scaling rules, Cloudflare timeout increase.