Surf Life Saving

SLSA API Error Analysis Dashboard

6-Month Report — October 2025 to March 2026

IT Consulting Audit Nano
[Dashboard widgets not captured in this text export: summary cards (Total API Calls, Total Errors, Success Rate, Position Error Rate); charts (API Traffic & Error Rate, Position Error Rate Trend, HTTP Response Codes, Error Categories); tables (Problem Endpoints, Top Affected Clubs).]
Key Incidents

Dec 2025: Clifton Beach

240 time validation errors from a single entity (Clifton Beach Lifeguard). Anomalous spike caused by misconfigured patrol time submissions that triggered validation failures repeatedly.

Resolved by January 2026

Jan 17–18: Database Performance

Unindexed database queries caused server overload during peak patrol period (816 + 827 concurrent patrols). Cascading 502/503 errors across multiple endpoints until emergency indexing was applied.

Resolved in v2.13.0
Recommendations
Priority | Action Item | Expected Impact
HIGH | Continue database query optimization and indexing strategy. Monitor slow query log and add indexes proactively for new features. | Prevent repeat of Jan 17-18 outage pattern
HIGH | Implement rate limiting and circuit breakers for /sso/auth endpoint (59.2% success rate is critically low). | Reduce auth cascade failures, improve user experience
HIGH | Add automated alerting for error rate spikes above 3% on any endpoint within a 15-minute window. | Faster incident detection and response times
MED | Review /patrol-log/create validation logic. Success rate declined from 91.5% (Nov) to 80.6% (Mar). | Arrest declining patrol log creation reliability
MED | Implement client-side validation for "signed on elsewhere" and "duplicate patrol log" errors to reduce unnecessary API calls. | Cut about 40% of position-related errors at the source
MED | Add structured error tracking per club to enable proactive outreach for clubs with persistent issues. | Reduce support burden, improve club satisfaction
LOW | Investigate 502/524 timeout errors; consider increasing upstream timeouts or adding request queuing for heavy operations. | Reduce intermittent gateway errors during peak load
LOW | Implement API versioning strategy to support backward-compatible changes without breaking existing integrations. | Safer deployments, reduced regression risk
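The alerting recommendation above (error rate above 3% on any endpoint within a 15-minute window) could be prototyped roughly as follows. This is a minimal sketch, not the dashboard's actual implementation: the class name `EndpointWindow` and the `MIN_CALLS` floor are assumptions introduced here so that a handful of calls cannot trigger an alert on their own.

```python
from collections import deque
from dataclasses import dataclass, field

WINDOW_SECONDS = 15 * 60  # 15-minute window from the recommendation
THRESHOLD = 0.03          # alert above a 3% error rate
MIN_CALLS = 50            # hypothetical floor so tiny samples do not alert


@dataclass
class EndpointWindow:
    """Sliding window of (timestamp, is_error) pairs for one endpoint."""
    events: deque = field(default_factory=deque)

    def record(self, ts: float, is_error: bool) -> None:
        self.events.append((ts, is_error))
        # Drop events that have fallen out of the 15-minute window.
        while self.events and self.events[0][0] <= ts - WINDOW_SECONDS:
            self.events.popleft()

    def error_rate(self) -> float:
        if not self.events:
            return 0.0
        errors = sum(1 for _, is_error in self.events if is_error)
        return errors / len(self.events)

    def should_alert(self) -> bool:
        return len(self.events) >= MIN_CALLS and self.error_rate() > THRESHOLD
```

One window per endpoint keeps a noisy endpoint such as /sso/auth from masking a spike elsewhere. Endpoints whose baseline already sits above 3% would likely need their own threshold.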
When Do Errors Happen? — Time & Day Patterns

Errors by Hour of Day (AEST)

Error rate per hour across all 6 months. Note: times are server time (AEST/Australia Sydney).

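Hour | Calls | Errors | Rate
7am | 19K | 780 | 4.03%
8am | 316K | 525 | 0.17%
9am | 179K | 461 | 0.26%
10am | 7K | 269 | 3.67%
11am | 6K | 485 | 7.55%
12pm | 2K | 309 | 19.4%
1pm | 1K | 50 | 5.36%
2pm | 276K | 21 | 0.01%
3pm | 14K | 11 | 0.08%
7pm | 91K | 4,857 | 5.34%
8pm | 15K | 818 | 5.50%
9pm | 73K | 3,280 | 4.47%
10pm | 98K | 4,503 | 4.58%
11pm | 49K | 2,024 | 4.14%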
Two error peaks: 11am–12pm (sign-off rush, up to 19.4% error rate) and 9–10pm (sign-on rush + cron sync window, 4.5% error rate). Best performance at 8am and 2pm (<0.2% rate) — high-volume cron syncs with clean data.

Errors by Day of Week

Total API errors per day of week across all 6 months.

Day | Errors
Saturday | 10,310
Sunday | 7,974
Friday | 4,556
Thursday | 2,565
Monday | 2,323
Wednesday | 2,171
Tuesday | 1,739
Saturday has 6x more errors than Tuesday. Weekends are peak patrol days — more sign-ons = more errors. This is proportional to volume, not a system issue.

Sign-On vs Sign-Off Error Timing — The Two Rush Hours

When do sign-on and sign-off errors occur? This reveals the operational rhythm.

Sign-On Errors (Peak: 9–11pm AEST)

Hour | Sign-On Errors
9pm | 249
10pm | 424
11pm | 114
12am | 78
1am | 147
2am | 139
Why 9–11pm? Patrol captains sign on for the next day's roster in the evening. The cron sync at 8:45pm (staging) / 5:30am (prod) creates new patrol logs, and captains immediately try to sign on — collision with sync window.

Sign-Off Errors (Peak: 11am–12pm AEST)

Hour | Sign-Off Errors
5am | 43
6am | 82
7am | 67
10am | 22
11am | 311
Why 11am? Morning patrol shift (6am–11am) ends. Patrol captains all sign off around 11am–12pm. Common errors: "ServiceId is already Signed Off" (double-click) and "Current location is a mandatory field" (missing data).
The 7pm spike (4,857 errors, 5.34% rate) is the cron sync window — api:patrols, api:entities, api:members all run between 7–9pm. The /service-status-details endpoint alone accounts for 4,125 of the 4,857 errors at 7pm, all returning "No available logClubs found" (404) — expected for patrols that haven't been signed on yet.
Error Source — Cron Jobs vs User Actions

Who Triggers the Errors? (January 2026 sample)

Using user_id presence to distinguish user-triggered vs system/cron errors.

Cron / System: 66% (6,336 errors, no user_id)
User-Triggered: 34% (3,203 errors, has user_id)
Cron errors (66%): Mostly expected — failed logins during member sync (4,602), unsigned patrol status checks (1,328), expired tokens. Users never see these errors.
User errors (34%): Directly impact the user experience — position limits (886), patrol log duplicates (637), sign-on validation (292+85), incidents (235). These need better error messages.
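The split above relies on a single rule: an error with a user_id is user-triggered, and anything else is treated as cron/system. A minimal sketch of that rule is shown below. The record shape (a mapping with an optional `user_id` key) and both function names are assumptions for illustration; the real log schema is not shown in this report.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def error_source(record: Mapping[str, Any]) -> str:
    """Label an error 'user' when user_id is present, otherwise 'cron'."""
    return "user" if record.get("user_id") is not None else "cron"


def source_breakdown(records: Iterable[Mapping[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Return the error count and share for each source."""
    counts = Counter(error_source(r) for r in records)
    total = sum(counts.values())
    return {
        source: {"errors": n, "share": n / total}
        for source, n in counts.items()
    }
```

Applied to the January sample (6,336 records without a user_id and 3,203 with one), the shares round to 66% and 34%. The rule would mislabel any user action that is logged without a user_id, so the split is best read as an estimate.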

What Users See vs What Runs Silently

Source | Endpoint | Code | Errors | User Impact
User-Triggered (users see these errors)
User | /patrol-log-members | 400 | 886 | Position limit / signed elsewhere / time error
User | /patrol-log/create | 403 | 637 | "Patrol log already exists" (confusing for user)
User | /sign-on-lifeguard | 400 | 292 | "Primary contact required" (form incomplete)
User | /incident | 400 | 235 | Field too long / undefined injury (form error)
User | /patrol-log-members | 404 | 223 | "Member not found" (stale data shown to user)
User | /patrol-team | 400 | 132 | Time conflict / duplicate patrol time
User | /patrol-log/create | 524 | 100 | Timeout (user sees spinner then generic error)
User | /sign-on-club | 400 | 85 | "Already signed on" (double-click)
Cron/System (users never see these)
Cron | /sso/auth | 404 | 4,602 | Silent: member sync login attempts
Cron | /service-status-details | 404 | 1,328 | Silent: checking unsigned patrols
Cron | /sso/auth | 403 | 150 | Silent: password change needed
Cron | /sign-off-lifeguard | 400 | 114 | Silent: auto sign-off attempts
Cron | /patrol-log/check-exists | 403 | 47 | Silent: patrol log validation
Error Timeline — First & Last Spotted

When each error pattern was first and last observed. Span shows how long the issue has persisted. Green = resolved, Red = still active.

Error Pattern | Endpoint | Code | Count | First Seen | Last Seen | Span | Status
Position limit: 3 | /patrol-log-members | 400 | 669 | 2025-10-03 | 2026-02-28 | 148d | ACTIVE
Primary contact validation | /sign-on-lifeguard | 400 | 548 | 2025-09-30 | 2026-02-26 | 149d | ACTIVE
Signed on elsewhere | /patrol-log-members | 400 | 519 | 2025-09-30 | 2026-02-28 | 151d | ACTIVE
Already signed on (club) | /sign-on-club | 400 | 402 | 2025-10-03 | 2026-02-28 | 148d | ACTIVE
Time validation | /patrol-log-members | 400 | 401 | 2025-10-04 | 2026-01-21 | 109d | FIXED
Already signed off (LG) | /sign-off-lifeguard | 400 | 287 | 2025-10-02 | 2026-02-28 | 149d | ACTIVE
Invalid unitId | /unit-status-details | 400 | 235 | 2025-11-01 | 2026-02-27 | 118d | ACTIVE
Location too long | /incident | 400 | 218 | 2025-11-04 | 2026-02-16 | 104d | ACTIVE
Invalid token (SSO verify) | /sso/verify | 404 | 186 | 2025-10-01 | 2025-11-18 | 48d | GONE
524 Cloudflare Timeout | /patrol-log/create | 524 | 168 | 2026-01-10 | 2026-02-21 | 42d | NEW
Duplicate patrol time | /patrol-team | 400 | 165 | 2025-12-19 | 2026-02-22 | 65d | NEW
Undefined injury nature | /incident | 400 | 162 | 2025-11-01 | 2026-02-26 | 117d | BUG
Invalid beach combo | /sign-on-club | 400 | 38 | 2025-12-06 | 2025-12-14 | 8d | GONE
524 Cloudflare on patrol-team | /patrol-team | 524 | 16 | 2026-01-10 | 2026-01-10 | 1d | 1-DAY
TDS Protocol Error | /sso/auth | 500 | 22 | 2025-11-26 | 2025-11-26 | 1d | 1-DAY
DB Unavailable (SLS-HUB-PROD) | /sso/auth | 500 | 1 | 2025-12-07 | 2025-12-07 | 1d | 1-DAY
Server Shutdown | /radio-logs | 500 | 1 | 2025-11-10 | 2025-11-10 | 1d | 1-DAY
Key observations:

  • Time validation: fixed after Jan 21 (Clifton Beach resolved).
  • 524 Cloudflare timeouts: first appeared Jan 10; a new issue, still active.
  • TDS Protocol, DB unavailable, Server Shutdown: single-day incidents on the SLSA side.
  • Undefined injury nature: persistent JS bug active for 117 days, needs a fix.
  • Invalid beach combo: appeared Dec 6, gone by Dec 14 (8-day issue, likely a config fix).
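The Span column in the timeline is consistent with the number of days between First Seen and Last Seen, with same-day incidents shown as 1d. That reading is inferred from the table rather than stated by the dashboard. A small sketch of the calculation, using a hypothetical helper name:

```python
from datetime import date


def span_days(first_seen: str, last_seen: str) -> int:
    """Days between two ISO dates; same-day incidents count as 1 day."""
    days = (date.fromisoformat(last_seen) - date.fromisoformat(first_seen)).days
    return max(days, 1)


# Position limit: 3 -> first seen 2025-10-03, last seen 2026-02-28
print(span_days("2025-10-03", "2026-02-28"))  # 148
```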
Complete Error Catalog — Every API Error by HTTP Code

400 Bad Request — Validation Failures (6,058 total across 6 months)

Client sent invalid data. Most are preventable with frontend validation.

Endpoint | Actual API Response Body | Count | Fix
/patrol-log-members | {"statusCode":400,"message":"Maximum patrol position has reached its limit: 3"} | 669 | Enforce limit in UI
/sign-on-lifeguard | {"statusCode":400,"message":["Primary contact must be one of: Radio, Mobile, Landline, SMR, or None.","Primary contact is required."]} | 548 | Fix dropdown
/patrol-log-members | {"statusCode":400,"message":"This member is currently signed on to another patrol (XXXXX)"} | 519 | Check before add
/sign-on-club | {"statusCode":400,"message":"Club is already signed on. Please sign off."} | 402 | Check state
/patrol-log-members | {"statusCode":400,"message":["FinishTime is either missing or not after StartTime","StartTime is either missing or not before FinishTime"]} | 401 | Validate times
/sign-off-lifeguard | {"statusCode":400,"message":"ServiceId is already Signed Off."} | 287 | Disable re-click
/sign-off-club | {"statusCode":400,"message":"ServiceId is already Signed Off."} | 239 | Disable re-click
/unit-status-details | {"statusCode":400,"message":"Invalid unitId not type Unit"} | 235 | Validate unit type
/incident | {"statusCode":400,"message":["location up to 50 characters only"]} | 218 | maxlength=50
/incident | {"statusCode":400,"message":"undefined is invalid injury nature id"} | 162 | JS BUG
/sign-on-support | {"statusCode":400,"message":["Odometer must be a valid integer."]} | 114 | Validate number
/patrol-log-members | {"statusCode":400,"message":"Member ID XXXXX does not exist for substitution."} | 111 | Validate member
/sign-on-lifeguard | {"statusCode":400,"message":"Lifeguard is already signed on. Please sign off."} | 77 | Check state
/incident | {"statusCode":400,"message":["firstSightedBy up to 60 characters only"]} | 65 | maxlength=60
/update-support | {"statusCode":400,"message":"Arrival time must be after the current datetime"} | 65 | Validate datetime
/radio-logs | {"statusCode":400,"message":["orgIds must be a number, comma-separated integers"]} | 59 | Fix param format
/patrol-log-members | {"statusCode":400,"message":"Maximum patrol position has reached its limit: 1"} | 54 | Enforce in UI
/rescues/add | {"statusCode":400,"message":["postalPostCode must be an integer","must be a positive number"]} | 52 | Validate postcode
/sign-off-support | {"statusCode":400,"message":["Current location is a mandatory field."]} | 40 | Require field
/incident | {"statusCode":400,"message":["rescuer up to 20 characters only"]} | 39 | maxlength=20
/sign-on-club | {"statusCode":400,"message":"Invalid Service Club Beach combination"} | 38 | Validate beach
/incident | {"statusCode":400,"message":["position up to 30 characters only"]} | 38 | maxlength=30
/incident | {"statusCode":400,"message":["vicAge must be a positive number"]} | 37 | Validate number
/sign-off-unit | {"statusCode":400,"message":"Unit is already Sign Off"} | 36 | Disable re-click
/patrol-log-members | {"statusCode":400,"message":"Maximum patrol position has reached its limit: 4"} | 32 | Enforce in UI
/sign-on-unit | {"statusCode":400,"message":"Unit is already signed on. Please sign off."} | 30 | Check state
/patrol-team | {"statusCode":400,"message":["FinishTime is either missing or not after StartTime"]} | 26 | Validate times
/patrol-team | {"statusCode":400,"message":"There is already a Patrol Group date with the time you selected."} | 21 | Check conflicts
/sso/auth | {"statusCode":403,"message":"The password you entered is too long"} | 20 | Add maxlength (returned as 403; also listed in the 403 table)
/update-lifeguard | {"statusCode":400,"message":"ServiceId is already Signed Off. Cannot Update"} | 17 | Check state

403 Forbidden — Access Denied / Duplicates (2,823 total)

Endpoint | Actual API Response Body | Count
/patrol-log/create | {"statusCode":403,"message":"Patrol Log already exists for this Patrol Group Date. (Patrol Log ID: XXXXX)"} | 1,587
/sso/auth | {"statusCode":403,"message":"Please login to the SLS Hub and change your password."} | 586
/patrol-log/check-exists | {"statusCode":403,"message":"Cannot find PatrolGroupDateID: XXXXX"} | 282
/patrol-log/create | {"statusCode":403,"message":"Cannot find PatrolGroupDateID: XXXXX"} | 148
/sso/auth | {"statusCode":403,"message":"The password you entered is too long"} | 20

404 Not Found — Resource Missing (21,721 total)

Endpoint | Actual API Response Body | Count | Expected?
/sso/auth | {"statusCode":404,"message":"We couldn't find an account with the Username and Password provided"} | 15,969 | Yes (wrong password)
/service-status-details | {"statusCode":404,"message":"No available logClubs found"} | 4,488 | Yes (not signed on)
/patrol-log-members | {"statusCode":404,"message":"Cannot find MemberID: XXXXX"} | 275 | No (stale data)
/patrol-log-members | {"statusCode":404,"message":"Member ID XXXXX was not found in the patrol log with ID XXXXX."} | 261 | No (sync issue)
/patrol-log-members | {"statusCode":404,"message":"Member ID XXXXX was not found in patrol log XXXXX under position ID XXXXX."} | 222 | No (sync issue)
/sso/verify | {"statusCode":404,"message":"Client not found or invalid access token"} | 186 | Partial (token expiry)
/service-status-details | {"statusCode":404,"message":"No available log lifeguards found"} | 56 | Yes (not signed on)

5xx Server Errors — SLSA Infrastructure (1,089 total — Azure: slsaapi.scm.azurewebsites.net)

Code | Endpoint | Actual Response Body | Count
500 Internal Server Error: SLSA code bugs & DB issues (66 total)
500 | /sso/auth | {"message":"Internal server error3 - Error: The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 3 (\"@0\"): Data type 0xE7 has an invalid data length"} | 22
500 | /incident/all | {"message":"Internal server error3 - ER_CON_COUNT_ERROR: Too many connections"} | 17
500 | /sign-off-lifeguard | {"message":"Internal server error3 - connect ETIMEDOUT"} | 4
500 | /sign-off-club | {"message":"Internal server error3 - Cannot read properties of undefined (reading 'PatrolDate')"} | 2
500 | /update-lifeguard | {"message":"Internal server error3 - Cannot read properties of null (reading 'beach_name')"} | 2
500 | /sso/auth | {"message":"Internal server error3 - Database 'SLS-HUB-PROD' on server 'sls-p-apidb-01' is not currently available."} | 1
500 | /radio-logs | {"message":"Internal server error3 - ER_SERVER_SHUTDOWN: Server shutdown in progress"} | 1
502 Bad Gateway: upstream not responding (741 total, 17 endpoints)
502 | /patrol-log-members | "error code: 502" | 331
502 | /patrol-log/create | "error code: 502" | 100
502 | /radio-logs | "error code: 502" | 68
502 | /patrol-log-stats/update | "error code: 502" | 41
502 | /gear/entity-assets | "error code: 502" | 38
502 | /sso/auth | "error code: 502" | 30
502 | + 11 more endpoints (incident/all 20, units-for-entity 18, season-dates 16, sign-on-club 14, update-club 13, sign-off-club 9, ...) | | 113
503 App Crash: Azure returns HTML error page (322 total, 20 endpoints)
503 | (all endpoints) | <h1 style="color: 747474">:( Application Error</h1><p>If you are the application administrator, you can access the <a href="https://slsaapi.scm.azurewebsites.net/detectors">diagnostic resources</a>.</p> | 322
Top affected: /patrol-log-members (104), /radio-logs (41), /patrol-log/create (39), /sign-on-club (30), /service-status-details (30), /sso/auth (18), /patrol-log/check-exists (15), /patrol-log-stats (14), /unit-status-details (7), + 11 more
524 Cloudflare Timeout (226 total)
524 | /patrol-log/create | "error code: 524" | 168
524 | /patrol-log-members | "error code: 524" | 33
524 | /patrol-team | "error code: 524" | 16
524 | + /sso/auth (4), /patrols/roster (3), /incident/all (1), /radio-logs (1) | | 9
5xx Key Findings:

  • SLSA API runs on Azure App Service (the crash page leaks slsaapi.scm.azurewebsites.net).
  • "Too many connections" (17x) = MySQL pool exhaustion.
  • TDS protocol error (22x) = SQL Server param type mismatch.
  • "Server shutdown in progress" (1x) = DB restart during query.
  • "SLS-HUB-PROD not available" (1x) = Production DB outage.
  • Null reference bugs: reading 'PatrolDate' and reading 'beach_name' = null/undefined property access in SLSA's JavaScript code.
Root Cause Analysis by Error Type

Authentication Errors — 16,741 events (46% of all errors)

404 — Invalid credentials (15,969): Users entering wrong username/password. This is expected user-input error, not a system bug. Rate is stable at ~40% of auth attempts.

403 — Password change required (586): SLSA Hub requires password update. Users redirected to hub.sls.com.au.

404 — Token expired (186): SSO verify fails when token has expired between auth and verify calls.

Impact: Users see "login failed", with no data loss. But the volume (16,741 events over the period, roughly 2.8K per month on average) suggests the UX could guide users better.

Recommendation: Add "Forgot password?" link, rate limit after 5 failures, show SLS Hub redirect for 403 errors.
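The "rate limit after 5 failures" part of this recommendation could look roughly like the sketch below. It is illustrative only: the class name, the 15-minute look-back window, and keying by username are assumptions, and a production limiter would more likely live server-side with shared storage.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict

MAX_FAILURES = 5          # from the recommendation
WINDOW_SECONDS = 15 * 60  # hypothetical look-back window


class LoginFailureLimiter:
    """Blocks further login attempts after MAX_FAILURES recent failures."""

    def __init__(self, now: Callable[[], float] = time.monotonic) -> None:
        self._now = now
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def _recent(self, username: str) -> Deque[float]:
        failures = self._failures[username]
        cutoff = self._now() - WINDOW_SECONDS
        # Forget failures that are older than the look-back window.
        while failures and failures[0] <= cutoff:
            failures.popleft()
        return failures

    def record_failure(self, username: str) -> None:
        self._recent(username).append(self._now())

    def record_success(self, username: str) -> None:
        self._failures.pop(username, None)

    def is_blocked(self, username: str) -> bool:
        return len(self._recent(username)) >= MAX_FAILURES
```

When `is_blocked` returns True, the login form can surface the "Forgot password?" path instead of sending another /sso/auth request.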

Patrol Position Errors — 4,197 events on /patrol-log-members

400 — Position limit reached (755): Attempting to add more members than the position allows (e.g., max 3 IRB Drivers). Frontend doesn't enforce limits.

400 — Time validation (401): FinishTime missing or before StartTime. Mostly Clifton Beach Dec 2025 (resolved).

502/503/524 — Server errors (468): SLSA infrastructure failures. 502 Bad Gateway (331), 503 Application Error (104), 524 Cloudflare timeout (33).

Impact: Member not synced to SLSA central. Local roster shows the member but central doesn't know.

Recommendation:

  • Fetch position limits and validate in frontend
  • Add retry with exponential backoff for 5xx errors
  • Validate time fields before API submission
  • Show specific error message to user, not generic failure
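The retry recommendation in the list above can be sketched as exponential backoff with jitter. The `send` callable, its `(status, body)` return shape, and the default delays are assumptions for illustration; only the retryable codes (502, 503, 524) come from this report.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE = {502, 503, 524}  # gateway, app-crash, and timeout codes seen in this report


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = send()
    for attempt in range(max_attempts - 1):
        if status not in RETRYABLE:
            break
        # Full jitter: wait a random time up to the capped exponential delay.
        delay = min(max_delay, base_delay * 2 ** attempt)
        sleep(random.uniform(0, delay))
        status, body = send()
    return status, body
```

Retrying is only safe if a repeated request cannot create duplicate records. Validation failures (400) are returned immediately, since resending the same payload would fail the same way.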

Patrol Log Create — 2,042 errors (87.4% success, DEGRADING)

403 — Duplicate log / invalid ID (1,735): "Patrol Log already exists" (1,587) is expected duplicate detection. "Cannot find PatrolGroupDateID" (148) occurs when the patrol doesn't exist in the central system.

502/524 — Server timeouts (~268): SLSA server can't process the request in time. Avg response 1.99s, max 126s.

503 — Application crash (39): Full SLSA app error page returned instead of JSON.

Trend: 91.5% (Nov) → 84.3% (Feb) → 80.6% (Mar)

This is the most concerning endpoint. Success rate has dropped about 11 percentage points between November (91.5%) and March (80.6%).

Recommendation:

  • Handle 403 duplicates gracefully (extract log ID from error)
  • Implement queue-based retry for 5xx errors
  • Escalate timeout issue to SLSA V2 API team
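Handling the 403 duplicate gracefully means reading the existing log ID out of the error message rather than showing a failure. A minimal sketch follows, based on the message format in the 403 table. The function name is hypothetical, and because IDs are masked as XXXXX in this report, the pattern accepts any run of characters up to the closing parenthesis.

```python
import json
import re
from typing import Optional

_LOG_ID = re.compile(r"Patrol Log already exists.*\(Patrol Log ID:\s*([^)\s]+)\)")


def existing_patrol_log_id(status: int, body: str) -> Optional[str]:
    """Return the existing patrol log ID from a 403 duplicate response, if present."""
    if status != 403:
        return None
    try:
        message = json.loads(body).get("message", "")
    except (ValueError, AttributeError):
        return None  # body was not a JSON object, e.g. an HTML error page
    if not isinstance(message, str):
        return None
    match = _LOG_ID.search(message)
    return match.group(1) if match else None
```

With the ID in hand, the app could open the existing patrol log instead of reporting an error. The other 403 on this endpoint ("Cannot find PatrolGroupDateID") returns None and still needs its own message.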

Sign-On / Sign-Off Errors — 1,663 events across 6 endpoints

Lifeguard sign-on (635 errors, 88.2%): Primary contact validation (548), already signed on (77).

Club sign-on (484, 95.0%): Already signed on (402), invalid beach combo (38).

Sign-off club/lifeguard (560): "Already Signed Off" — double sign-off attempts.

Root cause: The app doesn't check the current sign-on/off state before submitting to the API. Users can click sign-on twice, or try to sign off a patrol that's already signed off.

Recommendation:

  • Check status before sign-on/off API call
  • Disable button after first click (prevent double-submit)
  • Fix primary contact field validation for lifeguards
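The first two recommendations above (check state first, block double-submits) can be combined in one small guard. The sketch below is an assumption about how the client might track state; `ServiceState`, `SignOnGuard`, and the action names are hypothetical and not taken from the Operations App.

```python
from enum import Enum


class ServiceState(Enum):
    SIGNED_OFF = "signed_off"
    SIGNED_ON = "signed_on"


class SignOnGuard:
    """Decides whether a sign-on/sign-off request should be sent at all."""

    def __init__(self, state: ServiceState) -> None:
        self.state = state
        self.in_flight = False  # True while a request is pending

    def can_submit(self, action: str) -> bool:
        if action not in ("sign_on", "sign_off"):
            raise ValueError(f"unknown action: {action}")
        if self.in_flight:
            return False  # a second click while the first request is pending
        if action == "sign_on":
            return self.state is ServiceState.SIGNED_OFF
        return self.state is ServiceState.SIGNED_ON

    def begin(self, action: str) -> bool:
        """Mark a request as started; returns False if it should not be sent."""
        if not self.can_submit(action):
            return False
        self.in_flight = True
        return True

    def finish(self, action: str, succeeded: bool) -> None:
        """Clear the pending flag and update state after the API responds."""
        self.in_flight = False
        if succeeded:
            self.state = (
                ServiceState.SIGNED_ON if action == "sign_on" else ServiceState.SIGNED_OFF
            )
```

The local state can still be stale if another device signs the patrol on or off, so the "already signed on/off" 400 responses should still be handled as a fallback.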

Incident Reporting Errors — 620 events (94.8% success)

Validation failures by API message:

  • "location up to 50 characters only" (218)
  • "undefined is invalid injury nature id" (162), likely a bug
  • "firstSightedBy up to 60 characters only" (65)
  • "rescuer up to 20 characters only" (39)
  • "position up to 30 characters only" (38)
  • "vicAge must be a positive number" (37)

Root cause: Frontend text inputs don't enforce API field length limits. The "undefined injury nature id" error (162 occurrences) is likely a JavaScript bug sending undefined values.

Recommendation:

  • Add maxlength to all input fields matching API limits
  • Fix the undefined injury_nature_id bug
  • Validate numeric fields (vicAge, postalPostCode) before submit
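The checks recommended above can be expressed as one pre-submit validation pass. In this sketch the field names and limits for location, firstSightedBy, rescuer, position, and vicAge come from the API error messages in this report. `injuryNatureId` is a hypothetical field name, since the report only shows the resulting error text.

```python
from typing import Any, Dict, List

# Character limits taken from the API error messages in this report.
MAX_LENGTHS = {"location": 50, "firstSightedBy": 60, "rescuer": 20, "position": 30}


def validate_incident(form: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the form can be submitted."""
    problems: List[str] = []
    for name, limit in MAX_LENGTHS.items():
        value = form.get(name)
        if isinstance(value, str) and len(value) > limit:
            problems.append(f"{name} up to {limit} characters only")
    age = form.get("vicAge")
    if age is not None:
        is_number = isinstance(age, (int, float)) and not isinstance(age, bool)
        if not is_number or age <= 0:
            problems.append("vicAge must be a positive number")
    # Hypothetical field name: guards against sending an undefined injury nature.
    if form.get("injuryNatureId") is None:
        problems.append("injury nature must be selected")
    return problems
```

Reusing the API's own wording for the length messages keeps client and server errors consistent, though friendlier text may serve users better.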

Infrastructure Errors (5xx) — 1,089 events across all endpoints

502 Bad Gateway (331 on /patrol-log-members, 100 on /patrol-log/create, 68 on /radio-logs, others): SLSA upstream server not responding. Proxy returns generic 502.

524 Cloudflare Timeout (168 on /patrol-log/create, 33 on /patrol-log-members): Request exceeded Cloudflare's timeout threshold.

503 Application Error (104 on /patrol-log-members, 41 on /radio-logs, 39 on /patrol-log/create): Full HTML error page returned instead of JSON — SLSA app crash.

Impact: These are SLSA server-side issues, not Operations App bugs. However, the app should handle them gracefully.

Monthly 5xx trend:

Oct: 43 • Nov: 176 • Dec: 514 • Jan: 362 • Feb: 261

Recommendation: Implement retry with backoff for 5xx. Parse HTML error pages to detect app crashes vs gateway errors. Report 5xx rates to SLSA API team.
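Telling app crashes apart from gateway errors can be done from the response body alone, using the bodies recorded in the 5xx catalog. The sketch below is an assumption about how a client might do it. The category labels are invented for illustration, and the checks depend on the exact body formats observed in this report, which could change.

```python
import re

# Edge/proxy bodies look like: "error code: 502" (quotes may or may not be present).
EDGE_BODY = re.compile(r'"?error code: (\d{3})"?')


def classify_server_error(status: int, body: str) -> str:
    """Classify a 5xx response by the shape of its body."""
    text = body.strip()
    # Azure App Service crash page: HTML containing "Application Error".
    if "<h1" in text.lower() and "Application Error" in text:
        return "app_crash"
    match = EDGE_BODY.fullmatch(text)
    if match:
        labels = {"502": "bad_gateway", "524": "cloudflare_timeout"}
        return labels.get(match.group(1), "edge_error")
    # SLSA API errors arrive as a JSON object with a "message" field.
    if status == 500 and text.startswith("{"):
        return "api_error"
    return "unknown"
```

Tagging each 5xx this way would let the monthly report to the SLSA API team separate application crashes from gateway and timeout problems.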

Summary — Error Distribution by Root Cause
Root Cause | 6-Month Total | % of Errors | Actionable? | Who Fixes?
Failed logins (wrong credentials) | 15,969 | 44.0% | Expected | User education / UX
Unsigned patrol status check (404) | 4,544 | 12.5% | Expected | Handle 404 as "not signed on"
SLSA server errors (502/503/524) | 1,089 | 3.0% | Infra | SLSA API team
Duplicate state (already signed on/off) | 1,005 | 2.8% | Check First | Nano: pre-check state
Patrol log duplicates/not found | 900 | 2.5% | Handle | Nano: handle 403 gracefully
Position limits exceeded | 755 | 2.1% | Prevent | Nano: frontend validation
Password change required | 586 | 1.6% | Expected | User redirect to SLS Hub
Validation failures (field length etc) | 559 | 1.5% | Prevent | Nano: input validation
Lifeguard primary contact validation | 548 | 1.5% | Prevent | Nano: fix form field
Time validation errors | 401 | 1.1% | Fixed | Resolved (Clifton Beach)
TOTAL ERRORS | 36,281 | 100% | 56.5% expected/user-error • 43.5% actionable
Bottom line: Of 36,281 total errors, 56.5% are expected (failed logins, unsigned patrols). The remaining 43.5% (15,800 errors) are actionable — split between Nano fixes (frontend validation, state checking) and SLSA infrastructure issues (5xx errors).
Deep Insights — What the Data Tells Us

Finding 1: January 10 Cascade Failure — 167 Server Errors in 2 Hours

Between 9–11pm on Jan 10, the SLSA API buckled — 63 Cloudflare timeouts on /patrol-log/create, then 502s cascaded to /patrol-log-members and /radio-logs. 135 unique users affected. This was a precursor to the larger Jan 17-18 incident.

Recommendation: Circuit breaker — if 5xx rate exceeds 5% in 5 minutes, queue requests and retry with backoff.
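A circuit breaker along the lines of this recommendation might look like the sketch below: it opens when the 5xx rate over the last 5 minutes exceeds 5%, and the caller queues requests while it is open. The class name, the 60-second cooldown, and the minimum-sample floor are assumptions added for illustration.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class CircuitBreaker:
    """Opens when the 5xx rate in the recent window exceeds the threshold."""

    def __init__(
        self,
        threshold: float = 0.05,  # 5% from the recommendation
        window: float = 300.0,    # 5 minutes from the recommendation
        cooldown: float = 60.0,   # hypothetical pause before probing again
        min_calls: int = 20,      # hypothetical floor for a meaningful rate
    ) -> None:
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown
        self.min_calls = min_calls
        self.results: Deque[Tuple[float, bool]] = deque()  # (timestamp, was_5xx)
        self.opened_at: Optional[float] = None

    def record(self, ts: float, status: int) -> None:
        self.results.append((ts, 500 <= status <= 599))
        while self.results and self.results[0][0] <= ts - self.window:
            self.results.popleft()
        failures = sum(1 for _, failed in self.results if failed)
        if len(self.results) >= self.min_calls and failures / len(self.results) > self.threshold:
            self.opened_at = ts

    def allow(self, ts: float) -> bool:
        """False while open: the caller should queue the request instead of sending."""
        if self.opened_at is None:
            return True
        if ts - self.opened_at >= self.cooldown:
            # Cooldown elapsed: reset and let traffic probe the API again.
            self.opened_at = None
            self.results.clear()
            return True
        return False
```

Queued requests can then be replayed with backoff once `allow` returns True again.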

Finding 2: "Repeat Offender" Users — Top User Hit 246 Errors in 13 Days

The top 15 users account for 885 errors (28% of user-triggered errors). These are patrol captains retrying the same failed action — the app doesn't explain what went wrong, so they click again hoping it'll work.

Recommendation: After 3 consecutive errors, show contextual help. Track error-per-user rates for proactive club outreach.
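The "after 3 consecutive errors" trigger can be tracked with a per-user, per-endpoint streak counter. This is a sketch under assumptions: the class name and the choice to key streaks by (user, endpoint) are illustrative, and the report does not say how errors are attributed on the client.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Tuple

HELP_AFTER = 3  # consecutive errors before showing contextual help


class ConsecutiveErrorTracker:
    """Counts back-to-back failures for each (user, endpoint) pair."""

    def __init__(self) -> None:
        self._streaks: DefaultDict[Tuple[Hashable, str], int] = defaultdict(int)

    def record(self, user_id: Hashable, endpoint: str, ok: bool) -> bool:
        """Record one response; returns True when help should be shown."""
        key = (user_id, endpoint)
        if ok:
            self._streaks.pop(key, None)  # any success resets the streak
            return False
        self._streaks[key] += 1
        return self._streaks[key] >= HELP_AFTER
```

The same counters, aggregated per club, could feed the error-per-user tracking suggested for proactive outreach.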

Finding 3: Errors Are Slow — Users Wait 20 Seconds for Nothing

/patrol-log/create: 19.5s error vs 0.4s success (49x slower)
/patrol-team: 13.9s error vs 0.3s success (46x slower)
/patrol-log-members: 0.7s error vs 0.2s success (3.5x slower)

Recommendation: Client-side timeout at 10s with "Still working..." message. Validate before sending to catch 400s at 0ms.

Finding 4: Prevention Roadmap — 17.1% Reduction With Small Effort

Strategy | Errors Prevented | % | Effort | Priority
Handle duplicates gracefully | 1,587 | 4.4% | Small | HIGH
Retry 5xx with backoff | 1,312 | 3.6% | Medium | HIGH
Check state before API call | 1,119 | 3.1% | Small | HIGH
Frontend limit enforcement | 755 | 2.1% | Small | MED
Input validation (maxlength) | 721 | 2.0% | Small | MED
Fix primary contact dropdown | 548 | 1.5% | Tiny | MED
Fix undefined injury JS bug | 162 | 0.4% | Tiny | HIGH
TOTAL PREVENTABLE | 6,204 | 17.1% (6,204 of 36,281) | Rest: expected 56.5% + SLSA infra 3.6%

Finding 5: Sign-On/Off Rush Creates Predictable Spikes

Evening Sign-On (9–11pm)

Double-clicks, position limits, timeouts. Fix: Disable button after click, show position counts, loading timeout message.

Morning Sign-Off (11am–12pm)

"Already signed off" double-clicks, missing location. Fix: Disable after click, auto-populate location from GPS.

Finding 6: SLSA Azure Infrastructure Is the Weakest Link

1,089 server errors (3%) from SLSA's Azure API. Key issues: 503 app crashes return HTML (detect and handle), 502 bad gateway (retry twice), 524 Cloudflare timeouts (client-side timeout at 15s), "Too many connections" (17×, MySQL pool), "SLS-HUB-PROD not available" (1×, full DB outage).

Recommendation: Monthly infra report to SLSA API team. Advocate for: higher connection pool, Azure scaling rules, Cloudflare timeout increase.