SLSA API Error Analysis Dashboard

Priority	Action Item	Expected Impact
HIGH	Continue database query optimization and indexing strategy. Monitor slow query log and add indexes proactively for new features.	Prevent repeat of Jan 17-18 outage pattern
HIGH	Implement rate limiting and circuit breakers for /sso/auth endpoint (59.2% success rate is critically low).	Reduce auth cascade failures, improve user experience
HIGH	Add automated alerting for error rate spikes above 3% on any endpoint within a 15-minute window.	Faster incident detection and response times
MED	Review /patrol-log/create validation logic. Success rate declined from 91.5% (Nov) to 80.6% (Mar).	Arrest declining patrol log creation reliability
MED	Implement client-side validation for "signed on elsewhere" and "duplicate patrol log" errors to reduce unnecessary API calls.	Reduce 40% of position-related errors at source
MED	Add structured error tracking per club to enable proactive outreach for clubs with persistent issues.	Reduce support burden, improve club satisfaction
LOW	Investigate 502/524 timeout errors — consider increasing upstream timeouts or adding request queuing for heavy operations.	Reduce intermittent gateway errors during peak load
LOW	Implement API versioning strategy to support backward-compatible changes without breaking existing integrations.	Safer deployments, reduced regression risk
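
The alerting action item (error rate above 3% on any endpoint within a 15-minute window) could be evaluated along these lines. This is a minimal sketch, not the dashboard's implementation: the `ApiCallSample` shape, the function name, and the `minCalls` guard are assumptions made for illustration.

```typescript
// One logged API call. Field names are illustrative, not the real log schema.
interface ApiCallSample {
  endpoint: string;
  statusCode: number;
  timestampMs: number; // epoch milliseconds
}

const ALERT_WINDOW_MS = 15 * 60 * 1000; // 15-minute window from the action item
const ALERT_ERROR_RATE = 0.03; // alert when more than 3% of calls fail

// Returns the endpoints whose error rate (status >= 400) exceeds the threshold
// within the window ending at nowMs. minCalls is an assumed guard so that one
// failure on a nearly idle endpoint does not trigger an alert.
function endpointsOverErrorThreshold(
  samples: ApiCallSample[],
  nowMs: number,
  minCalls: number = 20,
): string[] {
  const counts: { [endpoint: string]: { total: number; errors: number } } = {};
  for (const sample of samples) {
    const ageMs = nowMs - sample.timestampMs;
    if (ageMs < 0 || ageMs >= ALERT_WINDOW_MS) continue; // outside the window
    const entry = counts[sample.endpoint] ?? { total: 0, errors: 0 };
    entry.total += 1;
    if (sample.statusCode >= 400) entry.errors += 1;
    counts[sample.endpoint] = entry;
  }
  return Object.keys(counts)
    .filter((endpoint) => {
      const entry = counts[endpoint];
      return (
        entry !== undefined &&
        entry.total >= minCalls &&
        entry.errors / entry.total > ALERT_ERROR_RATE
      );
    })
    .sort();
}
```

Endpoints such as /sso/auth, where most 404s are expected, would probably need their own threshold or an allowlist of expected messages; otherwise the alert would fire constantly.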

Who Triggers the Errors? (January 2026 sample)

Using user_id presence to distinguish user-triggered vs system/cron errors.

Cron / System: 66% (6,336 errors, no user_id)
User-Triggered: 34% (3,203 errors, has user_id)

Cron errors (66%): Mostly expected — failed logins during member sync (4,602), unsigned patrol status checks (1,328), expired tokens. Users never see these errors.
User errors (34%): Directly impact the user experience — position limits (886), patrol log duplicates (637), sign-on validation (292+85), incidents (235). These need better error messages.
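
The user_id split described above can be sketched as follows. The `LoggedApiError` shape and the function names are hypothetical; only the rule (no user_id means cron/system) and the January counts come from the dashboard.

```typescript
// A logged error. userId is null when the request came from a cron/system job.
// Field names are illustrative.
interface LoggedApiError {
  endpoint: string;
  statusCode: number;
  userId: string | null;
}

interface SourceShares {
  userCount: number;
  systemCount: number;
  userPercent: number;
  systemPercent: number;
}

// Converts two counts into whole-number percentage shares.
function sourceShares(userCount: number, systemCount: number): SourceShares {
  const total = userCount + systemCount;
  const percent = (n: number): number =>
    total === 0 ? 0 : Math.round((n / total) * 100);
  return {
    userCount,
    systemCount,
    userPercent: percent(userCount),
    systemPercent: percent(systemCount),
  };
}

// Splits errors by user_id presence, the same rule the dashboard applies.
function splitErrorsBySource(errors: LoggedApiError[]): SourceShares {
  const userCount = errors.filter(
    (e) => e.userId !== null && e.userId !== "",
  ).length;
  return sourceShares(userCount, errors.length - userCount);
}
```

With the January counts, `sourceShares(3203, 6336)` gives 34% user-triggered and 66% cron/system.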

What Users See vs What Runs Silently

Source	Endpoint	Code	Errors	User Impact
User-Triggered — Users see these errors
User	`/patrol-log-members`	400	886	Position limit / signed elsewhere / time error
User	`/patrol-log/create`	403	637	"Patrol log already exists" — confusing for user
User	`/sign-on-lifeguard`	400	292	"Primary contact required" — form incomplete
User	`/incident`	400	235	Field too long / undefined injury — form error
User	`/patrol-log-members`	404	223	"Member not found" — stale data shown to user
User	`/patrol-team`	400	132	Time conflict / duplicate patrol time
User	`/patrol-log/create`	524	100	Timeout — user sees spinner then generic error
User	`/sign-on-club`	400	85	"Already signed on" — double-click
Cron/System — Users never see these
Cron	`/sso/auth`	404	4,602	Silent — member sync login attempts
Cron	`/service-status-details`	404	1,328	Silent — checking unsigned patrols
Cron	`/sso/auth`	403	150	Silent — password change needed
Cron	`/sign-off-lifeguard`	400	114	Silent — auto sign-off attempts
Cron	`/patrol-log/check-exists`	403	47	Silent — patrol log validation

When each error pattern was first and last observed. Span is the number of days between the first and last observation, and Status shows whether the pattern is still occurring (dashboard colors: green = resolved, red = still active).

Error Pattern	Endpoint	Code	Count	First Seen	Last Seen	Span	Status
Position limit: 3	`/patrol-log-members`	400	669	2025-10-03	2026-02-28	148d	ACTIVE
Primary contact validation	`/sign-on-lifeguard`	400	548	2025-09-30	2026-02-26	149d	ACTIVE
Signed on elsewhere	`/patrol-log-members`	400	519	2025-09-30	2026-02-28	151d	ACTIVE
Already signed on (club)	`/sign-on-club`	400	402	2025-10-03	2026-02-28	148d	ACTIVE
Time validation	`/patrol-log-members`	400	401	2025-10-04	2026-01-21	109d	FIXED
Already signed off (LG)	`/sign-off-lifeguard`	400	287	2025-10-02	2026-02-28	149d	ACTIVE
Invalid unitId	`/unit-status-details`	400	235	2025-11-01	2026-02-27	118d	ACTIVE
Location too long	`/incident`	400	218	2025-11-04	2026-02-16	104d	ACTIVE
Invalid token (SSO verify)	`/sso/verify`	404	186	2025-10-01	2025-11-18	48d	GONE
524 Cloudflare Timeout	`/patrol-log/create`	524	168	2026-01-10	2026-02-21	42d	NEW
Duplicate patrol time	`/patrol-team`	400	165	2025-12-19	2026-02-22	65d	NEW
Undefined injury nature	`/incident`	400	162	2025-11-01	2026-02-26	117d	BUG
Invalid beach combo	`/sign-on-club`	400	38	2025-12-06	2025-12-14	8d	GONE
524 Cloudflare on patrol-team	`/patrol-team`	524	16	2026-01-10	2026-01-10	1d	1-DAY
TDS Protocol Error	`/sso/auth`	500	22	2025-11-26	2025-11-26	1d	1-DAY
DB Unavailable (SLS-HUB-PROD)	`/sso/auth`	500	1	2025-12-07	2025-12-07	1d	1-DAY
Server Shutdown	`/radio-logs`	500	1	2025-11-10	2025-11-10	1d	1-DAY

Key observations:

Time validation: fixed after Jan 21 (Clifton Beach resolved).
524 Cloudflare timeouts: appeared Jan 10; new issue, still active.
TDS Protocol, DB unavailable, Server Shutdown: single-day incidents on the SLSA side.
Undefined injury nature: persistent JS bug, active for 117 days; needs a fix.
Invalid beach combo: appeared Dec 6, gone by Dec 14 (8-day issue, likely a config fix).
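
The Span column can be reproduced from the First Seen and Last Seen dates. The sketch below assumes the table's apparent convention that an incident seen on a single day is shown as 1d; the function names are illustrative.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Parses "YYYY-MM-DD" as a UTC date so daylight-saving shifts cannot skew the count.
function parseIsoDateUtc(isoDate: string): number {
  const parts = isoDate.split("-");
  return Date.UTC(Number(parts[0]), Number(parts[1]) - 1, Number(parts[2]));
}

// Days between first and last observation. Same-day incidents are reported as 1,
// matching the "1d" rows in the table (an assumed convention).
function spanDays(firstSeen: string, lastSeen: string): number {
  const days = Math.round(
    (parseIsoDateUtc(lastSeen) - parseIsoDateUtc(firstSeen)) / MS_PER_DAY,
  );
  return Math.max(days, 1);
}
```

For example, `spanDays("2025-10-03", "2026-02-28")` returns 148, matching the "Position limit: 3" row.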

400 Bad Request — Validation Failures (6,058 total across 6 months)

Client sent invalid data. Most are preventable with frontend validation.

Endpoint	Actual API Response Body	Count	Fix
`/patrol-log-members`	`{"statusCode":400,"message":"Maximum patrol position has reached its limit: 3"}`	669	Enforce limit in UI
`/sign-on-lifeguard`	`{"statusCode":400,"message":["Primary contact must be one of: Radio, Mobile, Landline, SMR, or None.","Primary contact is required."]}`	548	Fix dropdown
`/patrol-log-members`	`{"statusCode":400,"message":"This member is currently signed on to another patrol (XXXXX)"}`	519	Check before add
`/sign-on-club`	`{"statusCode":400,"message":"Club is already signed on. Please sign off."}`	402	Check state
`/patrol-log-members`	`{"statusCode":400,"message":["FinishTime is either missing or not after StartTime","StartTime is either missing or not before FinishTime"]}`	401	Validate times
`/sign-off-lifeguard`	`{"statusCode":400,"message":"ServiceId is already Signed Off."}`	287	Disable re-click
`/sign-off-club`	`{"statusCode":400,"message":"ServiceId is already Signed Off."}`	239	Disable re-click
`/unit-status-details`	`{"statusCode":400,"message":"Invalid unitId not type Unit"}`	235	Validate unit type
`/incident`	`{"statusCode":400,"message":["location up to 50 characters only"]}`	218	maxlength=50
`/incident`	`{"statusCode":400,"message":"undefined is invalid injury nature id"}`	162	JS BUG
`/sign-on-support`	`{"statusCode":400,"message":["Odometer must be a valid integer."]}`	114	Validate number
`/patrol-log-members`	`{"statusCode":400,"message":"Member ID XXXXX does not exist for substitution."}`	111	Validate member
`/sign-on-lifeguard`	`{"statusCode":400,"message":"Lifeguard is already signed on. Please sign off."}`	77	Check state
`/incident`	`{"statusCode":400,"message":["firstSightedBy up to 60 characters only"]}`	65	maxlength=60
`/update-support`	`{"statusCode":400,"message":"Arrival time must be after the current datetime"}`	65	Validate datetime
`/radio-logs`	`{"statusCode":400,"message":["orgIds must be a number, comma-separated integers"]}`	59	Fix param format
`/patrol-log-members`	`{"statusCode":400,"message":"Maximum patrol position has reached its limit: 1"}`	54	Enforce in UI
`/rescues/add`	`{"statusCode":400,"message":["postalPostCode must be an integer","must be a positive number"]}`	52	Validate postcode
`/sign-off-support`	`{"statusCode":400,"message":["Current location is a mandatory field."]}`	40	Require field
`/incident`	`{"statusCode":400,"message":["rescuer up to 20 characters only"]}`	39	maxlength=20
`/sign-on-club`	`{"statusCode":400,"message":"Invalid Service Club Beach combination"}`	38	Validate beach
`/incident`	`{"statusCode":400,"message":["position up to 30 characters only"]}`	38	maxlength=30
`/incident`	`{"statusCode":400,"message":["vicAge must be a positive number"]}`	37	Validate number
`/sign-off-unit`	`{"statusCode":400,"message":"Unit is already Sign Off"}`	36	Disable re-click
`/patrol-log-members`	`{"statusCode":400,"message":"Maximum patrol position has reached its limit: 4"}`	32	Enforce in UI
`/sign-on-unit`	`{"statusCode":400,"message":"Unit is already signed on. Please sign off."}`	30	Check state
`/patrol-team`	`{"statusCode":400,"message":["FinishTime is either missing or not after StartTime"]}`	26	Validate times
`/patrol-team`	`{"statusCode":400,"message":"There is already a Patrol Group date with the time you selected."}`	21	Check conflicts
`/sso/auth`	`{"statusCode":403,"message":"The password you entered is too long"}`	20	Add maxlength
`/update-lifeguard`	`{"statusCode":400,"message":"ServiceId is already Signed Off. Cannot Update"}`	17	Check state
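
In the response bodies above, `message` is sometimes a single string and sometimes an array of strings, so client code that displays these errors has to handle both. A minimal sketch of a normalizer (the function name is hypothetical):

```typescript
// Returns the API error messages from a raw response body as a flat list.
// Handles both shapes seen in the table: "message": "..." and "message": ["...", "..."].
// Anything that is not a JSON object with a usable message yields an empty list.
function errorMessages(rawBody: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return []; // not JSON, e.g. an HTML error page
  }
  if (typeof parsed !== "object" || parsed === null) return [];
  const message = (parsed as { message?: unknown }).message;
  if (typeof message === "string") return [message];
  if (Array.isArray(message)) {
    return message.filter((m): m is string => typeof m === "string");
  }
  return [];
}
```

Returning an empty list for non-JSON bodies lets the caller fall back to a generic message for gateway and crash responses.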

403 Forbidden — Access Denied / Duplicates (2,823 total)

Endpoint	Actual API Response Body	Count
`/patrol-log/create`	`{"statusCode":403,"message":"Patrol Log already exists for this Patrol Group Date. (Patrol Log ID: XXXXX)"}`	1,587
`/sso/auth`	`{"statusCode":403,"message":"Please login to the SLS Hub and change your password."}`	586
`/patrol-log/check-exists`	`{"statusCode":403,"message":"Cannot find PatrolGroupDateID: XXXXX"}`	282
`/patrol-log/create`	`{"statusCode":403,"message":"Cannot find PatrolGroupDateID: XXXXX"}`	148
`/sso/auth`	`{"statusCode":403,"message":"The password you entered is too long"}`	20

404 Not Found — Resource Missing (21,721 total)

Endpoint	Actual API Response Body	Count	Expected?
`/sso/auth`	`{"statusCode":404,"message":"We couldn't find an account with the Username and Password provided"}`	15,969	Yes — wrong password
`/service-status-details`	`{"statusCode":404,"message":"No available logClubs found"}`	4,488	Yes — not signed on
`/patrol-log-members`	`{"statusCode":404,"message":"Cannot find MemberID: XXXXX"}`	275	No — stale data
`/patrol-log-members`	`{"statusCode":404,"message":"Member ID XXXXX was not found in the patrol log with ID XXXXX."}`	261	No — sync issue
`/patrol-log-members`	`{"statusCode":404,"message":"Member ID XXXXX was not found in patrol log XXXXX under position ID XXXXX."}`	222	No — sync issue
`/sso/verify`	`{"statusCode":404,"message":"Client not found or invalid access token"}`	186	Partial — token expiry
`/service-status-details`	`{"statusCode":404,"message":"No available log lifeguards found"}`	56	Yes — not signed on
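
The "Expected?" column suggests that some 404s are normal states rather than failures. One way to encode that is a small lookup keyed on endpoint and message prefix. The structure and names below are assumptions; the endpoints and message texts are taken from the rows marked "Yes".

```typescript
// 404 responses that the table marks as expected.
const EXPECTED_NOT_FOUND: { endpoint: string; messagePrefix: string }[] = [
  { endpoint: "/sso/auth", messagePrefix: "We couldn't find an account" },
  { endpoint: "/service-status-details", messagePrefix: "No available logClubs found" },
  { endpoint: "/service-status-details", messagePrefix: "No available log lifeguards found" },
];

// True when a 404 should be treated as a normal outcome (wrong password,
// not signed on) instead of being logged as an error.
function isExpectedNotFound(endpoint: string, message: string): boolean {
  for (const rule of EXPECTED_NOT_FOUND) {
    if (rule.endpoint === endpoint && message.indexOf(rule.messagePrefix) === 0) {
      return true;
    }
  }
  return false;
}
```

A caller could map an expected 404 on /service-status-details to a "not signed on" state and reserve error reporting for the stale-data and sync cases.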

5xx Server Errors — SLSA Infrastructure (1,089 total — Azure: slsaapi.scm.azurewebsites.net)

Code	Endpoint	Actual Response Body	Count
500 Internal Server Error — SLSA code bugs & DB issues (66 total)
500	`/sso/auth`	`{"message":"Internal server error3 - Error: The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 3 (\"@0\"): Data type 0xE7 has an invalid data length"}`	22
500	`/incident/all`	`{"message":"Internal server error3 - ER_CON_COUNT_ERROR: Too many connections"}`	17
500	`/sign-off-lifeguard`	`{"message":"Internal server error3 - connect ETIMEDOUT"}`	4
500	`/sign-off-club`	`{"message":"Internal server error3 - Cannot read properties of undefined (reading 'PatrolDate')"}`	2
500	`/update-lifeguard`	`{"message":"Internal server error3 - Cannot read properties of null (reading 'beach_name')"}`	2
500	`/sso/auth`	`{"message":"Internal server error3 - Database 'SLS-HUB-PROD' on server 'sls-p-apidb-01' is not currently available."}`	1
500	`/radio-logs`	`{"message":"Internal server error3 - ER_SERVER_SHUTDOWN: Server shutdown in progress"}`	1
502 Bad Gateway — Upstream not responding (741 total, 17 endpoints)
502	`/patrol-log-members`	`"error code: 502"`	331
502	`/patrol-log/create`	`"error code: 502"`	100
502	`/radio-logs`	`"error code: 502"`	68
502	`/patrol-log-stats/update`	`"error code: 502"`	41
502	`/gear/entity-assets`	`"error code: 502"`	38
502	`/sso/auth`	`"error code: 502"`	30
502	+ 11 more endpoints (incident/all 20, units-for-entity 18, season-dates 16, sign-on-club 14, update-club 13, sign-off-club 9, ...)		113
503 App Crash — Azure returns HTML error page (322 total, 20 endpoints)
503	(all 20 endpoints)	`<h1 style="color: 747474">:( Application Error</h1><p>If you are the application administrator, you can access the <a href="https://slsaapi.scm.azurewebsites.net/detectors">diagnostic resources</a>.</p>`	322
	Top affected: /patrol-log-members (104), /radio-logs (41), /patrol-log/create (39), /sign-on-club (30), /service-status-details (30), /sso/auth (18), /patrol-log/check-exists (15), /patrol-log-stats (14), /unit-status-details (7), + 11 more
524 Cloudflare Timeout (226 total)
524	`/patrol-log/create`	`"error code: 524"`	168
524	`/patrol-log-members`	`"error code: 524"`	33
524	`/patrol-team`	`"error code: 524"`	16
524	+ /sso/auth (4), /patrols/roster (3), /incident/all (1), /radio-logs (1)		9

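5xx Key Findings:

SLSA API runs on Azure App Service (the crash page leaks slsaapi.scm.azurewebsites.net).
"Too many connections" (17x): ER_CON_COUNT_ERROR means the MySQL server hit its connection limit, which is consistent with connection pool exhaustion.
TDS protocol error (22x): SQL Server rejected a query parameter for an invalid data length, i.e. a parameter type mismatch.
"Server shutdown in progress" (1x): the database restarted during a query.
"SLS-HUB-PROD not available" (1x): production DB outage.
Null reference bugs: reading 'PatrolDate' of undefined and 'beach_name' of null are unguarded property accesses in SLSA's JavaScript code.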
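
The three kinds of 5xx body in the table above (Azure's HTML crash page, the bare "error code: NNN" gateway string, and JSON from the API itself) can be told apart without relying on the status code alone. This is a sketch under the assumption that those three shapes are the only ones worth distinguishing; the type and function names are hypothetical.

```typescript
type UpstreamFailureKind =
  | "azure-app-crash" // HTML "Application Error" page (503)
  | "gateway-error" // "error code: 502" / "error code: 524"
  | "api-json-error" // JSON body produced by the API (500)
  | "unknown";

// Classifies a failed response by the shape of its body.
function classifyFailureBody(rawBody: string): UpstreamFailureKind {
  const body = rawBody.trim();
  if (body.charAt(0) === "<" && body.indexOf("Application Error") !== -1) {
    return "azure-app-crash";
  }
  if (/^"?error code: \d{3}"?$/.test(body)) {
    return "gateway-error";
  }
  if (body.charAt(0) === "{") {
    try {
      JSON.parse(body);
      return "api-json-error";
    } catch {
      return "unknown"; // looked like JSON but did not parse
    }
  }
  return "unknown";
}
```

The classification can then drive both the user-facing message and the label attached to each error in monitoring.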

Authentication Errors — 16,741 events (46% of all errors)

404 — Invalid credentials (15,969): Users entering wrong username/password. This is expected user-input error, not a system bug. Rate is stable at ~40% of auth attempts.

403 — Password change required (586): SLSA Hub requires password update. Users redirected to hub.sls.com.au.

404 — Token expired (186): SSO verify fails when token has expired between auth and verify calls.

Impact: Users see "login failed" with no data loss. But the volume (16,741 events over 6 months, roughly 2.8K/month) suggests the UX could guide users better.

Recommendation: Add "Forgot password?" link, rate limit after 5 failures, show SLS Hub redirect for 403 errors.
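
The "rate limit after 5 failures" part of this recommendation could look like the client-side limiter below. It is a sketch only: the class name and the 60-second cool-down are assumptions, and a client-side limiter complements server-side rate limiting on /sso/auth but does not replace it.

```typescript
// Blocks further login attempts after a run of consecutive failures until a
// cool-down has passed. Time is passed in explicitly so the logic is testable.
class LoginAttemptLimiter {
  private consecutiveFailures = 0;
  private lastFailureMs = 0;
  private readonly maxFailures: number;
  private readonly coolDownMs: number;

  constructor(maxFailures: number = 5, coolDownMs: number = 60000) {
    this.maxFailures = maxFailures;
    this.coolDownMs = coolDownMs; // assumed value, not from the dashboard
  }

  // True when the login form may submit another attempt.
  canAttempt(nowMs: number): boolean {
    if (this.consecutiveFailures < this.maxFailures) return true;
    return nowMs - this.lastFailureMs >= this.coolDownMs;
  }

  recordFailure(nowMs: number): void {
    this.consecutiveFailures += 1;
    this.lastFailureMs = nowMs;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }
}
```

While `canAttempt` returns false, the login form can show the "Forgot password?" link instead of sending another request.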

Patrol Position Errors — 4,197 events on /patrol-log-members

400 — Position limit reached (755): Attempting to add more members than the position allows (e.g., max 3 IRB Drivers). Frontend doesn't enforce limits.

400 — Time validation (401): FinishTime missing or before StartTime. Mostly Clifton Beach Dec 2025 (resolved).

502/503/524 — Server errors (468): SLSA infrastructure failures. 502 Bad Gateway (331), 503 Application Error (104), 524 Cloudflare timeout (33).

Impact: Member not synced to SLSA central. Local roster shows the member but central doesn't know.

Recommendation:

Fetch position limits and validate in frontend
Add retry with exponential backoff for 5xx errors
Validate time fields before API submission
Show specific error message to user, not generic failure
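
Two of the checks above (position limits and time fields) can run entirely in the frontend before any request is sent. The sketch below assumes the app already knows each position's limit and current assignment count; the `PositionSlot` shape and function names are hypothetical.

```typescript
// A patrol position with its limit, e.g. at most 3 IRB Drivers.
interface PositionSlot {
  positionName: string;
  limit: number;
  assignedCount: number;
}

// True when another member can still be added to this position.
function canAddMember(slot: PositionSlot): boolean {
  return slot.assignedCount < slot.limit;
}

// Returns a user-facing error for the time fields, or null when they are valid.
// Mirrors the API rule that FinishTime must be present and after StartTime.
function timeRangeError(
  startTime: string | null,
  finishTime: string | null,
): string | null {
  if (!startTime || !finishTime) {
    return "Start and finish times are both required.";
  }
  const start = Date.parse(startTime);
  const finish = Date.parse(finishTime);
  if (isNaN(start) || isNaN(finish)) {
    return "Start or finish time is not a valid date.";
  }
  if (finish <= start) {
    return "Finish time must be after start time.";
  }
  return null;
}
```

The server keeps its own validation; these checks only stop the request early and give the user a specific message.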

Patrol Log Create — 2,042 errors (87.4% success, DEGRADING)

403 — Duplicate log / invalid ID (~900): "Patrol Log already exists" — expected duplicate detection. Also "Cannot find PatrolGroupDateID" when patrol doesn't exist in central system.

502/524 — Server timeouts (~268): SLSA server can't process the request in time. Avg response 1.99s, max 126s.

503 — Application crash (39): Full SLSA app error page returned instead of JSON.

Trend: 91.5% (Nov) → 84.3% (Feb) → 80.6% (Mar)

This is the most concerning endpoint. Success rate has dropped 11 percentage points over 5 months.

Recommendation:

Handle 403 duplicates gracefully (extract log ID from error)
Implement queue-based retry for 5xx errors
Escalate timeout issue to SLSA V2 API team
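
Handling the 403 duplicate gracefully means reading the existing log's ID out of the error message, whose format appears in the 403 table. In this sketch the function name is hypothetical, and because the ID is masked as XXXXX in the data, it is treated as an opaque token without assuming it is numeric.

```typescript
// Extracts the existing patrol log ID from a 403 "already exists" message.
// Returns null for any other status or message.
function existingPatrolLogId(statusCode: number, message: string): string | null {
  if (statusCode !== 403) return null;
  const match = /Patrol Log already exists.*\(Patrol Log ID:\s*([^)\s]+)\)/.exec(message);
  return match && match[1] ? match[1] : null;
}
```

When an ID is returned, the app can open or reuse that patrol log instead of showing an error.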

Sign-On / Sign-Off Errors — 1,663 events across 6 endpoints

Lifeguard sign-on (635 errors, 88.2% success): Primary contact validation (548), already signed on (77).

Club sign-on (484 errors, 95.0% success): Already signed on (402), invalid beach combo (38).

Sign-off club/lifeguard (560): "Already Signed Off" — double sign-off attempts.

Root cause: The app doesn't check the current sign-on/off state before submitting to the API. Users can click sign-on twice, or try to sign off a patrol that's already signed off.

Recommendation:

Check status before sign-on/off API call
Disable button after first click (prevent double-submit)
Fix primary contact field validation for lifeguards
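
The three recommendations can be sketched as small, independent helpers. All names are hypothetical; the primary contact options are the ones listed in the API's own 400 message.

```typescript
type SignOnState = "signed-on" | "signed-off";
type SignAction = "sign-on" | "sign-off";

// Only send the request when it would actually change the state.
function shouldSendSignRequest(current: SignOnState, action: SignAction): boolean {
  return (
    (action === "sign-on" && current === "signed-off") ||
    (action === "sign-off" && current === "signed-on")
  );
}

// Guards against double-submit: the second click is ignored while a request is in flight.
class SubmitGuard {
  private inFlight = false;

  tryBegin(): boolean {
    if (this.inFlight) return false;
    this.inFlight = true;
    return true;
  }

  finish(): void {
    this.inFlight = false;
  }
}

// Values the API accepts for a lifeguard's primary contact.
const PRIMARY_CONTACT_OPTIONS = ["Radio", "Mobile", "Landline", "SMR", "None"];

function isValidPrimaryContact(value: string | undefined): boolean {
  return value !== undefined && PRIMARY_CONTACT_OPTIONS.indexOf(value) !== -1;
}
```

A button handler would call `tryBegin()` first, return early if it yields false, and call `finish()` once the request settles.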

Incident Reporting Errors — 620 events (94.8% success)

Validation errors by message:

"location up to 50 characters only" (218)
"undefined is invalid injury nature id" (162), likely a bug
"firstSightedBy up to 60 characters only" (65)
"rescuer up to 20 characters only" (39)
"position up to 30 characters only" (38)
"vicAge must be a positive number" (37)

Root cause: Frontend text inputs don't enforce API field length limits. The "undefined injury nature id" error (162 occurrences) is likely a JavaScript bug sending undefined values.

Recommendation:

Add maxlength to all input fields matching API limits
Fix the undefined injury_nature_id bug
Validate numeric fields (vicAge, postalPostCode) before submit
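
A pre-submit check for the incident form might look like the sketch below. The length limits come from the API error messages; the `IncidentDraft` shape and the `injuryNatureId` field name are assumptions for illustration, since the real form model is not shown.

```typescript
// Subset of the incident form relevant to the errors above (illustrative shape).
interface IncidentDraft {
  location?: string;
  firstSightedBy?: string;
  rescuer?: string;
  position?: string;
  vicAge?: number;
  injuryNatureId?: number; // hypothetical name for the injury nature selection
}

// Returns the validation errors that would otherwise come back as 400s.
function incidentDraftErrors(draft: IncidentDraft): string[] {
  const errors: string[] = [];
  // Maximum lengths as stated in the API error messages.
  const textFields: { field: string; value: string | undefined; max: number }[] = [
    { field: "location", value: draft.location, max: 50 },
    { field: "firstSightedBy", value: draft.firstSightedBy, max: 60 },
    { field: "rescuer", value: draft.rescuer, max: 20 },
    { field: "position", value: draft.position, max: 30 },
  ];
  for (const f of textFields) {
    if (f.value !== undefined && f.value.length > f.max) {
      errors.push(f.field + " up to " + f.max + " characters only");
    }
  }
  if (draft.vicAge !== undefined && !(isFinite(draft.vicAge) && draft.vicAge > 0)) {
    errors.push("vicAge must be a positive number");
  }
  // Catches the case that currently reaches the API as "undefined".
  if (draft.injuryNatureId === undefined) {
    errors.push("injury nature must be selected");
  }
  return errors;
}
```

Setting `maxlength` on the inputs prevents the length errors at typing time; a check like this remains useful for pasted or programmatically set values.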

Infrastructure Errors (5xx) — 1,089 events across all endpoints

502 Bad Gateway (331 on /patrol-log-members, 100 on /patrol-log/create, 68 on /radio-logs, others): SLSA upstream server not responding. Proxy returns generic 502.

524 Cloudflare Timeout (168 on /patrol-log/create, 33 on /patrol-log-members): Request exceeded Cloudflare's timeout threshold.

503 Application Error (104 on /patrol-log-members, 41 on /radio-logs, 39 on /patrol-log/create): Full HTML error page returned instead of JSON — SLSA app crash.

Impact: These are SLSA server-side issues, not Operations App bugs. However, the app should handle them gracefully.

Monthly 5xx trend:

Oct: 43 • Nov: 176 • Dec: 514 • Jan: 362 • Feb: 261

Recommendation: Implement retry with backoff for 5xx. Parse HTML error pages to detect app crashes vs gateway errors. Report 5xx rates to SLSA API team.
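
Retry with backoff comes down to two decisions: whether a status is worth retrying, and how long to wait before the next attempt. The sketch below uses capped exponential backoff with full jitter. The base delay, the cap, and the choice to retry only 502/503/524 are assumptions, not values from the dashboard.

```typescript
// Gateway and availability errors are usually transient. The 500s in the table
// are mostly code bugs, so they are not retried here (an assumed policy).
function isRetryableStatus(statusCode: number): boolean {
  return statusCode === 502 || statusCode === 503 || statusCode === 524;
}

// Delay before retry number `attempt` (0-based): a random value between 0 and
// min(capMs, baseMs * 2^attempt). `random` is injectable for testing.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 8000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

Retrying a create request is not idempotent in general. Since /patrol-log/create already rejects duplicates with a 403, a retried request that succeeded the first time should surface as that 403 and not as a second log, but this is worth confirming with the SLSA API team.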

Root Cause	6-Month Total	% of Errors	Actionable?	Who Fixes?
Failed logins (wrong credentials)	15,969	44.0%	Expected	User education / UX
Unsigned patrol status check (404)	4,544	12.5%	Expected	Handle 404 as "not signed on"
SLSA server errors (502/503/524)	1,089	3.0%	Infra	SLSA API team
Duplicate state (already signed on/off)	1,005	2.8%	Check First	Nano — pre-check state
Patrol log duplicates/not found	900	2.5%	Handle	Nano — handle 403 gracefully
Position limits exceeded	755	2.1%	Prevent	Nano — frontend validation
Password change required	586	1.6%	Expected	User redirect to SLS Hub
Validation failures (field length etc)	559	1.5%	Prevent	Nano — input validation
Lifeguard primary contact validation	548	1.5%	Prevent	Nano — fix form field
Time validation errors	401	1.1%	Fixed	Resolved (Clifton Beach)
TOTAL ERRORS	36,281	100%	56.5% expected/user-error • 43.5% actionable

Bottom line: Of 36,281 total errors, 56.5% are expected (failed logins, unsigned patrols). The remaining 43.5% (15,800 errors) are actionable — split between Nano fixes (frontend validation, state checking) and SLSA infrastructure issues (5xx errors).

Finding 1: January 10 Cascade Failure — 167 Server Errors in 2 Hours

Between 9–11pm on Jan 10, the SLSA API buckled — 63 Cloudflare timeouts on /patrol-log/create, then 502s cascaded to /patrol-log-members and /radio-logs. 135 unique users affected. This was a precursor to the larger Jan 17-18 incident.

Recommendation: Circuit breaker — if 5xx rate exceeds 5% in 5 minutes, queue requests and retry with backoff.
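
A circuit breaker matching this rule (5xx rate above 5% within 5 minutes) could be sketched as follows. The class name, the minimum sample count, and the 30-second cool-down are assumptions; queuing and retrying are left to the caller.

```typescript
// Opens when more than 5% of calls in the last 5 minutes returned a 5xx.
// While open, callers should queue requests instead of sending them.
class ServerErrorCircuitBreaker {
  private outcomes: { timestampMs: number; serverError: boolean }[] = [];
  private openedAtMs: number | null = null;
  private readonly windowMs = 5 * 60 * 1000;
  private readonly threshold = 0.05;
  private readonly minSamples: number;
  private readonly coolDownMs: number;

  constructor(minSamples: number = 20, coolDownMs: number = 30000) {
    this.minSamples = minSamples; // assumed guard against tiny samples
    this.coolDownMs = coolDownMs; // assumed pause before sending again
  }

  // Records one completed call and opens the breaker if the rate is exceeded.
  record(statusCode: number, nowMs: number): void {
    this.outcomes.push({ timestampMs: nowMs, serverError: statusCode >= 500 });
    this.outcomes = this.outcomes.filter(
      (o) => o.timestampMs > nowMs - this.windowMs,
    );
    const errors = this.outcomes.filter((o) => o.serverError).length;
    if (
      this.outcomes.length >= this.minSamples &&
      errors / this.outcomes.length > this.threshold
    ) {
      this.openedAtMs = nowMs;
    }
  }

  // True while requests should be held back.
  isOpen(nowMs: number): boolean {
    if (this.openedAtMs === null) return false;
    if (nowMs - this.openedAtMs >= this.coolDownMs) {
      this.openedAtMs = null; // cool-down over: let traffic through again
      return false;
    }
    return true;
  }
}
```

This version re-opens as soon as the next recorded call still leaves the window above 5%. A fuller implementation would add a half-open state that lets a single probe request through first.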

Finding 2: "Repeat Offender" Users — Top User Hit 246 Errors in 13 Days

The top 15 users account for 885 errors (28% of user-triggered errors). These are patrol captains retrying the same failed action — the app doesn't explain what went wrong, so they click again hoping it'll work.

Recommendation: After 3 consecutive errors, show contextual help. Track error-per-user rates for proactive club outreach.
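
The "after 3 consecutive errors" rule needs a streak counter per user and action. A minimal sketch, with a hypothetical class name and key scheme:

```typescript
// Tracks consecutive failures per user and endpoint and signals when
// contextual help should be shown.
class ConsecutiveErrorTracker {
  private streaks: { [key: string]: number } = {};
  private readonly helpAfter: number;

  constructor(helpAfter: number = 3) {
    this.helpAfter = helpAfter;
  }

  // Records one outcome. Returns true when the current failure streak has
  // reached the help threshold.
  record(userId: string, endpoint: string, succeeded: boolean): boolean {
    const key = userId + " " + endpoint;
    if (succeeded) {
      delete this.streaks[key]; // a success resets the streak
      return false;
    }
    const streak = (this.streaks[key] ?? 0) + 1;
    this.streaks[key] = streak;
    return streak >= this.helpAfter;
  }
}
```

The same streak data, aggregated per club, could feed the proactive outreach mentioned above.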

Finding 3: Errors Are Slow — Users Wait 20 Seconds for Nothing

Endpoint	Error response time	Success response time	Slowdown
`/patrol-log/create`	19.5s	0.4s	49x
`/patrol-team`	13.9s	0.3s	46x
`/patrol-log-members`	0.7s	0.2s	3.5x

Recommendation: Client-side timeout at 10s with "Still working..." message. Validate before sending to catch 400s at 0ms.

Finding 4: Prevention Roadmap — 16.7% Reduction With Small Effort

Strategy	Errors Prevented	%	Effort	Priority
Handle duplicates gracefully	1,587	4.4%	Small	HIGH
Retry 5xx with backoff	1,312	3.6%	Medium	HIGH
Check state before API call	1,119	3.1%	Small	HIGH
Frontend limit enforcement	755	2.1%	Small	MED
Input validation (maxlength)	721	2.0%	Small	MED
Fix primary contact dropdown	548	1.5%	Tiny	MED
Fix undefined injury JS bug	162	0.4%	Tiny	HIGH
TOTAL PREVENTABLE	6,204	16.7%	Rest: expected 56.5% + SLSA infra 3.6%

Finding 5: Sign-On/Off Rush Creates Predictable Spikes

Evening Sign-On (9–11pm)

Double-clicks, position limits, timeouts. Fix: Disable button after click, show position counts, loading timeout message.

Morning Sign-Off (11am–12pm)

"Already signed off" double-clicks, missing location. Fix: Disable after click, auto-populate location from GPS.

Finding 6: SLSA Azure Infrastructure Is the Weakest Link

1,089 server errors (3%) from SLSA's Azure API. Key issues:

503 app crashes return HTML (detect and handle)
502 bad gateway (retry twice)
524 Cloudflare timeouts (client-side timeout at 15s)
"Too many connections" (17×, MySQL pool)
"SLS-HUB-PROD not available" (1×, full DB outage)

Recommendation: Monthly infra report to SLSA API team. Advocate for: higher connection pool, Azure scaling rules, Cloudflare timeout increase.

Dashboard charts and incident panels (titles only): Dec 2025: Clifton Beach; Jan 17–18: Database Performance; Errors by Hour of Day (AEST); Errors by Day of Week; Sign-On vs Sign-Off Error Timing, the two rush hours (sign-on errors peak 9–11pm AEST, sign-off errors peak 11am–12pm AEST).
