feat(daemon): continuous health telemetry + heartbeat for copy sessions

The F3 watcher posted ONE speed= snapshot at startup and then exited: an
encoder that was healthy at minute 1 and choked at minute 20 (complex scene,
GPU stolen by another process) was invisible to the player's stall triage,
which decided on the startup reading.

- monitorSessionHealth: 5s ticker for the rest of the session; re-posts when
  the ok/marginal/struggling bucket changes (with 2-tick hysteresis: an EWMA
  dancing around 0.95 must not fire a webhook every 5s) or when the ratio
  drifts by ≥0.15. A failed POST does NOT advance the baseline: the next tick
  retries (losing the only webhook for the transition to struggling blinded
  the player in exactly the case this exists to cover).
- resetTranscodeStats() in restartFromSegment: the new ffmpeg spawned by a
  seek re-arms the warmup and re-seeds the EWMA. Its cold frames (speed=0.0x)
  dragged the warmed-up average below 0.75, and the monitor would have posted
  a false "struggling" that paused the player mid-seek. Verified e2e: two
  restarts (seek to 1200s) with health steady at ok.
- inputBound is windowed (30s) instead of sticky: a transient read blip no
  longer reclassifies every dip <0.95 as input_bound/struggling for the rest
  of a multi-hour session.
- Copy heartbeat (F2): -c:v copy sessions post {ok, 1.0, "copy"} once after
  the ready, so the web can now tell a "copy session" from an "old agent
  without telemetry" (both used to be null). It is deliberately a second
  POST: a 400 from an old web (enum without "copy") must never block the
  ready.
- Failure logs are tagged by POST type: a failed heartbeat no longer reads as
  "mark-ready failed" (the ready DID land).

Requires a web with updated session-ready/SSE (deploy web first; against an
old web everything degrades to best-effort with a log).
parent 2b9d576aee
commit f0c51c5d90
3 changed files with 129 additions and 12 deletions
@@ -317,7 +317,12 @@ type HLSSession struct {
 	fpsEWMA float64
 	speedSamples int
 	warmupSeen int // cold-start frames discarded before the EWMA is trusted
-	inputBound bool
+	// Walltime of the LAST source-read error ffmpeg reported. Windowed (see
+	// hlsInputBoundWindow) instead of a sticky bool: with the F1 continuous
+	// monitor a single transient read blip (peer drop, debrid hiccup ffmpeg
+	// reconnects through) must not reclassify every sub-realtime dip as
+	// "input_bound/struggling" for the rest of a multi-hour session.
+	inputErrAt time.Time
 }
 
 // hlsSeekAhead is how many segments past the writer's current position the
@@ -395,6 +400,12 @@ func (r *HLSSessionRegistry) CloseWhere(pred func(*HLSSession) bool) int {
 // cache-fill (HLSSessionConfig.Prewarm). cfg is immutable after construction.
 func (s *HLSSession) IsPrewarm() bool { return s.cfg.Prewarm }
 
+// IsVideoCopy reports whether this session serves -c:v copy (no video
+// re-encode). Copy sessions emit no ffmpeg -stats telemetry, so the ready
+// watcher posts a one-shot "copy" health heartbeat instead of waiting for
+// speed= samples that never arrive.
+func (s *HLSSession) IsVideoCopy() bool { return s.cfg.VideoCopy }
+
 // RegisterKeep adds a session WITHOUT displacing the others — the prewarm
 // path: a background cache-fill encode must not evict the viewer's live
 // session (Register's eviction killed the stream being watched when the
@@ -830,11 +841,16 @@ func (s *HLSSession) GetTranscodeStats() TranscodeStats {
 		SpeedX: s.speedEWMA,
 		Fps: s.fpsEWMA,
 		Samples: s.speedSamples,
-		InputBound: s.inputBound,
+		InputBound: !s.inputErrAt.IsZero() && time.Since(s.inputErrAt) < hlsInputBoundWindow,
 		FromCache: s.fromCache,
 	}
 }
 
+// hlsInputBoundWindow bounds how long a source-read error keeps classifying
+// the session as input-bound. Past it, a sub-realtime encode is the encoder's
+// own problem again (the transient link blip resolved or ffmpeg reconnected).
+const hlsInputBoundWindow = 30 * time.Second
+
 // hlsStatsWarmupSkip is how many leading -stats frames to discard before
 // trusting the EWMA. ffmpeg's first readings reflect the pipeline filling
 // (often speed=0.0x) and would otherwise drag a healthy encoder into a false
@@ -868,7 +884,22 @@ func (s *HLSSession) recordProgress(speedX, fps float64) {
 // the input pull (slow debrid link / dropped torrent peer), not the encoder.
 func (s *HLSSession) markInputBound() {
 	s.statsMu.Lock()
-	s.inputBound = true
+	s.inputErrAt = time.Now()
 	s.statsMu.Unlock()
 }
 
+// resetTranscodeStats re-arms the cold-start warmup and drops the EWMAs +
+// input-error mark. MUST be called whenever a NEW ffmpeg process starts
+// inside the same session (seek restart, auto-restart supervisor): the new
+// process's pipeline-fill frames read speed=0.0x, and folding them into the
+// already-warmed EWMA drags a healthy 1.5x encode under the 0.75 struggling
+// floor in two samples — which the F1 health monitor would then report as a
+// false "struggling" (pausing the player) right at the seek the user made.
+func (s *HLSSession) resetTranscodeStats() {
+	s.statsMu.Lock()
+	s.warmupSeen = 0
+	s.speedSamples = 0 // recordProgress re-seeds the EWMA on the next sample
+	s.inputErrAt = time.Time{}
+	s.statsMu.Unlock()
+}
+
@@ -1415,6 +1446,7 @@ func (s *HLSSession) restartFromSegment(targetIdx int) error {
 	}
 
 	// Reset session state so the poll + wait machinery picks up the new run.
+	s.resetTranscodeStats() // new ffmpeg = new cold ramp; don't poison the EWMA
 	s.mu.Lock()
 	s.cmd = cmd
 	s.cancel = cancel
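
The "under the 0.75 struggling floor in two samples" claim in the resetTranscodeStats comment can be checked with a small standalone sketch. The smoothing factor is not visible in this diff, so `alpha = 0.3` below is purely an assumption (any value above roughly 0.29 reproduces the two-sample drop from 1.5x), and the `ewma` type is a hypothetical stand-in for the session's stats fields. The warmup discard is left out for brevity: the sample added after `reset` stands for the first reading that survives warmup.

```go
package main

import "fmt"

// alpha is an ASSUMED smoothing factor; the real value is not shown in the diff.
const alpha = 0.3

// ewma is a hypothetical stand-in for the session's speed average.
type ewma struct {
	value   float64
	samples int
}

// add seeds the average on the first sample and blends afterwards.
func (e *ewma) add(x float64) {
	if e.samples == 0 {
		e.value = x
	} else {
		e.value = alpha*x + (1-alpha)*e.value
	}
	e.samples++
}

// reset mirrors the idea behind resetTranscodeStats: zeroing the sample
// count makes the next add re-seed the average instead of blending into it.
func (e *ewma) reset() { e.samples = 0 }

func main() {
	// Without a reset, two cold speed=0.0x frames are folded into a warm 1.5x average.
	poisoned := &ewma{}
	poisoned.add(1.5)
	poisoned.add(0.0)
	poisoned.add(0.0)
	fmt.Printf("without reset: %.3f\n", poisoned.value) // 1.5 * 0.7 * 0.7

	// With a reset, the first trusted sample of the new process replaces the old average.
	clean := &ewma{}
	clean.add(1.5)
	clean.reset()
	clean.add(1.4)
	fmt.Printf("with reset:    %.3f\n", clean.value)
}
```

Under the assumed alpha the unreset average lands at about 0.735, just below the 0.75 floor, while the reset path reports the new process's own 1.4x reading.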