fix(trickplay): stop scan-time sprite generation from saturating the host
Some checks failed
CI / Test (push) Failing after 6m21s
CI / Build (push) Successful in 1m34s
CI / Build-1 (push) Successful in 2m0s
CI / Build-2 (push) Successful in 1m33s
CI / Build-3 (push) Successful in 1m38s
CI / Build-4 (push) Successful in 1m35s
CI / Build-5 (push) Successful in 1m38s
CI / Lint (push) Failing after 2m34s
CI / Coverage (push) Failing after 2m44s
CI / Vet (push) Successful in 2m3s
Trickplay sprite generation (one full-decode ffmpeg pass per file) could pin a machine: multiple agents on the same library decoded the same 4K file at once, there was no CPU throttling, and crashed or restarted agents orphaned ffmpeg to init, where it ran the full 45-min decode to completion. Stacked orphans spiked a box to load ~140.

- Single-flight lock: O_CREATE|O_EXCL .lock in the shared sidecar dir so two agents watching the same library never decode the same file at the same time (stale locks are reclaimed after a TTL). Returns ErrTrickplayInProgress, which prewarm treats as a skip, not a failure.
- Load gate: defer the heavy decode until 1-min load ≤ max(ratio×NumCPU, 1.5), with the wait capped at 15 min so it throttles without ever becoming a permanent off-switch on busy or small hosts. New knob library.prewarm_max_load_ratio (default 0.7).
- Concurrency: trickSem caps trickplay to ONE decode at a time per agent.
- CPU priority: setLowCPUPriority (nice 19) alongside the existing idle ionice.
- No orphans: hardenCmd sets Setpgid + Pdeathsig=SIGKILL, with runtime.LockOSThread around the child so the kernel kills ffmpeg exactly when the agent dies (and not spuriously; see golang/go#27505).

Tests: single-flight/stale-reclaim, load-gate immediate/cancel, and an e2e Pdeathsig orphan-kill check.
This commit is contained in:
parent
aba20e2078
commit
c82826bf68
10 changed files with 399 additions and 8 deletions
46
internal/library/loadgate_test.go
Normal file
@@ -0,0 +1,46 @@
package library

import (
	"context"
	"testing"
	"time"

	"github.com/torrentclaw/unarr/internal/library/mediainfo"
)

// A huge ratio means the threshold is always above the real load, so the gate
// must return immediately (no blocking) regardless of how busy the box is.
func TestWaitForLowLoad_HighRatioReturnsImmediately(t *testing.T) {
	done := make(chan struct{})
	go func() {
		waitForLowLoad(context.Background(), 1e9)
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(2 * time.Second):
		t.Fatal("waitForLowLoad blocked despite an impossibly-high threshold")
	}
}

// With a tiny ratio the gate would block (load almost always exceeds it), but a
// cancelled context must unblock it promptly — the prewarm has to stop cleanly on
// Ctrl-C / daemon shutdown even while waiting for the machine to go idle.
func TestWaitForLowLoad_RespectsContextCancel(t *testing.T) {
	if _, ok := mediainfo.LoadAverage1(); !ok {
		t.Skip("no load reading on this platform — gate is a no-op")
	}
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // already cancelled

	done := make(chan struct{})
	go func() {
		waitForLowLoad(ctx, 0.0001) // threshold ~0 → would otherwise block
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(2 * time.Second):
		t.Fatal("waitForLowLoad ignored a cancelled context")
	}
}