fix(trickplay): stop scan-time sprite generation from saturating the host
Some checks failed
CI / Test (push) Failing after 6m21s
CI / Build (push) Successful in 1m34s
CI / Build-1 (push) Successful in 2m0s
CI / Build-2 (push) Successful in 1m33s
CI / Build-3 (push) Successful in 1m38s
CI / Build-4 (push) Successful in 1m35s
CI / Build-5 (push) Successful in 1m38s
CI / Lint (push) Failing after 2m34s
CI / Coverage (push) Failing after 2m44s
CI / Vet (push) Successful in 2m3s
Trickplay sprite generation (one full-decode ffmpeg pass per file) could pin a machine: multiple agents on the same library decoded the same 4K file at once, no CPU throttling, and crashed/restarted agents orphaned ffmpeg to init (it ran the full 45-min decode to completion). Stacked orphans spiked a box to load ~140.

- Single-flight lock: O_CREATE|O_EXCL .lock in the shared sidecar dir so two agents watching the same library never decode the same file twice (stale locks reclaimed after a TTL). Returns ErrTrickplayInProgress → prewarm skips, not fail.
- Load gate: defer the heavy decode until 1-min load ≤ max(ratio×NumCPU, 1.5), capped at 15 min so it throttles without ever becoming a permanent off-switch on busy / small hosts. New knob library.prewarm_max_load_ratio (default 0.7).
- Concurrency: trickSem caps trickplay to ONE decode at a time per agent.
- CPU priority: setLowCPUPriority (nice 19) alongside the existing idle ionice.
- No orphans: hardenCmd sets Setpgid + Pdeathsig=SIGKILL, with runtime.LockOSThread around the child so the kernel kills ffmpeg exactly when the agent dies (and not spuriously — golang/go#27505).

Tests: single-flight/stale-reclaim, load-gate immediate/cancel, and an e2e Pdeathsig orphan-kill check.
parent aba20e2078
commit c82826bf68
10 changed files with 399 additions and 8 deletions
@@ -2,7 +2,13 @@

 package mediainfo

-import "syscall"
+import (
+	"os"
+	"os/exec"
+	"strconv"
+	"strings"
+	"syscall"
+)

 // Linux I/O priority (ioprio) constants. The 16-bit ioprio value packs a class
 // in the top 3 bits (shift 13) and a class-data nibble below it; the IDLE class
@@ -23,3 +29,47 @@ func setIdleIOPriority(pid int) {
 	ioprio := ioprioClassIdle << ioprioClassShift // IDLE class, data 0
 	_, _, _ = syscall.Syscall(syscall.SYS_IOPRIO_SET, uintptr(ioprioWhoProcess), uintptr(pid), uintptr(ioprio))
 }
+
+// setLowCPUPriority best-effort drops a process to the lowest CPU niceness (19),
+// so the heavy trickplay full-decode pass yields the CPU to foreground work.
+// Pairs with setIdleIOPriority (disk): IDLE I/O alone is not enough when the
+// bottleneck is software/contended 4K decode — without CPU nice, N stacked
+// decodes pin every core (the host hit load ~140). Errors are ignored — it's an
+// optimization, not required for correctness.
+func setLowCPUPriority(pid int) {
+	_ = syscall.Setpriority(syscall.PRIO_PROCESS, pid, 19)
+}
+
+// hardenCmd makes the child ffmpeg die with this agent. Setpgid isolates it in
+// its own process group, and Pdeathsig=SIGKILL asks the kernel to kill it the
+// instant the agent process dies. Without this, exec.CommandContext can only
+// enforce its timeout from an in-process goroutine — an agent crash / restart /
+// SIGKILL kills that goroutine, so the ffmpeg is reparented to init (ppid 1) and
+// runs its full 45-min decode to the end. Successive dev restarts stacked those
+// orphans (one pair per restart) and spiked the box to load ~140.
+func hardenCmd(cmd *exec.Cmd) {
+	if cmd.SysProcAttr == nil {
+		cmd.SysProcAttr = &syscall.SysProcAttr{}
+	}
+	cmd.SysProcAttr.Setpgid = true
+	cmd.SysProcAttr.Pdeathsig = syscall.SIGKILL
+}
+
+// LoadAverage1 returns the 1-minute system load from /proc/loadavg. ok=false when
+// it can't be read, so callers treat "unknown" as "don't gate" (proceed) rather
+// than blocking forever.
+func LoadAverage1() (float64, bool) {
+	b, err := os.ReadFile("/proc/loadavg")
+	if err != nil {
+		return 0, false
+	}
+	fields := strings.Fields(string(b))
+	if len(fields) == 0 {
+		return 0, false
+	}
+	v, err := strconv.ParseFloat(fields[0], 64)
+	if err != nil {
+		return 0, false
+	}
+	return v, true
+}