ops_todos pending-queue triage (18 items), 2026-04-21

Conclusion

The 18 pending items reported by the 07:45 KPI (Ron 10 / Codex 5 / Cowork 3) could not be reproduced as-is from any single ops_todos table.

Findings:

canonical SQLite: /Users/ron/.openclaw/data/ops_multiagent.db
ops_todos holds more than 500 accumulated pending-type rows.
bus_commands alone shows 19 rows waiting or in progress (queued 18 + claimed 1); the 18 queued rows split as Ron 15 / Codex 2 / Cowork 1.
legacy SQLite: /Users/ron/.openclaw/workspace/ops_multiagent.db
ops_todos has todo = 10 and claimed = 3. All were created 2026-04-05 to 04-10 and are stale.
out-of-band JSON: /Users/ron/.openclaw/ops_todos.pending.json
Codex 1, Cowork 1, Guardian 2, Data-analyst 1.

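This report therefore reconstructs the KPI's 18 items as follows and triages them on that basis.

Ron 10 = the 10 most recent queued bus_commands rows targeting Ron.
Codex 5 = 4 Codex rows in bus_commands (queued 2 + failed 2) + 1 Codex entry in the pending JSON.
Cowork 3 = 3 Cowork rows in bus_commands (queued 1 + claimed 1 + failed 1).

No DB state was changed. Cancelling rows or marking them stale is a destructive DB write, so only the verdicts and the SQL to run are recorded below.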

1. Storage location survey

Location	Status	Verdict
`/Users/ron/.openclaw/data/ops_multiagent.db`	Exists, mtime around 2026-04-21 08:15	Current canonical DB. Contains both `ops_todos` and `bus_commands`
`/Users/ron/.openclaw/workspace/ops_multiagent.db`	Exists, mtime 2026-04-10	Legacy DB. `ops_todos.todo=10`, `claimed=3`; all old
`/Users/ron/.openclaw/ops_todos.pending.json`	Exists	Out-of-band pending queue, 5 entries
`/Users/ron/.hermes/workspace/memory/tasks/`	No files	Not a current KPI source
`/Users/ron/.hermes/workspace/memory/ops_todos`	Directory/path exists but contains no files	Not a current KPI source
`find ~/.hermes ~/.openclaw -name "ops_todos*"`	Many matches	Includes migration/archive/bak copies. Live use centers on the 3 locations above

Canonical DB schema check

In /Users/ron/.openclaw/data/ops_multiagent.db, ops_todos uses an assigned_to column rather than owner.

ops_todos columns:
id, title, detail, workflow_id, status, assigned_to, priority, created_at, updated_at, started_at, completed_at, due_date, source, agentfirm_run_id

bus_commands uses the following columns.

bus_commands columns:
id, title, body, requested_by, target_agent, status, priority, claimed_by, result_note, created_at, updated_at, claimed_at, completed_at, model_used, workflow_id, notify_flag, collab_chat_id, parent_cmd_id

Actual counts

The pending-type rows in ops_todos do not match the KPI figure of 18.

assigned_to=ron: 34 todo
assigned_to=codex: 2 todo
assigned_to=cowork: 24 todo
assigned_to=ops-queue: 272 todo
Plus many more under analyst/guardian/system and others

bus_commands has exactly 18 queued rows, but the per-owner distribution differs from the KPI.

claimed cowork: 1
queued codex: 2
queued cowork: 1
queued ron: 15
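
The per-owner split above can be re-derived with one grouped query. The following is a minimal Python sketch, assuming only the status and target_agent columns listed for bus_commands. The `queue_counts` and `demo_conn` helpers, the lowercase agent values, and the in-memory demo rows are illustrative stand-ins, not the production data.

```python
import sqlite3


def queue_counts(conn, statuses=("queued", "claimed")):
    """Return {(status, target_agent): count} for the given statuses."""
    placeholders = ",".join("?" for _ in statuses)
    rows = conn.execute(
        "SELECT status, target_agent, COUNT(*) FROM bus_commands "
        f"WHERE status IN ({placeholders}) "
        "GROUP BY status, target_agent",
        tuple(statuses),
    ).fetchall()
    return {(status, agent): n for status, agent, n in rows}


def demo_conn():
    """Hypothetical in-memory table mirroring the counts reported above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bus_commands "
        "(id INTEGER PRIMARY KEY, target_agent TEXT, status TEXT)"
    )
    rows = (
        [("ron", "queued")] * 15
        + [("codex", "queued")] * 2
        + [("cowork", "queued")]
        + [("cowork", "claimed")]
    )
    conn.executemany(
        "INSERT INTO bus_commands (target_agent, status) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    for key, n in sorted(queue_counts(demo_conn()).items()):
        print(key, n)
```

Against the real file, the same function could be handed a read-only connection, for example `sqlite3.connect("file:/Users/ron/.openclaw/data/ops_multiagent.db?mode=ro", uri=True)`, which keeps the check consistent with the no-write stance of this report.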

The dashboard API was also checked, but connections to port 3344 currently fail.

curl http://127.0.0.1:3344/api/bus/command-queue?limit=5
=> Failed to connect to 127.0.0.1 port 3344

curl http://127.0.0.1:3344/api/ops/todos?limit=5
=> Failed to connect to 127.0.0.1 port 3344

2. Ron: triage of 10 items

Ron's 10 items are the 10 most recent queued bus_commands rows for Ron. Titles are shown as stored; `[drift-recovery] 시스템 이슈 2건 감지` means "[drift-recovery] 2 system issues detected".

id	Status	Priority	Created	Title	Verdict
29204	queued	urgent	2026-04-20 15:00:01	`[drift-recovery] 시스템 이슈 2건 감지`	Handle today: CRITICAL
29203	queued	urgent	2026-04-20 03:00:01	`[drift-recovery] 시스템 이슈 2건 감지`	Duplicate. Discard candidate once #29204 is handled
29202	queued	urgent	2026-04-19 15:00:01	`[drift-recovery] 시스템 이슈 2건 감지`	Duplicate. Discard candidate
29201	queued	urgent	2026-04-19 03:00:01	`[drift-recovery] 시스템 이슈 2건 감지`	Duplicate. Discard candidate
29200	queued	urgent	2026-04-18 15:00:01	`[drift-recovery] 시스템 이슈 2건 감지`	Duplicate. Discard candidate
29199	queued	urgent	2026-04-18 03:00:01	`[drift-recovery] 시스템 이슈 2건 감지`	Duplicate. Discard candidate
29198	queued	urgent	2026-04-17 15:00:01	`[drift-recovery] 시스템 이슈 2건 감지`	Duplicate. Discard candidate
29197	queued	urgent	2026-04-17 03:00:01	`[drift-recovery] 시스템 이슈 2건 감지`	Duplicate. Discard candidate
29196	queued	urgent	2026-04-16 15:00:01	`[drift-recovery] 시스템 이슈 2건 감지`	Duplicate. Discard candidate
29193	queued	high	2026-04-16 14:00:29	`OpenClaw 잔존 4개 launchd 서비스(...)`	Handle today: HIGH

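Ron: items to handle today

CRITICAL: #29204, latest drift-recovery alert
The latest logs show the same problem repeating.
ops_syncer.py cron entry missing.
gateway service detected as down.
One automatic recovery was attempted; zero succeeded.
The 2026-04-21 00:00 run shows the same result.
HIGH: #29193, 4 residual OpenClaw launchd services
Request in the body: restart vault-watcher, otel-collector, tts-webhook, and pipeline-orchestrator against the Hermes plists.
However, grepping launchctl list currently returns nothing for any of the four labels.
Before restarting, first confirm the actual label names and whether the plists exist.

drift-recovery evidence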

/Users/ron/.openclaw/workspace/bus/status/drift-recovery.json:

{
  "agent": "drift-recovery",
  "status": "done",
  "ts": "2026-04-20T15:00:01.586134Z",
  "detail": "점검 완료: 이슈 4건, 복구 1건 (성공 0)",
  "_online": true
}

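The detail field reads "check complete: 4 issues, 1 recovery (0 succeeded)".

Repeating pattern in recent logs (messages are Korean in the original: "cron 누락" = cron entry missing, "서비스 중단" = service down, "복구 조치" = recovery action, "심각도 알림 등록" = severity alert registered):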

[2026-04-21 00:00:00] Orchestrator: degraded (2 issues)
[2026-04-21 00:00:00] WARN cron 누락: ops_syncer.py
[2026-04-21 00:00:01] WARN 서비스 중단: gateway (severity: critical)
[2026-04-21 00:00:01] 복구 조치: 1건 (성공 0)
[2026-04-21 00:00:01] 심각도 알림 등록: [drift-recovery] 시스템 이슈 2건 감지

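Ron: duplicate discard candidates

#29196 through #29203 are all the same repeated `[drift-recovery] 시스템 이슈 2건 감지` ("2 system issues detected") alert. All eight can be discarded, keeping only #29204.

SQL to run after approval: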

UPDATE bus_commands
SET status='cancelled',
    updated_at=datetime('now','localtime'),
    completed_at=datetime('now','localtime'),
    result_note='stale duplicate: superseded by #29204 per 260421 ops triage'
WHERE id IN (29196,29197,29198,29199,29200,29201,29202,29203)
  AND status='queued';
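
Because this UPDATE is destructive, it may be worth wrapping it so that it either changes exactly the expected rows or changes nothing. The following is a minimal Python sketch under that assumption; `cancel_rows` and `demo_conn` are hypothetical helpers, and the demo table carries only the columns the UPDATE touches.

```python
import sqlite3


def cancel_rows(conn, ids, expected_status, note):
    """Cancel exactly the given ids, or roll back and change nothing."""
    ids = list(ids)
    placeholders = ",".join("?" for _ in ids)
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE bus_commands "
            "SET status='cancelled', "
            "    updated_at=datetime('now','localtime'), "
            "    completed_at=datetime('now','localtime'), "
            "    result_note=? "
            f"WHERE id IN ({placeholders}) AND status=?",
            (note, *ids, expected_status),
        )
        if cur.rowcount != len(ids):
            # Some row changed state since triage; abort the whole batch.
            raise RuntimeError(
                f"expected {len(ids)} rows, matched {cur.rowcount}; rolled back"
            )
    return cur.rowcount


def demo_conn():
    """Hypothetical in-memory stand-in with ids 29196..29204, all queued."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bus_commands (id INTEGER PRIMARY KEY, status TEXT, "
        "updated_at TEXT, completed_at TEXT, result_note TEXT)"
    )
    conn.executemany(
        "INSERT INTO bus_commands (id, status) VALUES (?, 'queued')",
        [(i,) for i in range(29196, 29205)],
    )
    conn.commit()
    return conn
```

The row-count check is the point of the sketch: if any of the eight rows has been claimed or completed since the triage snapshot, the batch is rolled back instead of partially applied.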

3. Codex: triage of 5 items and re-dispatch prompts

Codex's 5 items are reconstructed from 4 bus_commands rows + 1 pending JSON entry.

id/source	Status	Priority	Created	Title	Verdict
#29194	queued	high	2026-04-16 14:00:29	`OpenClaw 잔존 4개 launchd 서비스` (4 residual OpenClaw launchd services)	Re-dispatchable
#29179	queued	high	2026-04-14 14:00:28	`[R1] CMD#29162 실행 문자열·스크립트 분석 및 안전 패치 제안` (analyze CMD#29162 exec strings and scripts, propose a safe patch)	Re-dispatchable. Core item
#29173	failed	high	2026-04-14 13:50:18	`[R1] CMD#29162...`	Failed duplicate of #29179. Discard candidate
#29169	failed	high	2026-04-14 13:39:17	`CMD#29162...`	Failed duplicate of #29179. Discard candidate
pending JSON	pending	medium	per file	`Review and apply small patch to scripts/pipeline/ingest.py...`	Re-dispatchable

Codex Prompt 1: #29194, verify residual launchd services

[Role] Codex: verify the 4 residual OpenClaw launchd services
[Background]
Command #29194. After the OpenClaw→Hermes migration, the following 4 services remain in the residual queue:
- vault-watcher
- otel-collector
- tts-webhook
- pipeline-orchestrator

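[Task]
1) Check whether the labels/plists above actually exist in ~/Library/LaunchAgents, /Library/LaunchAgents, and /Library/LaunchDaemons
2) Check run state with launchctl list/print
3) Check that each plist's WorkingDirectory/StandardOutPath/StandardErrorPath point under /Users/ron/.hermes
4) Do not touch ai.hermes.gateway under any circumstances
5) Propose a restart command only if needed; perform the actual restart only after Harry approves

[Output]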
/tmp/codex_29194_launchd_residual.md

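Codex Prompt 2: #29179, safe patch for CMD#29162

[Role] Codex: analyze CMD#29162 exec strings and scripts, and propose a safe patch
[Background]
Command #29162 "오늘의 지식사랑방 인사이트 생성" (generate today's 지식사랑방 insights) failed on the 180-second timeout of hypothesis_engine.py, and the subsequent Codex retries #29169/#29173 also failed on the model side (closed connection and timeout, respectively). The retry target is now consolidated into #29179 alone.

[Task]
1) Identify the scripts CMD#29162 ran:
   - pipeline/discovery_filter.py
   - pipeline/hypothesis_engine.py
   - ontology_core.py --action sector_insights
2) For each, check the exec string, timeout, concurrency, and argument-injection risk
3) Propose a minimal-change patch: escaping, separate timeouts, more logging, saving partial successes
4) Write the diff and verification commands before modifying any code
5) Do not burn quota on model/API calls; prefer local dry-runs

[Output]
/tmp/codex_29179_cmd29162_patch_plan.md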

Codex Prompt 3: #29173, duplicate handling

[Role] Codex: clean up duplicate failure #29173
[Background]
#29173 is a failed retry of the same CMD#29162 analysis task as #29179; its result_note is "ollama/qwen3.5:9b-nothinker: timed out".

[Task]
Do not start a separate new analysis; include this failure cause in the #29179 deliverable. After #29179 completes, recommend cancelling #29173 as duplicate/stale.

[Output]
Merge into the #29179 report as a "previous failed retries" section.

Codex Prompt 4: #29169, duplicate handling

[Role] Codex: clean up duplicate failure #29169
[Background]
#29169 is the first failed retry of the same CMD#29162 analysis task as #29179; its result_note is "ollama/qwen3.5:9b-nothinker: Remote end closed connection without response".

[Task]
Do not start a separate new analysis; include this failure cause in the #29179 deliverable. After #29179 completes, recommend cancelling #29169 as duplicate/stale.

[Output]
Merge into the #29179 report as a "previous failed retries" section.

Codex Prompt 5: pending JSON, ingest.py retry/backoff

[Role] Codex: small retry/backoff patch for ingest.py
[Background]
Codex entry in /Users/ron/.openclaw/ops_todos.pending.json. An approved small patch that adds retry/backoff to the network calls in scripts/pipeline/ingest.py.

[Task]
1) Determine whether the live path is Hermes or OpenClaw:
   - /Users/ron/.hermes/workspace/scripts/pipeline/ingest.py
   - /Users/ron/.openclaw/workspace/scripts/pipeline/ingest.py
2) Modify only the live path. Do not modify both.
3) Add bounded retry/backoff around the network calls
4) Dry-run verification:
   - review git diff
   - python3 <live ingest.py> --limit 5 --dry-run
   - check the file count in the raw output directory

[Output]
/tmp/codex_pending_ingest_retry_result.md
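
Step 3 of the prompt above asks for bounded retry/backoff. The contents of ingest.py are not shown in this report, so the following is only a generic Python sketch of that pattern; `call_with_retry`, its parameters, and the choice of OSError as the retryable error class are assumptions for illustration.

```python
import time


def call_with_retry(
    fn,
    attempts=3,
    base_delay=0.5,
    max_delay=8.0,
    retry_on=(OSError,),
    sleep=time.sleep,
):
    """Call fn(); on a retryable error, wait with exponential backoff.

    Bounded in two ways: at most `attempts` calls, and each wait is
    capped at `max_delay` seconds. The final failure is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            # 0.5s, 1s, 2s, ... capped at max_delay
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
```

A network call in the live script could then be wrapped as `call_with_retry(lambda: fetch(url))`, where `fetch` stands for whatever function ingest.py actually uses. Passing `sleep` as a parameter keeps a dry-run or unit test from actually waiting.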

Codex stale/cancel candidates

#29169 and #29173 need no separate rerun as long as #29179 is alive.

SQL to run after approval:

UPDATE bus_commands
SET status='cancelled',
    updated_at=datetime('now','localtime'),
    completed_at=datetime('now','localtime'),
    result_note='duplicate failed retry: merged into #29179 per 260421 ops triage'
WHERE id IN (29169,29173)
  AND status='failed';

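4. Cowork: triage of 3 items

id	Status	Priority	Created	Title	Verdict
29195	queued	high	2026-04-16 14:00:29	`OpenClaw 잔존 4개 launchd 서비스` (4 residual OpenClaw launchd services)	Duplicate of Codex #29194. Cowork can take a review/coordination role
29183	claimed	normal	2026-04-14 14:26:53	`codex 패치 배포용 최종 체크리스트 및 롤백 정책 확정` (finalize the release checklist and rollback policy for the codex patch)	Stuck in claimed for a long time. Reclaim/reassign candidate
29176	failed	normal	2026-04-14 13:55:57	`codex 안전 패치 배포 체크리스트·롤백 절차 작성` (write the release checklist and rollback procedure for the codex safe patch)	Failed duplicate of #29183. Discard candidate

No separate cowork-coordinator skill is exposed in this session. Judged against ron-orchestration-optimizer instead, delegability to Cowork is as follows.

id	Delegability	Recommended action
29195	Possible, but risks colliding with Codex #29194	Codex investigates the actual state; Cowork handles only result review and the operations checklist
29183	Possible	Stuck in claimed for 6+ days. Reclaim, then re-delegate as "document the release checklist"
29176	Low	Failed duplicate of #29183. Discard recommended

Cowork re-delegation prompt:

[Role] Cowork: compile the release checklist and rollback policy for the codex patch
[Background]
Command #29183 has been stuck in claimed since 2026-04-14. #29176 is a failed task with the same purpose. Once Codex produces the CMD#29162 safe-patch plan, a pre-release checklist and rollback policy will be needed.

[Task]
1) Pre-release checks: dry-run, logs, timeout, output artifact, rollback point
2) Approval criteria: production changes, cron restarts, and external sends require Harry's approval
3) Post-release monitoring metrics: error count, runtime, queue backlog, output freshness
4) Rollback procedure: restore file backups, revert launchd/cron, verify state
5) Write one release→monitoring→rollback scenario

[Output]
/tmp/cowork_codex_patch_release_checklist.md

Duplicate cancel SQL, to run after approval: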

UPDATE bus_commands
SET status='cancelled',
    updated_at=datetime('now','localtime'),
    completed_at=datetime('now','localtime'),
    result_note='duplicate failed checklist task: superseded by #29183 per 260421 ops triage'
WHERE id=29176
  AND status='failed';

Reclaiming #29183 requires a separate decision. It is already in `claimed` state, so a forced change would overwrite the worker's state.
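
If a reclaim is approved, one way to avoid clobbering a worker that is still active is a compare-and-set update: release the row only if it is still `claimed` with the same claimed_at value observed during triage. A minimal Python sketch follows; `release_stale_claim`, the decision to return the row to `queued`, and the demo timestamp and worker name are assumptions, not established bus behavior.

```python
import sqlite3


def release_stale_claim(conn, cmd_id, observed_claimed_at):
    """Requeue cmd_id only if its claim is unchanged since it was observed.

    Returns True if the row was released, False if the claim had moved
    (completed, re-claimed, or already released) and nothing was written.
    """
    with conn:
        cur = conn.execute(
            "UPDATE bus_commands "
            "SET status='queued', claimed_by=NULL, claimed_at=NULL, "
            "    updated_at=datetime('now','localtime') "
            "WHERE id=? AND status='claimed' AND claimed_at=?",
            (cmd_id, observed_claimed_at),
        )
    return cur.rowcount == 1


def demo_conn():
    """Hypothetical in-memory row standing in for #29183."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bus_commands (id INTEGER PRIMARY KEY, status TEXT, "
        "claimed_by TEXT, claimed_at TEXT, updated_at TEXT)"
    )
    conn.execute(
        "INSERT INTO bus_commands (id, status, claimed_by, claimed_at) "
        "VALUES (29183, 'claimed', 'demo-worker', '2026-04-14 14:30:00')"
    )
    conn.commit()
    return conn
```

A False return means the worker's state changed after the triage snapshot, which is exactly the case where a forced overwrite would have been harmful.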

5. Legacy ops_todos, 10 items: all stale

/Users/ron/.openclaw/workspace/ops_multiagent.db still holds a legacy ops_todos table with a simple schema.

id	status	created_at	title	Verdict
1	todo	2026-04-05 16:14:32	healthcheck: verify commands [TEMP_TIMEOUT+30m]	stale
2	todo	2026-04-06 02:50:47	게이트웨이 긴급 확인 [TEMP_TIMEOUT+30m] (urgent gateway check)	stale
3	todo	2026-04-06 02:50:47	memory_search 복구 [TEMP_TIMEOUT+30m] (restore memory_search)	stale
4	todo	2026-04-06 02:50:47	크론 재활성화 준비 [TEMP_TIMEOUT+30m] (prepare cron reactivation)	stale
5	todo	2026-04-06 07:48:42	Recover external model connection [TEMP_TIMEOUT+30m]	stale
6	todo	2026-04-06 15:39:23	Collect incident snapshots: gateway/bus/worker/cron	stale
7	todo	2026-04-06 15:39:23	Apply Codex model patch / restart gateway	stale
8	todo	2026-04-06 15:39:23	Enable failover & adjust retry policy for external models	stale
12	todo	2026-04-07 13:49:10	Codex 잠금 해제 및 배포 재시도 (unlock Codex and retry deploy)	stale
13	todo	2026-04-10 04:40:51	staging deploy auto-verify failures — investigation+patch	stale

This DB sits in a different location from the current canonical queue and is most likely a leftover from before the Hermes migration. It is safer to treat it as a read-only archive than to delete it right away.

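6. Summary of stale/discard recommendations

Group	Target	Recommendation
Ron drift duplicates	#29196~#29203	Cancel, keeping only #29204
Codex failed-retry duplicates	#29169, #29173	Merge into #29179, then cancel
Cowork failed duplicate	#29176	Merge into #29183, then cancel
Cowork long-stuck claim	#29183	Need to confirm whether to reclaim/reassign
Legacy workspace DB, 10 todo rows	id 1~8,12,13	Exclude from the active queue. Treat as archive/stale
pending JSON Codex	ingest.py retry/backoff	Re-dispatchable to Codex
pending JSON Cowork	ontology sync rollback	Not included in the KPI's 3 items, but separately high priority. Should be split into a follow-up queue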

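7. What Harry should look at personally today

#29204: latest drift-recovery issue
What actually needs fixing is the missing ops_syncer.py cron entry and the detected gateway outage.
The 8 older duplicates can be discarded.
#29193 / #29194 / #29195: 4 residual OpenClaw launchd services
None of the labels showed up in launchctl list.
First confirm that the plists and labels actually exist, then decide whether to restart.
Keep the condition that ai.hermes.gateway is not touched.
#29183: whether to reclaim the stuck Cowork claim
It has been in claimed state since 2026-04-14, so it is very likely stuck.

8. Verification commands (summary of originals)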

sqlite3 /Users/ron/.openclaw/data/ops_multiagent.db '.schema ops_todos'
sqlite3 /Users/ron/.openclaw/data/ops_multiagent.db '.schema bus_commands'

# canonical ops_todos does not match the KPI figure of 18
select coalesce(assigned_to,''), status, count(*)
from ops_todos
where status in ('todo','doing','blocked')
group by 1,2;

# bus_commands has exactly 18 queued rows, but the split is Ron 15 / Codex 2 / Cowork 1
select status,target_agent,count(*)
from bus_commands
where status in ('queued','claimed')
group by status,target_agent;

# dashboard API probe
curl -sS --max-time 3 http://127.0.0.1:3344/api/bus/command-queue?limit=5
# => Failed to connect to 127.0.0.1 port 3344

curl -sS --max-time 3 http://127.0.0.1:3344/api/ops/todos?limit=5
# => Failed to connect to 127.0.0.1 port 3344

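9. Self-assessment

Accuracy: 4.6/5. No single source for the KPI's 18 could be reproduced, so it was reconstructed from actual DB/JSON evidence.
Completeness: 4.5/5. Covers per-owner items, stale verdicts, Codex prompts, and Cowork delegability.
Verification: 4.4/5. SQLite/schema/log/API/launchctl checks done. DB writes were not executed, for safety.
Minimal change: 5.0/5. Only the report was produced. No DM or DB changes.

Overall: 4.6/5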

Remaining Risks

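ops_todos, bus_commands, the legacy workspace DB, and the pending JSON all exist side by side, so the KPI's source is unclear. If the same queue-count confusion recurs, the KPI SQL/script should be consolidated into a single source.
The dashboard API on port 3344 was unreachable, so the dashboard's figures could not be compared directly against the DB measurements.
drift-recovery registers the same issue in Ron's queue every 12 hours. A dedup key is needed.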
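
One possible shape for such a dedup key is a partial unique index over open alerts, so that re-registering an alert whose twin is still queued becomes a no-op. The Python sketch below assumes that (target_agent, title) identifies a repeat alert; the index name `ux_bus_open_alert` and the `enqueue_once` helper are hypothetical, and how drift-recovery actually writes to bus_commands is not shown in this report.

```python
import sqlite3


def install_dedup_index(conn):
    """Allow at most one queued row per (target_agent, title)."""
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS ux_bus_open_alert "
        "ON bus_commands(target_agent, title) WHERE status='queued'"
    )


def enqueue_once(conn, target_agent, title):
    """Insert a queued command unless an identical one is already queued.

    Returns True if a new row was inserted, False if it was a duplicate.
    """
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO bus_commands (target_agent, title, status) "
            "VALUES (?, ?, 'queued')",
            (target_agent, title),
        )
    return cur.rowcount == 1


def demo_conn():
    """Hypothetical in-memory table with only the columns used here."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bus_commands (id INTEGER PRIMARY KEY, "
        "target_agent TEXT, title TEXT, status TEXT)"
    )
    install_dedup_index(conn)
    return conn
```

Because the index only covers rows with status 'queued', a new alert can still be registered once the previous one has been handled or cancelled. Creating the index would fail while queued duplicates still exist, so the duplicate cancellation described earlier would need to run first.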
