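Harbor ships with a built-in PostgreSQL 13. Because the database runs inside a Docker container, once it fails to start you can no longer get a shell in that container with docker exec. This post covers one such failure: after the host lost power unexpectedly, the database would not start and logged "invalid primary checkpoint record".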
Symptom
The host lost power and rebooted. After Harbor was started again, harbor-db stayed in the Restarting state:
[root@mdw harbor]# docker-compose ps
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
harbor-core goharbor/harbor-core:v2.8.2 "/harbor/entrypoint.…" core 2 minutes ago Up 7 seconds (health: starting)
harbor-db goharbor/harbor-db:v2.8.2 "/docker-entrypoint.…" postgresql 2 minutes ago Restarting (1) 49 seconds ago
...
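Analysis
Check the PostgreSQL log
The log shows an "invalid primary checkpoint record" error, followed by a PANIC because no valid checkpoint record could be located: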
tail /var/log/harbor/postgresql.log
May 6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.515 UTC [1] LOG: starting PostgreSQL 13.11 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.0, 64-bit
May 6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.515 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
May 6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.515 UTC [1] LOG: listening on IPv6 address "::", port 5432
May 6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.517 UTC [1] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432"
May 6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.521 UTC [8] LOG: database system was interrupted; last known up at 2024-04-23 09:46:52 UTC
May 6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.597 UTC [8] LOG: invalid primary checkpoint record
May 6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.597 UTC [8] PANIC: could not locate a valid checkpoint record
May 6 15:54:31 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:31.419 UTC [1] LOG: startup process (PID 8) was terminated by signal 6: Aborted
May 6 15:54:31 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:31.419 UTC [1] LOG: aborting startup due to startup process failure
May 6 15:54:31 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:31.463 UTC [1] LOG: database system is shut down
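In a longer log the two relevant messages are easy to miss, so they can be searched for directly. The following is a minimal sketch; the function name find_checkpoint_failure is made up for this example, and the two patterns are simply the messages seen in the log above.

```shell
# Print the last few checkpoint-failure lines from a PostgreSQL log.
# Returns 0 if at least one such line exists, 1 otherwise.
find_checkpoint_failure() {
    local logfile="$1" matches
    # grep exits non-zero when nothing matches, which we pass on as "not found".
    matches=$(grep -E 'invalid primary checkpoint record|could not locate a valid checkpoint record' "$logfile") || return 1
    printf '%s\n' "$matches" | tail -n 5
}
```

It could be called as find_checkpoint_failure /var/log/harbor/postgresql.log, using the log path from the tail command above.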
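Locating the pg_resetwal tool
If there is no PostgreSQL 13 installation at hand on the host, look for the binary inside the container's filesystem layers.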
[root@mdw harbor]# find / -name pg_resetwal
/var/lib/docker/overlay2/967167c094b2eedecfac2671c11a307414b15d23ace87880cfc24151934afed9/diff/etc/alternatives/pg_resetwal
/var/lib/docker/overlay2/967167c094b2eedecfac2671c11a307414b15d23ace87880cfc24151934afed9/diff/usr/bin/pg_resetwal
/var/lib/docker/overlay2/967167c094b2eedecfac2671c11a307414b15d23ace87880cfc24151934afed9/diff/usr/pgsql/13/bin/pg_resetwal
/var/lib/docker/overlay2/de59cc480d55a185dba6a93723e38e31316735da79a62827a1c26e1a82517a0c/merged/etc/alternatives/pg_resetwal
/var/lib/docker/overlay2/de59cc480d55a185dba6a93723e38e31316735da79a62827a1c26e1a82517a0c/merged/usr/bin/pg_resetwal
/var/lib/docker/overlay2/de59cc480d55a185dba6a93723e38e31316735da79a62827a1c26e1a82517a0c/merged/usr/pgsql/13/bin/pg_resetwal
Try these candidates until one works. Skip the ones that are only symlinks or that fail when run.
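Trying each candidate by hand gets tedious. The sketch below uses a hypothetical helper, pick_working_binary, which prints the first path that is a real executable file and responds to --version. It assumes a binary taken from the image layers can run on the host at all, which depends on the host having compatible shared libraries.

```shell
# Print the first candidate that exists, is executable, and answers --version.
# Returns 1 if none of the candidates work.
pick_working_binary() {
    local candidate
    for candidate in "$@"; do
        # Dangling symlinks fail the -f test; non-executable files fail -x.
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            if "$candidate" --version >/dev/null 2>&1; then
                printf '%s\n' "$candidate"
                return 0
            fi
        fi
    done
    return 1
}

# Example (as root, since the overlay2 directories are not world-readable):
#   pick_working_binary $(find /var/lib/docker/overlay2 -name pg_resetwal)
```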
Repairing with pg_resetwal
pg_resetwal refuses to run as root, so switch to any unprivileged user:
su - XXXXX
The host directory that holds the PostgreSQL data is listed under volumes in Harbor's docker-compose.yml:
  postgresql:
    image: goharbor/harbor-db:v2.8.2
    container_name: harbor-db
    restart: always
    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - DAC_OVERRIDE
      - SETGID
      - SETUID
    volumes:
      - /data/database:/var/lib/postgresql/data:z
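Rather than reading the file by eye, the host side of that mount can be pulled out with sed. This is a sketch only: pg_host_dir is a hypothetical name, and it assumes the volume is written in the short host:container form shown above.

```shell
# Print the host path of the bind mount whose container side is
# /var/lib/postgresql/data, reading a docker-compose.yml given as $1.
pg_host_dir() {
    sed -n 's|^[[:space:]]*-[[:space:]]*\(/[^:]*\):/var/lib/postgresql/data.*|\1|p' "$1"
}
```

With the excerpt above, pg_host_dir docker-compose.yml would print /data/database.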
Note the directory's current owner so that it can be restored later, then change ownership to the user you switched to:
chown -R XXX:XXX /data/database
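One way to note the original owner before running the chown above is to capture it in a shell variable. A minimal sketch, assuming GNU stat as found on typical Linux hosts; owner_of is a hypothetical helper name.

```shell
# Print the numeric "uid:gid" owner of a path (GNU stat syntax).
owner_of() {
    stat -c '%u:%g' "$1"
}

# Example, run as root before changing ownership:
#   old_owner=$(owner_of /data/database)
# and after the repair:
#   chown -R "$old_owner" /data/database
```

Numeric IDs are used deliberately, because the user that owns the files inside the container may not exist by name on the host.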
Run the repair command:
XXXX/pg_resetwal -f /data/database/pg13/
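If the current user lacks permission to execute the binary found earlier, copy it as root to a location that user can access and run it from there.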
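pg_resetwal -f rewrites the control and WAL state in place, so it may be worth taking a copy of the data directory first and previewing the reset with the tool's -n (dry run) option. The helper below is only a sketch; backup_datadir is a hypothetical name, and it assumes there is enough free disk space for a full copy.

```shell
# Copy a data directory to a timestamped sibling directory, preserving
# ownership and modes, and print the path of the copy.
backup_datadir() {
    local src="${1%/}" dest
    dest="${src}.bak.$(date +%Y%m%d%H%M%S)"
    cp -a "$src" "$dest" && printf '%s\n' "$dest"
}

# Example:
#   backup_datadir /data/database/pg13/             # as root, before the reset
#   XXXX/pg_resetwal -n /data/database/pg13/        # show values, change nothing
```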
After the repair, remember to restore the directory's original owner:
chown -R OLDXXX:OLDXXX /data/database
Restart Harbor
docker-compose up -d
Check that the services are healthy again.
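A quick way to check is to filter the docker-compose ps output for anything still restarting. This sketch assumes the column layout shown in the Symptom section, with the container name in the first column; restarting_services is a hypothetical name.

```shell
# Read `docker-compose ps` output on stdin and print the names of
# containers whose status line still says "Restarting".
restarting_services() {
    # NR > 1 skips the header row; $1 is the NAME column.
    awk 'NR > 1 && /Restarting/ { print $1 }'
}

# Example:
#   docker-compose ps | restarting_services
```

Empty output means no container is stuck in a restart loop.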
If startup now fails with "replication checkpoint has wrong magic 324508367 instead of 307747550", delete the replication origin checkpoint file:
cd /data/database/pg13/pg_logical/
rm replorigin_checkpoint
Then start Harbor again and check.
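A more cautious variant of the rm above is to move the file out of the data directory, so it can be put back if needed. This is a sketch under that assumption; set_aside_replorigin_checkpoint and the keep directory are hypothetical.

```shell
# Move pg_logical/replorigin_checkpoint out of the data directory ($1)
# into a keep directory ($2). Returns 1 if the file does not exist.
set_aside_replorigin_checkpoint() {
    local datadir="${1%/}" keepdir="$2" file
    file="$datadir/pg_logical/replorigin_checkpoint"
    [ -f "$file" ] || return 1
    mkdir -p "$keepdir" && mv "$file" "$keepdir/"
}

# Example:
#   set_aside_replorigin_checkpoint /data/database/pg13/ /root/pg13-set-aside
```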
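Summary
The containerized PostgreSQL has no replica, so data safety depends entirely on the files on the local disk. With PostgreSQL installed directly on a host, the repair tools are readily at hand; with it running in a container, getting to them takes a few detours.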