南大通用 (GBase) GCDW tech stack: containerized harbor-db PostgreSQL stuck in Restarting after a power loss

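Harbor ships with a built-in PostgreSQL 13. Because it runs inside a Docker container, once it fails to start you can no longer log in to the container with docker exec. This post covers a case where the host lost power unexpectedly, after which the database would not start and reported "invalid primary checkpoint record".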

Symptom

After the host lost power and rebooted, Harbor was started again, but harbor-db stayed in the Restarting state.

[root@mdw harbor]# docker-compose  ps
NAME                IMAGE                                COMMAND                  SERVICE             CREATED             STATUS                            PORTS
harbor-core         goharbor/harbor-core:v2.8.2          "/harbor/entrypoint.…"   core                2 minutes ago       Up 7 seconds (health: starting)
harbor-db           goharbor/harbor-db:v2.8.2            "/docker-entrypoint.…"   postgresql          2 minutes ago       Restarting (1) 49 seconds ago
...
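
To pick out the stuck container without reading the whole table, a small filter can help. This is a minimal sketch: restarting_containers is a hypothetical helper name, and the usage line assumes a Docker version whose "docker ps --format" supports the .State field.

```shell
# Print the names of containers whose state is "restarting".
# Reads "name state" lines on stdin, so it can be tried without Docker.
restarting_containers() {
    awk '$2 == "restarting" { print $1 }'
}

# Usage on the Docker host (assumes .State is available in --format):
#   docker ps -a --format '{{.Names}} {{.State}}' | restarting_containers
```

On a host with Docker, "docker ps -a --filter status=restarting" gives a similar answer directly.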

Analysis

Check the PostgreSQL log.

It shows the error "invalid primary checkpoint record", followed by a PANIC.

tail  /var/log/harbor/postgresql.log
May  6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.515 UTC [1] LOG:  starting PostgreSQL 13.11 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.0, 64-bit
May  6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.515 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
May  6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.515 UTC [1] LOG:  listening on IPv6 address "::", port 5432
May  6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.517 UTC [1] LOG:  listening on Unix socket "/run/postgresql/.s.PGSQL.5432"
May  6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.521 UTC [8] LOG:  database system was interrupted; last known up at 2024-04-23 09:46:52 UTC
May  6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.597 UTC [8] LOG:  invalid primary checkpoint record
May  6 15:54:30 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:30.597 UTC [8] PANIC:  could not locate a valid checkpoint record
May  6 15:54:31 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:31.419 UTC [1] LOG:  startup process (PID 8) was terminated by signal 6: Aborted
May  6 15:54:31 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:31.419 UTC [1] LOG:  aborting startup due to startup process failure
May  6 15:54:31 172.22.0.1 postgresql[234137]: 2024-05-06 07:54:31.463 UTC [1] LOG:  database system is shut down
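
When the log is long, it can be quicker to show only the fatal entries. A minimal sketch, assuming the log format shown above; pg_fatal_lines is a hypothetical helper name.

```shell
# Print only the PANIC and FATAL lines from a PostgreSQL log file.
pg_fatal_lines() {
    grep -E ' (PANIC|FATAL):' "$1"
}

# Usage:
#   pg_fatal_lines /var/log/harbor/postgresql.log
```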

Finding the pg_resetwal tool

If the host has no PostgreSQL 13 installation of its own, look for the binary in the container's filesystem.

[root@mdw harbor]# find / -name pg_resetwal
/var/lib/docker/overlay2/967167c094b2eedecfac2671c11a307414b15d23ace87880cfc24151934afed9/diff/etc/alternatives/pg_resetwal
/var/lib/docker/overlay2/967167c094b2eedecfac2671c11a307414b15d23ace87880cfc24151934afed9/diff/usr/bin/pg_resetwal
/var/lib/docker/overlay2/967167c094b2eedecfac2671c11a307414b15d23ace87880cfc24151934afed9/diff/usr/pgsql/13/bin/pg_resetwal
/var/lib/docker/overlay2/de59cc480d55a185dba6a93723e38e31316735da79a62827a1c26e1a82517a0c/merged/etc/alternatives/pg_resetwal
/var/lib/docker/overlay2/de59cc480d55a185dba6a93723e38e31316735da79a62827a1c26e1a82517a0c/merged/usr/bin/pg_resetwal
/var/lib/docker/overlay2/de59cc480d55a185dba6a93723e38e31316735da79a62827a1c26e1a82517a0c/merged/usr/pgsql/13/bin/pg_resetwal

Pick a usable one from the results; skip any that are just links or that fail with an error when run.
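
The trial and error can be scripted. The sketch below is one possible approach, not the method used in this post: first_working_binary is a hypothetical helper that returns the first candidate which is an executable file and answers --version. A binary taken from the image may still fail on the host if its shared libraries are missing, which this check would also catch.

```shell
# Print the first candidate that is an executable regular file
# (dangling symlinks fail the -f test) and that runs --version successfully.
first_working_binary() {
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ -x "$candidate" ] \
            && "$candidate" --version >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Usage (paths under overlay2 contain no spaces, so word splitting is safe here):
#   first_working_binary $(find /var/lib/docker/overlay2 -name pg_resetwal)
```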

Repairing with pg_resetwal

The tool refuses to run as root, so switch to any non-root user first.

su - XXXXX

The host directory that holds the PostgreSQL data is listed in Harbor's docker-compose.yml:

  postgresql:
    image: goharbor/harbor-db:v2.8.2
    container_name: harbor-db
    restart: always
    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - DAC_OVERRIDE
      - SETGID
      - SETUID
    volumes:
      - /data/database:/var/lib/postgresql/data:z

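Note the directory's current owner so it can be restored later, then change the owner to the user we switched to (chown itself has to be run as root).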

chown -R XXX:XXX /data/database
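
To avoid having to remember the old owner, it can be saved to a file before running the chown above. A minimal sketch, assuming GNU coreutils stat; owner_of and the /root/database.owner path are hypothetical. Numeric IDs are used because the container's database user may have no name on the host.

```shell
# Print the numeric "uid:gid" owner of a path (GNU coreutils stat).
owner_of() {
    stat -c '%u:%g' "$1"
}

# Usage, as root, before changing ownership:
#   owner_of /data/database > /root/database.owner
# and after the repair:
#   chown -R "$(cat /root/database.owner)" /data/database
```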

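Run the repair command. The -f option forces the reset even when pg_resetwal cannot determine valid values from the control file; recently committed transactions may be lost.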

XXXX/pg_resetwal -f /data/database/pg13/

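If the current user has no permission to execute the binary found earlier, copy it as root to a location that user can access and run it from there.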
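
Because a forced reset cannot be undone, a more cautious variant would copy the data directory first and do a dry run before the real command. The sketch below is an addition, not a step taken in this post; backup_datadir is a hypothetical helper, and XXXX stands for the same tool path as above.

```shell
# Copy a data directory to "<dir>.bak-<timestamp>", preserving ownership and modes.
# Prints the path of the backup on success.
backup_datadir() {
    src=${1%/}
    dest="$src.bak-$(date +%Y%m%d%H%M%S)"
    cp -a "$src" "$dest" && printf '%s\n' "$dest"
}

# Usage, with the harbor-db container stopped:
#   backup_datadir /data/database/pg13
#   XXXX/pg_resetwal -n /data/database/pg13/   # dry run: prints values, changes nothing
```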

After the repair, remember to restore the directory's original owner.

chown -R OLDXXX:OLDXXX /data/database

Restart the Harbor services

docker-compose up -d

Confirm that the services are back to normal.
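
The check can be automated by polling the container state. A minimal sketch: wait_until_running is a hypothetical helper that runs whatever status command it is given until that command prints "running". A container in a restart loop can report "running" for a moment, so a positive result is worth confirming with a second look at docker-compose ps.

```shell
# Run a status command repeatedly until it prints "running".
# $1 = maximum attempts, $2 = seconds between attempts, rest = the command.
wait_until_running() {
    tries=$1; delay=$2; shift 2
    while [ "$tries" -gt 0 ]; do
        [ "$("$@" 2>/dev/null)" = "running" ] && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Usage:
#   wait_until_running 30 2 docker inspect --format '{{.State.Status}}' harbor-db
```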

If startup then fails with "replication checkpoint has wrong magic 324508367 instead of 307747550",

delete the replication origin checkpoint file:

cd /data/database/pg13/pg_logical/
rm replorigin_checkpoint
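
Moving the file out of the data directory instead of deleting it keeps a copy in case it is needed later. A minimal sketch; set_aside and the /root/pg13-quarantine directory are hypothetical.

```shell
# Move a file into a holding directory, adding a timestamp to its name.
# $1 = file to move, $2 = holding directory (created if missing).
set_aside() {
    mkdir -p "$2" && mv -- "$1" "$2/$(basename "$1").$(date +%Y%m%d%H%M%S)"
}

# Usage:
#   set_aside /data/database/pg13/pg_logical/replorigin_checkpoint /root/pg13-quarantine
```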

Then start it again and check.

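Summary

A containerized PostgreSQL like this has no replica, so data safety depends entirely on the files on the local disk. On a plain host all the tools would be readily at hand; because the database runs in a container, the repair took a few detours.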