[Pgcluster-general] pgcluster problems after ungraceful shutdown of a replicate server

Chris Price cprice at its.to
Mon Jan 15 03:12:04 UTC 2007


Scenario setup:

    (configs for all servers are at the end of this email)
    2 cluster DBs: db01 & db02
    2 replication servers in cascade config: app01 & app02 (app01 is
'upper', app02 is 'lower')

    'testdb' was created and initialized on db01 (pgcbench -i testdb) and
is viewable on db02.
    Running pgcbench from db01 to put load on the cluster
(./pgcbench -c 90 -t 1000 testdb) produces transactions that are
viewable in db02's 'testdb' database.

TEST GRACEFUL FAILOVER of repl servers:

Shut down pgreplicate on app01. app02's pgreplicate immediately starts
handling replication functions, with no apparent issues in the failover
from app01 (upper) to app02 (lower). Then bring app01 (upper) back into
service and gracefully shut down app02 (lower): app01 resumes handling
replication functions, again with no apparent issues.

TEST SYSTEM FAILURE of one repl server:

Physically unplug the ethernet cable on app01. app02 DOES NOT start
handling repl functions until 30-120 seconds later. The pgcbench output
ceases on db01, and app02 (lower repl) seems to get stuck in a
'replication loop', attempting to process the same set of transactions
over and over again. Further, the database instances on db01 and db02 no
longer respond to shutdown requests via the 'stop' command; they must be
killed with 'kill -9'. In fact, from my moderate amount of testing, it
seems that all replication servers and db servers must be shut down
(gracefully if they accept it, kill -9 if they don't) and the entire
setup brought back up from scratch to restore correct cluster and repl
server operation.
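
To put firmer numbers on the 30-120 second delay, something like the
following could be left running on a third machine while the cable is
pulled. It is only a rough sketch: the function names are made up for
illustration, and a successful TCP connect shows only that a port is
accepting connections, not that pgreplicate is actually replicating.

```python
import socket
import time


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def watch(targets, interval=1.0, rounds=None, check=port_is_open):
    """Poll each (host, port) pair and record every up/down transition.

    Runs forever when rounds is None; otherwise stops after that many passes.
    Returns a list of (unix_time, host, port, is_up) tuples.
    """
    state = {}
    changes = []
    done = 0
    while rounds is None or done < rounds:
        for host, port in targets:
            up = check(host, port)
            if state.get((host, port)) != up:
                state[(host, port)] = up
                changes.append((time.time(), host, port, up))
                stamp = time.strftime("%H:%M:%S")
                print("%s %s:%s is %s" % (stamp, host, port, "UP" if up else "DOWN"))
        done += 1
        time.sleep(interval)
    return changes
```

With the ports from the configs at the end of this email, a call such as
watch([("app01", 8001), ("app02", 8001), ("db01", 5432), ("db02", 5432)])
would print a timestamped line each time one of them changes state. With
an unplugged cable the connect attempt to app01 will normally time out
rather than be refused, so the timeout value limits how precisely the
moment of failure can be measured.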

    Thoughts?

    Has anyone tested any disaster scenarios involving the failure of
one pgreplicate server in a cascade config? Graceful shutdown is all
fine and dandy, but I'm worried about what's going to happen when I have
a server crash, a failed network switch or some other 'bad' event.

    Further, could some kind souls on this list please review my
configurations and help me shed some light on this problem?

    Chris

db01 cluster.conf

<Replicate_Server_Info>
        <Host_Name> app01 </Host_Name>
        <Port> 8001 </Port>
        <Recovery_Port> 8101 </Recovery_Port>
</Replicate_Server_Info>
<Replicate_Server_Info>
        <Host_Name> app02 </Host_Name>
        <Port> 8001 </Port>
        <Recovery_Port> 8101 </Recovery_Port>
</Replicate_Server_Info>
<Host_Name> db01 </Host_Name>
<Recovery_Port> 7001 </Recovery_Port>
<Rsync_Path> /usr/bin/rsync </Rsync_Path>
<Rsync_Option> ssh </Rsync_Option>
<Rsync_Compress> yes </Rsync_Compress>
<Pg_Dump_Path> /clientdata/pgcluster/bin/pg_dump</Pg_Dump_Path>
<When_Stand_Alone> read_only </When_Stand_Alone>
<Replication_Timeout> 1min </Replication_Timeout>
 
end db01 cluster.conf

db02 cluster.conf
<Replicate_Server_Info>
        <Host_Name> app01 </Host_Name>
        <Port> 8001 </Port>
        <Recovery_Port> 8101 </Recovery_Port>
</Replicate_Server_Info>
<Replicate_Server_Info>
        <Host_Name> app02 </Host_Name>
        <Port> 8001 </Port>
        <Recovery_Port> 8101 </Recovery_Port>
</Replicate_Server_Info>
<Host_Name> db02 </Host_Name>
<Recovery_Port> 7001 </Recovery_Port>
<Rsync_Path> /usr/bin/rsync </Rsync_Path>
<Rsync_Option> ssh </Rsync_Option>
<Rsync_Compress> yes </Rsync_Compress>
<Pg_Dump_Path> /clientdata/pgcluster/bin/pg_dump</Pg_Dump_Path>
<When_Stand_Alone> read_only </When_Stand_Alone>
<Replication_Timeout> 1min </Replication_Timeout>

end db02 cluster.conf

app01 pgreplicate.conf (upper replicator)

<Cluster_Server_Info>
    <Host_Name>   db01  </Host_Name>
    <Port>        5432                </Port>
    <Recovery_Port>       7001        </Recovery_Port>
</Cluster_Server_Info>
<Cluster_Server_Info>
    <Host_Name>   db02 </Host_Name>
    <Port>        5432                </Port>
    <Recovery_Port>       7001        </Recovery_Port>
</Cluster_Server_Info>
<Host_Name> app01        </Host_Name>
<Replication_Port>      8001    </Replication_Port>
<Recovery_Port>         8101    </Recovery_Port>
<RLOG_Port>             8301    </RLOG_Port>
<Response_Mode>         normal  </Response_Mode>
<Use_Replication_Log>   yes     </Use_Replication_Log>
<Replication_Timeout>   1min    </Replication_Timeout>
<Error_Log_File> /clientdata/pgcluster/var/log/pgreplicate.log </Error_Log_File>
<Log_File_Info>
        <File_Name> /clientdata/pgcluster/var/log/pgreplicate.log </File_Name>
        <File_Size> 10M </File_Size>
        <Rotate> 3 </Rotate>
</Log_File_Info>


end app01 pgreplicate.conf

app02 pgreplicate.conf (lower replicator)

<Cluster_Server_Info>
    <Host_Name>   db01  </Host_Name>
    <Port>        5432                </Port>
    <Recovery_Port>       7001        </Recovery_Port>
</Cluster_Server_Info>
<Cluster_Server_Info>
    <Host_Name>   db02 </Host_Name>
    <Port>        5432                </Port>
    <Recovery_Port>       7001        </Recovery_Port>
</Cluster_Server_Info>
<Replicate_Server_Info>
        <Host_Name> app01 </Host_Name>
        <Port> 8002 </Port>
        <Recovery_Port> 8102 </Recovery_Port>
</Replicate_Server_Info>
<Host_Name> app02        </Host_Name>
<Replication_Port>      8001    </Replication_Port>
<Recovery_Port>         8101    </Recovery_Port>
<RLOG_Port>             8301    </RLOG_Port>
<Response_Mode>         normal  </Response_Mode>
<Use_Replication_Log>   yes     </Use_Replication_Log>
<Replication_Timeout>   1min    </Replication_Timeout>
<Log_File_Info>
        <File_Name> /tmp/pgreplicate.log </File_Name>
        <File_Size> 10M </File_Size>
        <Rotate> 3 </Rotate>
</Log_File_Info>

end app02 pgreplicate.conf
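
For anyone reviewing the configs, one thing that can be checked
mechanically is whether every Replicate_Server_Info entry uses the same
ports that the named replicator declares for itself. The sketch below
does that with regular expressions, since these files have no single
root element. The helper names are made up, and it assumes that <Port>
in a Replicate_Server_Info block is meant to match the target's
<Replication_Port>, which I have not confirmed against the pgcluster
documentation.

```python
import re

NESTED = re.compile(
    r"<(Cluster_Server_Info|Replicate_Server_Info)>.*?</\1>", re.S)
REPL_BLOCK = re.compile(
    r"<Replicate_Server_Info>(.*?)</Replicate_Server_Info>", re.S)


def tag(text, name):
    """Return the trimmed value of the first <name>...</name>, or None."""
    m = re.search(r"<%s>\s*(.*?)\s*</%s>" % (name, name), text, re.S)
    return m.group(1) if m else None


def listening(conf):
    """(host, replication_port, recovery_port) a pgreplicate.conf declares
    for itself, ignoring values inside nested server blocks."""
    top = NESTED.sub("", conf)
    return (tag(top, "Host_Name"),
            tag(top, "Replication_Port"),
            tag(top, "Recovery_Port"))


def references(conf):
    """(host, port, recovery_port) for each Replicate_Server_Info block."""
    return [(tag(b, "Host_Name"), tag(b, "Port"), tag(b, "Recovery_Port"))
            for b in REPL_BLOCK.findall(conf)]


def mismatches(replicator_confs, named_confs):
    """Compare every reference in named_confs ({label: text}) with what the
    replicators in replicator_confs (list of text) say they listen on."""
    actual = {}
    for conf in replicator_confs:
        host, port, recovery = listening(conf)
        actual[host] = (port, recovery)
    problems = []
    for label, conf in named_confs.items():
        for host, port, recovery in references(conf):
            if host in actual and actual[host] != (port, recovery):
                problems.append((label, host, (port, recovery), actual[host]))
    return problems
```

Fed the four files above, this would report a single difference: app02's
pgreplicate.conf refers to app01 on ports 8002/8102, while app01 itself
and both cluster.conf files use 8001/8101. Whether that difference has
anything to do with the failover behaviour is not established here.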


