[SunRay-Users] Restart of one server in failover group causes the whole group to have a downtime...

Bob Doolittle Robert.Doolittle at Sun.COM
Tue Mar 3 15:55:54 EET 2009


Jens Langner wrote:
> Hi,
>
> For several months we have been struggling with a severe problem with
> our SunRay servers. AFAIR it started even with the very latest patch
> for SRSS 4.0, and now we are seeing the issue on all our SunRay servers
> (including all the latest 4.1 versions).
>
> The problem is that as soon as I shut down or restart one server in a
> failover group (no matter whether it is a Linux or Solaris one), after
> some seconds all other servers in the group lose their group status and
> thus disconnect all Sun Rays. Only an immediately issued "utrestart"
> restores the sessions and group status and allows the users to return
> to their sessions.
>
> Here I can perfectly reproduce the problem by shutting down one server
> in a failover group. Afterwards on all other servers in the group the
> "utgstatus" command returns the following information:
>
> root@saturn:~# utgstatus
> Error: Could not get gstatus information from server saturn
>
> Unfortunately, I haven't had the time yet to debug this any further by
> increasing the debug level. But I would like to ask here whether anyone
> else has the same trouble, whether this is a known issue, and whether
> there is a fix for this strange behaviour.
>

Do you use card registration? Do you delete registrations regularly?

We saw an issue like this some time ago when a large site did frequent
registration deletions. The Sun Ray Data Store's data got fragmented and
the database indexing became poor. This meant that when a server was
shut down, and all Sun Rays connected to it attempted to connect to the
remaining servers, there was a large volume of connections, which
resulted in a large volume of SRDS lookups, which took a long time due
to the poor indexing. The Sun Rays eventually timed out and then
reconnected, making the problem worse by adding more lookups to the
queue. IIRC the lookups may even have starved heartbeat processing,
causing other Sun Rays to disconnect and attempt to reconnect.
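
To make the feedback loop above concrete, here is a rough toy model in
Python. It is only a sketch under stated assumptions, not how SRSS or
SRDS is actually implemented: a single FIFO lookup queue, every client
submitting one lookup at the moment of failover, and a client that
gives up after a fixed timeout and resubmits while its stale request
stays in the queue. The function name, the parameters and all numbers
are made up for illustration; heartbeat starvation is not modelled.

```python
from collections import deque


def simulate_failover(clients, service_ms, timeout_ms, max_lookups=10_000):
    """Toy model of a lookup queue under a reconnect storm (hypothetical).

    Every client submits one lookup at t=0. The server handles lookups
    in FIFO order, each costing service_ms. If a lookup completes more
    than timeout_ms after it was submitted, the client has already
    given up and resubmitted, so the server's work was wasted.
    """
    queue = deque((0, client) for client in range(clients))
    clock = 0
    served = 0
    wasted = 0
    lookups = 0
    while queue and lookups < max_lookups:
        submitted, client = queue.popleft()
        # The server cannot start before the request exists.
        clock = max(clock, submitted) + service_ms
        lookups += 1
        if clock - submitted <= timeout_ms:
            served += 1
        else:
            # Stale request: the client retried at submitted + timeout_ms.
            wasted += 1
            queue.append((submitted + timeout_ms, client))
    return {"served": served, "wasted": wasted, "pending": len(queue)}


# Illustrative numbers only: 100 clients, 5 second client timeout.
healthy = simulate_failover(100, service_ms=10, timeout_ms=5000)
degraded = simulate_failover(100, service_ms=500, timeout_ms=5000,
                             max_lookups=1000)
print("healthy index: ", healthy)
print("degraded index:", degraded)
```

In this model, with fast lookups the whole burst drains well inside the
timeout. With slow lookups only the first few clients get through;
everyone else retries, and because the server keeps working through
stale requests, the backlog never clears until something resets it.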

This is CR 6540012: "SRDS DBM files need periodic reindexing". Since 
we've never encountered the problem again this CR hasn't gotten a high 
priority. It would be good to know if this is your problem.

Please let us know if you do card registration and if you frequently 
delete old registrations. If so, I can send you a procedure that was 
used at the time to re-index SRDS and we can see if that resolves your 
problem. If it does we can investigate re-adjusting the priority on that 
defect report.

-Bob


