[SunRay-Users] Restart of one server in failover group causes the whole group to have a downtime...
Bob Doolittle
Robert.Doolittle at Sun.COM
Tue Mar 3 15:55:54 EET 2009
Jens Langner wrote:
> Hi,
>
> for several months we have been struggling with a severe problem with our
> SunRay servers. AFAIR it started with the very latest patch for
> SRSS 4.0, and we now have the issue on all our SunRay servers
> (including all the latest 4.1 versions).
>
> The problem is that as soon as I shut down or restart one server in a
> failover group (no matter whether it is a Linux or Solaris one), after a
> few seconds all other servers in the group lose their group status and
> thus disconnect all Sun Rays. Only an immediately issued "utrestart"
> restores the sessions and group status and allows the users to return to
> their sessions.
>
> Here I can perfectly reproduce the problem by shutting down one server
> in a failover group. Afterwards on all other servers in the group the
> "utgstatus" command returns the following information:
>
> root@saturn:~# utgstatus
> Error: Could not get gstatus information from server saturn
>
> Unfortunately, I haven't had the time yet to debug this any further by
> increasing the debug level. But I would like to ask here whether anyone
> else has the same trouble, whether this is a known issue, and whether
> there is a fix for this strange behaviour.
>
Do you use card registration? Do you delete registrations regularly?
We saw an issue like this some time ago when a large site did frequent
registration deletions. The Sun Ray Data Store's data got fragmented and
the database indexing became poor. This meant that when a server was
shut down and all the Sun Rays connected to it tried to connect to the
remaining servers, the resulting flood of connections produced a large
volume of SRDS lookups, which took a long time because of the poor
indexing. The Sun Rays eventually timed out and then reconnected, making
the problem worse by adding more lookups to the queue. IIRC the lookups
may even have starved heartbeat processing, causing other Sun Rays to
disconnect and attempt to reconnect.
This is CR 6540012: "SRDS DBM files need periodic reindexing". Since
we've never encountered the problem again this CR hasn't gotten a high
priority. It would be good to know if this is your problem.
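To make the feedback loop described above concrete, here is a toy Python
model of it. This is only a sketch under stated assumptions, not how SRSS
or SRDS actually work: the single FIFO lookup queue, the function name
simulate_reconnect_storm, and all the numbers (100 clients, a 5 second
timeout, 10 ms vs. 500 ms per lookup) are made up for illustration. Each
client submits a lookup at t = 0; if the lookup is not finished before the
client's timeout, the client resubmits, but the server still pays for the
stale lookup.

```python
import heapq


def simulate_reconnect_storm(clients, lookup_ms, timeout_ms, max_lookups=1000):
    """Toy model: one server works through a queue of lookups in submit order.

    All parameters are hypothetical. Returns (connected_clients, lookups_done).
    """
    # Every client submits its first lookup at t = 0 ms.
    queue = [(0, client) for client in range(clients)]
    heapq.heapify(queue)

    now = 0          # server clock in ms
    connected = 0    # clients whose lookup finished before they timed out
    lookups = 0      # total lookups the server performed, stale ones included

    while queue and lookups < max_lookups:
        submitted, client = heapq.heappop(queue)
        # The server cannot tell that a request is stale, so it does the work.
        now = max(now, submitted) + lookup_ms
        lookups += 1
        if now <= submitted + timeout_ms:
            connected += 1
        else:
            # The client gave up at its deadline and submitted a new request.
            heapq.heappush(queue, (submitted + timeout_ms, client))

    return connected, lookups


if __name__ == "__main__":
    # Fast lookups (healthy index): the burst is absorbed in one pass.
    print(simulate_reconnect_storm(100, lookup_ms=10, timeout_ms=5000))
    # Slow lookups (poor index): retries pile up faster than they are served.
    print(simulate_reconnect_storm(100, lookup_ms=500, timeout_ms=5000))
```

In this model the fast case connects all 100 clients with 100 lookups,
while the slow case connects only the first 10 and then spends every
remaining lookup on requests whose clients have already given up, so the
queue never drains until the lookup cap stops the run.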
Please let us know if you do card registration and if you frequently
delete old registrations. If so, I can send you a procedure that was
used at the time to re-index SRDS and we can see if that resolves your
problem. If it does we can investigate re-adjusting the priority on that
defect report.
-Bob