Document ID   Synopsis   Date
ID73104   Troubleshooting loadbalancing on Sun Ray[TM]   5 Mar 2004

 

 


Keyword(s): sunray, Sun Ray, loadbalancing, load balancing, failover

The load is unbalanced between several servers in a Sun Ray[TM] failover group.

Definitions and Abbreviations:
==============================
load balancing: the process of distributing Sun Ray sessions over the
   Sun Ray servers in a failover group.

group manager: the part of the authentication manager which is
   responsible for load balancing.

trusted: Sun Ray servers which share a group signature are trusted,
   and are considered part of the same failover group for
   load balancing purposes.

active session: a Sun Ray session where a user is logged in.

idle session: a Sun Ray session waiting at the dtgreet screen,
   or the utselect -L GUI.

SRSS: Sun Ray Server Software.

NSCM: Non Smartcard Mobility (SRSS 1.3 and higher only).

Background:
===========

The load balancing algorithm in SRSS 1.2 and higher works as follows.
When a token is presented, the authentication manager (utauthd) checks
whether an active session for the token is available on any of the Sun Ray
servers in the failover group. If no active session is available, a
load balancing decision will be made. Idle sessions are ignored at this stage.

For the load balancing decision, various load-related parameters and the
server's total CPU power are combined into a parameter called
"desirability". A weighted random selection is then made among all
"online" servers in the same group, so that the token is more likely to be
redirected to a server with a higher desirability. Once the token has been
redirected to a Sun Ray server according to the load balancing decision, the
authentication manager on that server checks whether an idle session
already exists there for the redirected token. Only if no such idle
session exists does the authentication manager initiate a new session.
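
The weighted random selection described above can be illustrated with a
short Python sketch. This is not the SRSS implementation: the way the group
manager computes desirability is not described here, so the desirability
values, the Server class, the pick_server function and the server names are
all hypothetical. The sketch only shows that a server with a higher
desirability is chosen more often, and that "offline" servers are skipped
as long as at least one "online" server is available.

```python
import random
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    desirability: float  # hypothetical positive value, not the real SRSS formula
    online: bool = True


def pick_server(servers, rng=random):
    """Weighted random choice among online servers.

    Offline servers are only considered when no online server is available.
    """
    candidates = [s for s in servers if s.online] or list(servers)
    if not candidates:
        raise ValueError("no servers in the failover group")
    weights = [s.desirability for s in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Hypothetical failover group: srv-c is offline and should never be picked.
    group = [
        Server("srv-a", 3.0),
        Server("srv-b", 1.0),
        Server("srv-c", 5.0, online=False),
    ]
    rng = random.Random(0)
    counts = {}
    for _ in range(10000):
        name = pick_server(group, rng).name
        counts[name] = counts.get(name, 0) + 1
    print(counts)
```

With these made-up weights, srv-a should receive roughly three times as many
tokens as srv-b, while no single login is guaranteed to land on srv-a.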

The weighted random selection is part of the load balancing algorithm to
prevent all sessions from ending up on the same server when many users log
in simultaneously, say, around 08:30 in the morning when everybody gets
into the office.

There has not been a single critical bug in Sun Ray load balancing in
SRSS 1.2 or higher. The typical root causes for poor load balancing are:

- A misconfiguration which simply turned off load balancing. See
  section "Checking configuration".

- A Sun Ray server has been turned "offline", and thus
  is ignored during load balancing unless no "online" server is up.

- A Sun Ray server is in a dysfunctional state where it does not
  accept new sessions, for example because utauthd hangs or is unresponsive.

- A network problem or network misconfiguration. See 
  section "Checking configuration".

- Poor initial load balancing, which is likely when 
  "pseudo terminal" sessions rather than NSCM are used. See section 
  "Load balancing limitations".

- A misunderstanding of what load balancing can achieve. See
  section "Load balancing limitations".

- The EOLed SRSS 1.1 is used. This release provided inferior load balancing.


Checking configuration:
=======================
1) The servers should be running in configured mode, that is, utconfig
  should have been run. Thus, check whether /etc/opt/SUNWut/utadmin.conf
  exists.

2) When running utconfig to configure a Sun Ray server for failover,
  failover should be selected, and the same group signature must be
  entered for all servers in a Sun Ray failover group.
     [...]
     Configure this server for a failover group? (y/[n])? y
     About to configure the following software products:
     [...]
     Failover group: yes  <----
     [...]
     You have chosen to configure this server for a failover group.

     All servers in a failover group must share a unique signature, 
     which is a string of 8 or more characters where at least two 
     characters are letters and at least one is not.

     Enter signature: 
     Re-enter signature: 
     [...]
  utconfig writes a logfile to /var/adm/log. Check this logfile to
  confirm that failover was selected.

3) All servers in the failover group must have the same group signature.
  Use utreplica on the primary server to get the list
  of Sun Ray servers which are in the same failover group,
  then check in the utgstatus output whether these servers are trusted.
  Also check that every trusted server visible in the utgstatus output
  is listed as part of the failover group in the utreplica output. If the
  group signatures do not match, use /opt/SUNWut/sbin/utgroupsig to fix
  them.

4) The "-g" flag must be set in the policy. Furthermore, the policy
  must essentially be identical across all servers in a failover group.
  On 1.x, check the utglpolicy output, and check whether the utpolicy
  output is identical to it, except possibly for token reader (-t) options.
  On 2.0, utglpolicy is obsolete; check the utpolicy output only.
  Note: on a 1.x Sun Ray failover group, either the admin GUI or
  /opt/SUNWut/sbin/utglpolicy must be used to change the policy.

5) Sun strongly recommends using Non Smartcard Mobility (NSCM) rather than
  "pseudo terminal" sessions to get good loadbalancing. NSCM is available
  in 1.3 and higher, and can be turned on by the "-M" policy flag. NSCM
  also provides hot desking without the use of smartcards.

6) In /etc/opt/SUNWut/auth.props, ensure that the group manager and
  loadbalancing are not disabled. If the following parameters are
  set, they must have the listed values:

  + enableLoadBalancing = true
  + enableGroupManager = true
  + useLocalPolicy = false (SRSS 1.x only).

  Furthermore, it is strongly recommended that all servers in a
  Sun Ray failover group have identical auth.props files.
  Note: if these values are wrong, this is a strong indication
  that the Sun Ray server was not configured for failover when
  utconfig was run.

7) Check that all Sun Ray servers in the failover group are
  "online". See
    71443  How to check whether a Sun Ray[TM] server is "offline".

8) All Sun Ray interfaces which have Sun Ray appliances connected
  to them must be up and reachable. If a Sun Ray appliance is connected
  to interface a of Sun Ray server A, and Sun Ray server A cannot
  contact the group manager of Sun Ray server B through this
  interface a, then Sun Ray server A will not load balance this
  Sun Ray appliance to Sun Ray server B, because it does not know
  whether the Sun Ray appliance can reach server B. Thus, check in the
  utgstatus output whether all Sun Ray interfaces are up and reachable.
  Also check /var/opt/SUNWut/log/auth_log* for "token query timed out"
  messages, such as this:

  01/15/2004 23:52:03 token query timed out to host labhost2 interface 192.168.128.2

  Here, labhost2 was unreachable on interface 192.168.128.2, and thus 
  this interface was ignored during load balancing.
  Note: such an issue is frequently caused by network components,
  like bad firmware, or a bad port on a switch.

9) If different network interfaces are connected to the
  same physical switch, the network interfaces must have
  different ethernet addresses.

10) If a network issue is likely, check "/usr/bin/netstat -in" output for
  errors and collisions, and in 1.3 and higher also collect a few minutes
  of "/opt/SUNWut/sbin/utcapture" output to check for packet loss.

11) All Sun Ray servers in a failover group must run the same SRSS release.
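
The file checks from steps 6) and 8) above can be partly automated. The
following Python sketch is only an illustration under stated assumptions:
it assumes that auth.props consists of simple "key = value" lines with "#"
comment lines, and that the timeout message has the form shown in step 8).
The function names are made up for this example and are not part of SRSS.

```python
import re

# Values from step 6); a setting that is absent is not reported.
REQUIRED = {
    "enableLoadBalancing": "true",
    "enableGroupManager": "true",
    "useLocalPolicy": "false",  # SRSS 1.x only
}


def parse_props(text):
    """Parse 'key = value' lines; blank lines and '#' comments are skipped.

    The comment syntax is an assumption about the file format.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def wrong_settings(props):
    """Return {key: actual value} for settings that are set but not as required."""
    return {
        key: props[key]
        for key, wanted in REQUIRED.items()
        if key in props and props[key].lower() != wanted
    }


def timed_out_interfaces(log_text):
    """Collect (host, interface) pairs from 'token query timed out' messages."""
    pattern = re.compile(
        r"token query timed out to host (\S+) interface\s+(\S+)"
    )
    return sorted(set(pattern.findall(log_text)))
```

To use it, read the contents of /etc/opt/SUNWut/auth.props into
parse_props(), and the contents of each /var/opt/SUNWut/log/auth_log* file
into timed_out_interfaces(). An empty result from wrong_settings() only
means that none of the three parameters is set to a wrong value; it does
not prove that the rest of the configuration is correct.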


Load balancing limitations:
===========================

Sun Ray load balancing is strictly limited to Sun Ray session creation.
There is no way to move an existing user session to another server. Thus,
once a user has logged in, the user's session will stay on this server
until the session has been exited, or terminated.

Furthermore, the load balancing is completely unrelated to assigning DHCP
addresses to Sun Ray appliances. Load balancing takes place once a Sun Ray
appliance which already has a DHCP address successfully connects to the
authentication manager, requesting a session for the current token.


Example scenarios resulting in a poor distribution of load:
-----------------------------------------------------------
A customer has two Sun Ray servers, and uses "pseudo terminal" sessions
exclusively. When one Sun Ray server is rebooted, all Sun Ray appliances
will connect to the other server, and will get sessions there.
When both Sun Ray servers are rebooted at the same time,
inevitably one will be up first, and most Sun Ray appliances, if not all,
will connect to this server, and get sessions there.

TIP: if you use NSCM, the sessions will be created when the
user actually logs in at the NSCM login GUI, rather than when
the appliance initially connects to a Sun Ray server. This late binding
of NSCM will give much better load balancing.



Methods to reduce the impact of poor initial distribution of load:
------------------------------------------------------------------
Generally, when using "pseudo terminal" sessions rather than NSCM,
immediate action should be taken after rebooting all servers in a
failover group to balance the initial load between
the Sun Ray servers. The simplest actions are:
- run "/opt/SUNWut/sbin/utpolicy -i soft" on all servers simultaneously
or
- run "/opt/SUNWut/sbin/utfwsync"

Alternatively, with a little more work, the system administrator
can use the "enhanced session management" functionality provided with
1.3 and higher to terminate sessions which are waiting at the
dtgreet screen. On the server where you want to reduce load, run
"/opt/SUNWut/sbin/utsession -p".
Sessions which are waiting at the dtgreet screen can be identified
by the "I" in the last column. If the administrator terminates
these sessions, new sessions will be created
for the corresponding tokens, and most of these new sessions
will be on the server which has the higher desirability.

Once initial load is unbalanced, the system administrator can also
temporarily prevent Sun Ray servers which are already under high load
from being assigned new sessions by turning them "offline", using
"/opt/SUNWut/sbin/utadm -f". The server can later be switched back
to normal "online" mode using "/opt/SUNWut/sbin/utadm -n". A
server which is offline can be identified by the existence of the file
/var/opt/SUNWut/offline.
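
For scripted monitoring, the marker file mentioned above can be tested
directly. The following is a minimal Python sketch; the function name
is_offline is made up for this example, and the check only looks for the
file, it does not query the group manager.

```python
import os

# Marker file that identifies an "offline" Sun Ray server.
OFFLINE_MARKER = "/var/opt/SUNWut/offline"


def is_offline(marker=OFFLINE_MARKER):
    """Return True if the offline marker file exists."""
    return os.path.exists(marker)


if __name__ == "__main__":
    print("offline" if is_offline() else "online")
```

Run on each server in the failover group, this reports which servers are
currently excluded from receiving new sessions.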

Note: a Sun Ray server which is "offline" will still provide the NSCM login
GUI if NSCM is turned on. However, if a user then logs in, loadbalancing is
triggered, and the actual user session will be created on another server.




References:
===========
16733 Why do all my Ethernet interfaces have the same Ethernet MAC address?
71443 How to check whether a Sun Ray[TM] server is "offline".
utgstatus(1M) manual page
utreplica(1M) manual page
utcapture(1M) manual page (SRSS 1.3 and higher only)
