Random thoughts from an unusual company

Octets to the right of me Octets to the left - Sametime Switches Things Up For Giggles

Gabriella Davis  17 June 2011 12:18:19
This week a freakish confluence of events led to me losing a night's sleep finding and fixing a Sametime behavioural problem.  It could happen to you too, so here's the story and how to avoid it.

For those of you who work with multiple Sametime servers in your environment (and for the sake of this posting we're talking Domino-based Instant Messaging only), you are probably already aware of how a Sametime community is established and communicates.  In very simple terms, any Domino server that has "Yes" marked in the field "is this a Sametime Server" on its server document will be seen by other Sametime servers in that Domino Directory as part of the same community.  Sametime servers in a community will try to talk to each other using the fully qualified hostname assigned to the server port in the server document.  In fact you pretty much can't stop them attempting to talk to each other.  If the servers share a directory and can find each other by resolving the FQHN in DNS, they will try to talk.  You can stop them actually being able to talk by using Trusted IPs, for example, which limits the servers a given server will accept a connection from, but that's a story for another day.  Our story starts like this...

Division A has 2 Sametime servers in the domain, each set to use LDAP (servers must use the same directory type, either LDAP or Domino, to communicate with awareness)
Division B has 3 Sametime servers in the same domain, each set to use Domino
 - my task was to get them all seeing each other.  Easy peasy.  We just convert the Division B servers to LDAP and all should work fine.

So I take down the 3 Division B servers and convert the first one to LDAP. I bring it back online and everything is fine: everyone can log into it and all the buddy lists are fine.  Except the one thing I wanted to achieve wasn't there. The users in Division A still couldn't see the users in Division B and vice versa. Odd.  I check everything:

CommunityConfig.txt (which auto builds when the server starts and shows the list of servers in the Community, their FQHNs and their IP addresses) is building and showing the correct information for all servers on each of the Division B and Division A servers.  That means that the servers know of each other and know they should be communicating.

Sametime.log shows no errors and not even an attempt to connect between the two divisions.  Division A servers show connecting to each other.  Division B servers show connecting to each other.  Neither shows a connection to the other Division.  So we're not dealing with a networking or Trusted IPs problem. There's no error; it's not even trying.

Much debugging later I am sure of that conclusion.  The servers aren't even trying to connect.  I know the answer to this one! IBM has an algorithm that determines which server should call which other server to ensure the connections are managed and sequenced.  The algorithm is based on the octets of the IP address of each server.  This has been the case for a very long time and I'm used to working with it when using servers with multiple IPs or in DMZs.  Essentially, as a Sametime server comes online it compares its last octet with that of each other server in its Community. If its last octet is higher than the last octet of another server, it initiates a connection; if it's lower, it waits for a connection.  It's not pretty but it makes sense and works.
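To make that rule concrete, here is a minimal Python sketch of the last-octet comparison as I've described it. It is my own illustration, not IBM's code: the function names are made up, the addresses are invented apart from their last octets, and since I haven't said what happens when two servers share a last octet, the sketch simply treats a tie as "wait".

```python
def last_octet(ip: str) -> int:
    """Return the final octet of a dotted-quad IPv4 address as an int."""
    return int(ip.split(".")[-1])


def initiates_connection(local_ip: str, remote_ip: str) -> bool:
    """Last-octet rule as described in this post (illustrative only).

    True  -> the local server calls the remote server.
    False -> the local server waits to be called.
    A tie is treated as "wait"; that is an assumption, not documented behaviour.
    """
    return last_octet(local_ip) > last_octet(remote_ip)


# Hypothetical addresses; only the last octets (221 and 204) come from the story.
print(initiates_connection("10.1.1.221", "172.16.5.204"))
print(initiates_connection("172.16.5.204", "10.1.1.221"))
```

Under this rule exactly one side of any pair with different last octets places the call, which is why it normally just works.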

Here's where I hit a problem though.  My calculations showed that the server in Division B, which has a last octet of 204, should have been connected to by the Division A servers, which had last octets of 221 and 220 respectively, and that wasn't happening.  I then brought another Division B server online and it worked perfectly, no problems at all. So what was so special about that server that all the others could talk to it?  Well, it did have suspiciously low numbers across all the octets of its IP address.  That's when I found this gem buried in a technote.

"For 8.0.2 servers the comparison for IP addresses is from left to right"

That's right.  They changed the algorithm from a right to left octet comparison pre 8.0.2 to a left to right comparison in 8.0.2 and onwards.  Then I realised:

Division A servers were all 8.0.2, comparing octets left to right before deciding to initiate a connection
Division B servers were all 7.5.1, comparing octets right to left before deciding to initiate a connection

The network addresses for the Division B servers all had low right octets, so they would never call the Division A servers, and high left octets, so the Division A servers would never call them.
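The stand-off can be sketched in a few lines of Python. This is an illustration under stated assumptions, not IBM's implementation: I'm assuming "higher initiates" holds in both versions, that a tie on one octet moves on to the next octet in the comparison direction, and every address below is made up apart from the last octets 221 and 204. The names initiates and pair_connects are mine.

```python
from typing import Tuple


def octets(ip: str) -> Tuple[int, ...]:
    """Split a dotted-quad IPv4 address into a tuple of ints."""
    return tuple(int(part) for part in ip.split("."))


def initiates(local_ip: str, remote_ip: str, left_to_right: bool) -> bool:
    """True if the local server would call the remote one.

    left_to_right=True  models 8.0.2 and later (per the technote).
    left_to_right=False models pre-8.0.2 servers such as 7.5.1.
    Assumption: ties on one octet fall through to the next octet.
    """
    local, remote = octets(local_ip), octets(remote_ip)
    if not left_to_right:
        local, remote = local[::-1], remote[::-1]
    return local > remote  # tuples compare octet by octet


def pair_connects(ip_a: str, a_ltr: bool, ip_b: str, b_ltr: bool) -> bool:
    """A pair links up if at least one side decides to place the call."""
    return initiates(ip_a, ip_b, a_ltr) or initiates(ip_b, ip_a, b_ltr)


div_a = "10.1.1.221"      # hypothetical 8.0.2 server: low left octets, high last octet
div_b = "172.16.5.204"    # hypothetical 7.5.1 server: high left octets, low last octet
lonely_b = "10.0.0.5"     # hypothetical 7.5.1 server with low octets throughout

print(pair_connects(div_a, True, div_b, False))     # both sides wait
print(pair_connects(div_a, True, lonely_b, False))  # the 8.0.2 server calls
```

With these made-up addresses, each server in the first pair concludes it is the "lower" one by its own rules and waits for the other, while the low-numbered server loses the comparison in both directions and so gets called.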

My luck and a confluence of versions, networks and IP addresses meant that the two groups of servers would never talk.  Except for one sole, lonely server with a very low series of octets as its IP address that was being happily called by everyone.

The fix was either to change the IP addresses of the Division A servers such that their final right octet was lower than that of any of the Division B servers, or to upgrade the Division B servers so they were all 8.0.2 and using the same algorithm.  We changed the IP addresses, rebooted and everything worked.

7.5.1 goes end of support in September, but this is one to watch out for as some of you upgrade, because the left to right algorithm applies to 8.5.x as well as 8.0.2.


Comments

1. Marcus Tate  18/06/2011 15:54:28

Gabriella,

Thanks for posting this! Looks suspiciously very like the answer to the community problems one of my customers has been having....

Marcus