Random thoughts from an unusual company

Adventures In ST Installing - Lessons I Learned So You Don’t Have To

Gabriella Davis  4 August 2011 14:43:02
Time for another Sametime blog post.  I've been installing Sametime servers in various configurations for a few months now, and in between I've been working with Marie and Tom on the Sametime Admin Guide, which is complete in its first draft and now in the revision stage.  In its new incarnation Sametime isn't the easiest product to troubleshoot, so I thought it would be useful to share some problems I've found and how I tracked them down.  In many cases I should have opened support calls with IBM to resolve this stuff but I never had the time: everything needed to be up and working and there wasn't the luxury of spending a few days following through support calls.  I'm sorry about that because I know it's easier to identify a problem when it still exists, but hopefully this will help someone else on the same path.  I'll also cross-post to the Sametime forum on developerWorks.

Incident 1: Sametime on Linux Security Lockout
When installing the Sametime System Console on Linux (so far confirmed with both Red Hat and SLES) I have discovered that once you get to the point of exporting the LTPA token to set up SSO, on the next SSC restart you will no longer be able to log in with your admin credentials.  I found an old IBM technote about enabling global security on Portal with Linux which had the same issue (and which I can't find now!).  I was not only able to repeat this on different installs, but I've had 3 other companies email and ask about the same thing.  The only reliable "fix" is to turn off security for the SSC so you just log in using the user name (ouch) or, my preference, stop using Linux for the SSC at least.

Incident 2: IBM's documents on turning off security
This document has the syntax you need for turning off security so you can login if you lose your credentials (or if incident 1 occurs).  Unfortunately nowhere is there documentation to turn it back on again.  I ended up using the following on the deployment manager install:

wsadmin -username -password (note I'm not using 'conntype none' here)
then type "securityon" on the prompt to turn security back on

Note in both these cases the servers will need restarting for the settings to take effect.

Incident 3: Server install on Linux cannot connect to SSC deployment manager to pick up profile
When completing a Linux install of a Sametime Proxy, Meeting or Media server, etc, you would usually first create a deployment plan in the SSC then, during the install of your new Sametime component, connect back to the SSC and pick up that deployment plan.  The default connection assumes the use of SSL and port 9443.  This works fine for Windows but on some Linux installs I received "server not responding" when trying to connect to the deployment manager.  I stopped iptables, checked the hostname of the SSC was pingable from the new machine I was attempting to install onto, checked 9443 was listening on the SSC server, etc, but I still got the same error.  Eventually I unchecked the "use SSL" checkbox and connected using port 9043 instead and it worked straight away.  Since the certificate shipped with the SSC is an internally-generated IBM certificate that isn't recognised by most browsers, I believe the Linux install was refusing to connect using it, whereas Windows was much less stringent.
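The checks described above (is the port reachable, and is it the certificate that is being refused?) can be separated with a small script run from the machine you are installing onto.  This is only a rough sketch in Python, not anything the Sametime installer does itself: the function names are my own, and a modern Python may also refuse the handshake because of old protocol versions rather than certificate trust, so treat a failure as a hint rather than proof.

```python
import socket
import ssl


def check_tcp(host, port, timeout=5.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_tls(host, port, verify=True, timeout=5.0):
    """Attempt a TLS handshake and return (ok, detail).

    With verify=False the certificate is not checked, so comparing the
    verified and unverified results hints at whether trust is the obstacle.
    """
    context = ssl.create_default_context()
    if not verify:
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return True, tls.version()
    except (ssl.SSLError, OSError) as exc:
        return False, str(exc)


# Example use (the hostname is hypothetical, the port is the one from the text):
#   check_tcp("ssc.example.com", 9443)
#   check_tls("ssc.example.com", 9443, verify=True)
#   check_tls("ssc.example.com", 9443, verify=False)
```

If the TCP check passes and the unverified handshake succeeds while the verified one fails, that points at the certificate rather than the firewall or the network.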

Incident 4: Making sure you know what hostname you install the SSC as
The SSC is the first component we install in most new Sametime implementations and during the install it uses the hostname of the machine you are on to create itself.  Even if you have another FQHN that is resolvable to the SSC box, when trying to complete an install of another component such as the Meeting Server and connect to the SSC deployment manager to do so, the connection will want to use the hostname (box name) that you originally installed the SSC as.  This must be resolvable from the machine you are installing from.  If in doubt check the logs on the installing machine to verify what hostname it is trying to use to connect to the SSC.
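A quick way to confirm that both the box name and any other name you use for the SSC resolve from the machine you are installing from is to ask the resolver directly.  A minimal sketch in Python, assuming nothing about Sametime itself; the hostnames in the usage comment are made up and should be replaced with your own.

```python
import socket


def resolve(hostname):
    """Return the sorted IP addresses a hostname resolves to, or [] if none."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def report(names):
    """Map each candidate name to the addresses it resolves to."""
    return {name: resolve(name) for name in names}


# Example use (hypothetical names: the short box name and the FQHN):
#   report(["sscbox", "sametime-console.example.com"])
```

An empty list against the box name is the case to watch for: that is the name the installing machine will try to use, whatever other names also point at the SSC.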

Incident 5: Recreating the SSC
I'm sure there must be an easy way to re-use an existing STSC database for a rebuilt or moved install, so that all your configuration is already in place, without the SSC installer assuming it needs to create a new database connection as it installs.

My scenario was that I was building each Sametime component and the DB2 server all on different machines, everything virtualised and with snapshots.  I build DB2, which works fine.  I build the Domino server for LDAP.  I build the Domino install which will host the Community Server.  I create the DB2 database for the System Console on the DB2 server.  I build the System Console (SSC).  I start my work setting up the SSC to connect to LDAP and generate a deployment plan for the Community Server, and then I install the Community Server.  At this point I've built 5 servers and I realise the SSC is running out of disk space and throwing errors.  The virtual machine build only had 35GB of disk and it simply ran out.  I could have tried to add more disk but I'm not a hardware girl and it seemed simplest to roll back the SSC, leaving everything else in place, and rebuild it pointing to the same SSC DB2 database containing all the configuration I'd already done.  So I do a new clean install of the SSC but I don't create a new DB2 database for it because there is already one on the DB2 server with the name I want.  All appears to work fine: I log into the SSC and am delighted to see my Community Server, LDAP, deployment plans, etc, still in place.  Then I see, under DB2 databases in the SSC, two entries for STSC (the SSC database name).  So somewhere during the SSC install it created a pointer to the STSC database on the DB2 server, but there was already a pointer from the earlier install, so now I have two.  Neither can be removed.  Neither works.  If I try to edit or modify either I get errors.  I ended up having to delete the STSC database on the DB2 server, roll back, and create the configuration all over again.
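Since the whole saga started with a virtual machine quietly running out of disk, a free-space check before and after each install step would have caught it earlier.  A minimal sketch in Python; the path and the threshold in the usage comment are my own assumptions, not IBM sizing guidance.

```python
import shutil


def free_gb(path):
    """Return the free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / (1024 ** 3)


def has_headroom(path, minimum_gb):
    """Return True if at least `minimum_gb` GiB are free at `path`."""
    return free_gb(path) >= minimum_gb


# Example use (hypothetical install path and threshold):
#   has_headroom("/opt/IBM", 10)
```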

Incident 6: In the SSC some components can no longer be accessed
Install your servers, most components having their own machines.  At some point you go into the SSC and choose your Meeting Server but instead of coming up with your Meeting Server you get a WAS error with "portlet not installed".  The Meeting Server itself continues to run, start, stop etc fine.  Meetings work.  Policies can still be set up and applied.  But the SSC interface can no longer show Meeting Server management no matter how many reboots are tried.  Errors in logs report a missing portlet.  No changes were made to the environment, which had been running a couple of weeks before this occurred.  So far only seen on Linux installs.
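To find the relevant entries without scrolling through an entire SystemOut.log, the file can be filtered for error and warning entries that mention a keyword such as "portlet".  This is a rough sketch in Python that assumes the basic WAS log layout of a bracketed timestamp, an 8-character thread ID, a short component name, a one-letter event type and then the message; the function name and default event types are my own choices.

```python
import re

# Assumed layout of one basic-format entry:
#   [timestamp] threadId component eventType message
ENTRY = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+(?P<thread>[0-9a-fA-F]{8})\s+"
    r"(?P<component>\S+)\s+(?P<type>[A-Z])\s+(?P<message>.*)$"
)


def find_entries(lines, keyword, types=("E", "W", "F")):
    """Return parsed entries of the given event types whose message contains keyword."""
    keyword = keyword.lower()
    hits = []
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        if not match:
            continue  # continuation lines such as stack traces are skipped
        if match.group("type") in types and keyword in match.group("message").lower():
            hits.append(match.groupdict())
    return hits


# Example use (log_path is wherever your profile writes SystemOut.log):
#   with open(log_path, errors="replace") as handle:
#       for entry in find_entries(handle, "portlet"):
#           print(entry["timestamp"], entry["type"], entry["message"])
```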

Incident 7: Meeting Server kind of stops working but is still working
This was 11 hours of my life this week.  My Meeting Server, which was built on Windows 2008 in early June, stopped working.  The server still started and stopped OK.  It showed no errors in any logs, but if I attempted to attend a meeting from any client machine the meeting would open in the rich client then drop out within 10 seconds with the helpful error that "You cannot join the meeting at this time".  I created new meetings: same error.  I checked the DB2 database: meetings are being created OK.  I use the web interface: the meeting opens in the browser and appears OK, but the big clue is no awareness in the participant list, and an error if I try to upload any files.  Usually online awareness not working in meetings is a problem with DNS or the proxy server so I run down that path for a bit.  My client logs show a 503 error and suggest I turn on .com.ibm.rtccore=finest logging as detailed in this document.  I do that, but on every client machine where it's enabled the error continues, along with the request to turn on that logging.  It simply isn't being picked up.  Eventually I track it down by connecting directly to the Meeting Server on port 9082 instead of to its WAS Proxy on port 443.  The WAS proxy works (that's how we were logging in and doing everything), but by connecting directly the error goes away and I discover there are no problems with the Meeting Server itself, just its proxy.
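The comparison that finally cracked it, the same request through the proxy and then straight to the server, is easy to script so it can be repeated the next time.  A minimal sketch in Python: the URLs in the usage comment are hypothetical, whether the direct port speaks plain HTTP is an assumption, and an HTTPS URL with an untrusted certificate will simply come back as "no response" here.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error such as 503
    except (urllib.error.URLError, OSError):
        return None


def diagnose(proxy_status, direct_status):
    """Compare the proxied and direct results and say where to look first."""
    def ok(status):
        return status is not None and status < 400

    if ok(proxy_status) and ok(direct_status):
        return "both paths respond"
    if ok(direct_status):
        return "suspect the proxy"
    if ok(proxy_status):
        return "suspect the direct path"
    return "suspect the Meeting Server or the network"


# Example use (hypothetical hostnames, ports taken from the text above):
#   diagnose(http_status("https://meetings.example.com/"),
#            http_status("http://meetingserver.example.com:9082/"))
```

A 503 through the proxy alongside a clean answer from the direct port is the pattern I was seeing.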

Every product has its teething problems and there is so much new, and so much good stuff, in 8.5.2 that I'm not surprised I've found a few things.  I should also mention the above is the result of over 30 installs, many of which were error-free, so don't be disheartened.  I share this because if you find these problems, well, it's good to know it's not just you, isn't it :-)

I'm going to do my next blog post on reading WAS logs and how to hunt down errors.  It's something I do a lot of and I think some people will find it useful.




Comments

1. Steve Pitcher  05/08/2011 13:45:46

I've seen some oddities in 8.5.2...battling them now.

I can't find any info on this one:

Public group in 8.5.2 ST client shows as "Public Group Subscription is Pending" and loads 0 contacts. I roll back to 8.0.2 client and the public group works. See that yet?

2. Keith Brooks  05/08/2011 15:17:30

#7 I experienced as well with Sametime servers and it came down to an incorrect FQDN reference that was not matching the DNS or server name at the time.

Since all points must be equal, one was not, took a while to find it too.

Good work around you did.

3. Gab Davis  05/08/2011 16:40:22

@Steve are you newly LDAP or have you always been?

@Keith - sadly not that. The server had been running for over 2 months with no problems when suddenly it misbehaved.

4. Steve Pitcher  07/08/2011 03:56:23

Just converted the Sametime server to use LDAP actually. I figure most of my problems are due to that! LOL

5. Gab Davis  08/08/2011 00:03:04

@Steve - I'd check your LDAP document and make sure your group name attribute is correct. I assume you've gone through this doc?

{ Link }

Also the LDAP demands from Sametime are quite high, so you need to make sure the Directory Server you are using can handle that. Sametime over LDAP also has issues with nested groups, especially heavily nested groups, as they put a big demand on the Directory Server, which is set by default to start delaying requests if too many are sent at once. See this document { Link }

I hope that helps

6. Ryan Desjardins  17/08/2011 15:34:03

Hi Gabriella - the IBM Support & Dev teams reviewed the feedback you provided and wanted to respond to the best of our ability, given the information available, and to post here for the sake of rediscovery if any of your readers encounter similar issues. Thanks for taking the time to write these up and don't hesitate to reach out to me if you have any questions!

-Ryan

Incident 1: Sametime on Linux Security Lockout

Response: We are not familiar with this issue based on the description given, and it is something we haven't seen internally using what we believe to be the correct steps. If we can get more information on where exactly in the install steps this problem was seen - and ideally, logs from the SSC DM and App Profile directories - we can make sure this is being covered in our own SSO testing.

Incident 2: IBM's documents on turning off security

Response: Thank you for pointing out this gap - we have updated the documentation here with your comment: { Link }

For your reference (and anyone who is also reading along), for any of our documentation on the Sametime wiki, if you do find mistakes or omissions - while we hope that it is infrequent - please feel free to comment directly in the wiki and we can get things updated.

Incident 3: Server install on linux cannot connect to SSC deployment manager to pick up profile

Response: As you've noted, you have to be using the hostname for which the certificate was created; aliases will not work. If we can get more information including your specific steps to reproduce, we can also clarify the error message to include information to remind/educate users that they must use the appropriate host name.

Incident 4: Making sure you know what hostname you install the SSC as

Response: As in Incident 3, we are going to examine clarifying the error message and deployment summary panel in the SSC to include information to remind/educate users that they must use the appropriate FQDN. We have created SPR WHER8KSK6L which is a request to clarify this in an upcoming release.

Incident 5: Recreating the SSC

Response: As the IBM team discussed this issue, we are considering a two-pronged approach:

First - we will look at ways to reuse the SSC data in a reinstallation scenario, so that if the data exists in DB2 already, we can take advantage of that.

Second - for a completely clean environmental reset (in which case, the existing data in DB2 shouldn't be used) - we will investigate ways to instruct the admin during the install about how to drop the database, allowing them to start from scratch. Similarly, instructions can be included on the uninstall summary and documentation.

Incident 6: In the SSC some components can no longer be accessed

Response: There are several possible causes for this, the most common 2 being after a cluster creation activity or after a failed CELL installation on the same physical host that the Sametime System Console is installed on. This can also happen if something has corrupted the ISC application deployment.xml file or directories. We have recently published Technote 1508641, and created a corresponding hotfix (TPAE-8KM395), which will resolve our known causes for this issue. TN link: { Link }

Incident 7: Meeting Server kind of stops working but is still working

Response: Is this problem still left unsolved? We'll need more information and suggest opening a PMR if so. Our concern would be that connecting directly to a Meeting server (in a clustered environment) and bypassing the WAS HTTP Proxy would lead to unintended consequences in the environment.

7. Gab Davis  21/08/2011 23:40:35

Thanks Ryan. Cross posting my answer from the Sametime Forum.

Incident 1: Sametime on Linux Security Lockout

Response: We are not familiar with this issue based on the description given, and it is something we haven't seen internally using what we believe to be the correct steps. If we can get more information on where exactly in the install steps this problem was seen - and ideally, logs from the SSC DM and App Profile directories - we can make sure this is being covered in our own SSO testing.

I have seen it twice on Linux and have had other people contact me when they run into the same thing. Unfortunately I had to rebuild the environment so the logs are gone but the process in each case was the same.

1. Build the entire ST environment including all components federated into a single cell. Everything works fine, except that logging into the Community Server doesn't automatically authenticate you to the Meeting Server, because SSO isn't enabled.

2. Go into the ISC of the Cell (the one hosting the SSC) and under Global Security choose SSO and generate LTPA keys. Export those keys to a file and import them into an SSO document in Domino.

3. Restart Domino as well as every single WAS server, node agent and deployment manager.

At this point everything seems to be working fine. However, when you go to log in to the SSC again, it won't take the credentials that worked perfectly well before. If you then try to stop the servers, they won't stop because the credentials you supply, which have always worked, no longer do. Disabling security fixes the problem but enabling security brings it back again. My only option has been to either leave security disabled or rebuild on Windows. I hope that helps.

Incident 3: Server install on linux cannot connect to SSC deployment manager to pick up profile

Response: As you've noted, you have to be using the hostname for which the certificate was created; aliases will not work. If we can get more information including your specific steps to reproduce, we can also clarify the error message to include information to remind/educate users that they must use the appropriate FQDN.

Actually this problem relates to the connection to the deployment manager on the secure port 9443, which fails on a Linux install. The installer returns "cannot connect to deployment manager" (I assume because it doesn't like the IBM certificate) and the install will only continue if I uncheck "use SSL" on the install screen.

Incident 5: Recreating the SSC

Response: As the IBM team discussed this issue, we are considering a two-pronged approach:

First - we will look at ways to reuse the SSC data in a reinstallation scenario, so that if the data exists in DB2 already, we can take advantage of that.

Second - for a completely clean environmental reset (in which case, the existing data in DB2 shouldn't be used) - we will investigate ways to instruct the admin during the install about how to drop the database, allowing them to start from scratch. Similarly, instructions can be included on the uninstall summary and documentation.

Great. For what it's worth, it would be good to have instructions on removing a profile using manageprofiles too. Sometimes the uninstaller won't complete and everything has to be manually removed.

Incident 6: In the SSC some components can no longer be accessed

Response: There are several possible causes for this, the most common 2 being after a cluster creation activity or after a failed CELL installation on the same physical host that the Sametime System Console is installed on. This can also happen if something has corrupted the ISC application deployment.xml file or directories. We have recently published Technote 1508641, and created a corresponding hotfix (TPAE-8KM395), which will resolve our known causes for this issue. TN link: { Link }

Great news, thanks. The corruption sounds likely, which is good because knowing that means I can replace the configuration files from backup if I have to.

Incident 7: Meeting Server kind of stops working but is still working

Response: Is this problem still left unsolved? We'll need more information and suggest opening a PMR if so. Our concern would be that connecting directly to a Meeting server (in a clustered environment) and bypassing the WAS HTTP Proxy would lead to unintended consequences in the environment.

Understood. Actually restarting the HTTP Proxy did resolve the issue temporarily but if it occurs again I will open a PMR.

thank you

8. Ryan Desjardins  24/08/2011 18:38:33

Hi again,

A few last points to close this out:

Incident 1:

The steps you provided appear correct and appear to match what we've tried internally without seeing the error. One thing some folks wondered was whether the WAS admin name you attempted to use already exists in the LDAP repositories. In most of our testing, we use a unique name which isn't in LDAP (like the default "wasadmin"), so we are curious if that is causing the problem.

Incident 3:

If you are trying to connect to the same FQHN for which the cert was created, then this would appear to be something new we haven't seen, so we would have to dig into this more closely to get to the bottom of it.

Incident 5: We have submitted an enhancement request to remove a profile using manageprofiles.

For any of these issues, please don't hesitate to reach out to me directly to troubleshoot/investigate further to get more firmed up answers: rdesjardins - at - us.ibm.com

-Ryan

9. Gabriella Davis  01/09/2011 14:34:00

Just to clear this up / close it off

Incident 1.

No, it definitely doesn't exist anywhere else. I use wasadmin and I've done over 30 of these installs. It's only Linux that has presented a problem consistently.

Thanks for your efforts with this

Gab