The Dreaded Collaboration Search Index Rebuild

Does the following screen shot fill you with dread?

If so, you’re likely one of us who have had the dubious displeasure of working with the terribly bad “Collab Reindexing” functionality, where you ask Collab to rebuild the contents of the search index – usually after a migration or upgrade. The problem is that Collab can store hundreds of thousands of documents, and this process tends to fail, fill up disk space, or do various other terrible things like kidnapping your hamster and holding it for ransom. When any of those things happens, you have to just start over from scratch and pray harder next time.

I recently did a WebCenter 10gR4 migration/upgrade because Windows Server 2003 is approaching end of life in July, and we wanted to upgrade to Windows Server 2008 without jumping through any crazy hoops that Oracle wouldn’t support in 10gR3. The 10gR4 support matrix indicates Windows 2008 (x86 and x64) support for WCI 10.3.3, Collab 10.3.3, and Analytics (don’t ask why the versions aren’t the same, or why Publisher is now technically unsupported for those upgrading to 10gR4). I won’t even rant here about how terrible the installation process was, resolving error after error and manually installing services and deploying web apps, but the Collab Search rebuild deserves a special place in hell.

First, let’s get the “good news” out of the way since there isn’t much of it: in Collaboration Server 10gR4, there’s now an option to re-index only the documents that haven’t been submitted to search yet because the process just exploded. Which is super. While I’d prefer the process not “explode” all the time, I’ll take what I can get: restarting the process at document 110,000 out of 230,000 is much better than restarting at document 1.

The bad news is that the process still fails all the time, despite claims of fixes and notes of “re-breakings” (Oracle bug – login required). The logging is terrible (hint: you can tweak some log settings by cracking open the Collab .war file and changing log parameters there, and you can turn on the following in PTSpy, in addition to “Collab” and “SearchService”, to get a more accurate picture of this bizarrely convoluted process: ALUIRemoting, Remoting, dummy, Configuration). When the search rebuild process fails, you can no longer access Collab Administration (because it seems the first thing it tries to do is connect to the Search Service, and that’s dead now). And the errors that show up aren’t even consistent over time. You’re likely to see a ton of different errors after a period of time, usually from the remoting process.
This error seems to be quite common though:

javax.jms.JMSException: Exceeded max capacity of 50

What can you do about this? I changed the transport protocol for the Collab Search Service from TCP to HTTP and it cleared up the problem, although for the life of me I can’t tell you why. Read on for details of how to do this yourself…

This tip is one of the most extreme things I’ve ever done with reverse-engineering crazy WCI code and problems (and I’ve re-signed Oracle code, decompiled tons of code and written countless plugins), so take it with a grain of salt and proceed at your own (unsupported by Oracle) risk.

First, let’s look at the insane process behind the Collab re-build:

  1. User clicks “rebuild” in Collab Admin
  2. Collab somehow knows all the documents currently in the search index and deletes them (unless, in 10gR4, the user is just “reindexing documents not already in the search index”)
  3. Collab builds a list of documents that need to be submitted to the index. As far as I can tell, this list uses an embedded database and doesn’t go into the “regular” Collab DB (like the portal’s PTCARDSTATUS table), which is why you lose it when you are forced to restart Collab. I could be wrong about this, given the new functionality in 10gR4, but either way a list is created.
  4. For each document in this list of tens or hundreds of thousands, the following is performed:
    1. Collab downloads the document from the Document Repository to its own temp directory: %PT_HOME%\ptcollab\10.3.3\tmp\.
    2. Collab then seems to parse this document and extract tokens/etc., but it’s unclear exactly what it’s doing with the document at this point.
    3. Collab then sends this document and other information via JMS to the Collab Search Service. Not the actual Search Service.
    4. The document is received by the Collab Search Service and ends up in ANOTHER temp directory on the Collab Search Service system, something like %PT_HOME%\common\containers\Tomcat\6.0.16\temp\.
    5. At this point, the document is supposed to be deleted from the Collab temporary folder, as it’s been passed off.
    6. Presumably the Collab Search Service does something else to the document at this point. Unless it really is just there to add to this whole Rube Goldberg machine.
    7. The Collab Search Service then sends the document over a TCP connection using a proprietary search protocol to the actual Search Service.
    8. The actual Search Service stores the document lexicons and the document is officially “indexed” and ready for searching.
    9. The Collab Search Service then cleans up its temp copy of the file.
  5. Once complete, the search index should have an index for each and every (searchable) document in Collab.

When this process goes wrong, it appears to be mostly between steps 4(3) and 4(6) above – the JMS connection. Documents get left in one or both of those temp folders, and Collab just stops responding. Occasionally you’ll see that string of errors shown above about “exceeded max capacity of 50” or some other JMS error, but generally it just seems to die. I *think* what’s actually dying is the JMS service on the Collab Search Service, and this is coupled with some problem in the Collab code where there’s some sort of uncaught exception. So Collab just stops trying to index documents, and becomes unresponsive because Collab Search has a full JMS message queue somehow. I did a ton of decompiling and testing, and in focusing on this part of the process, I found ActiveMQ is the JMS messaging server being used by Collab and Collab Search (and CNS, which causes a different set of issues related to logging and tying things in a way I have yet to fathom).
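
Since leftover documents in those two temp folders are the most visible symptom of a stalled rebuild, a small check for files that have been sitting there too long can serve as an early warning. The following is only a sketch: the class name, the idea of an age threshold, and the threshold value itself are my assumptions, not anything Collab provides.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.stream.Stream;

// Hypothetical helper, not part of Collab: counts regular files in a temp
// directory whose last-modified time is older than a given age.
final class StaleTempFiles {

    static long countOlderThan(Path dir, Duration maxAge, Instant now) throws IOException {
        Instant cutoff = now.minus(maxAge);
        // Files.list is not recursive; that is enough for a flat temp folder.
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .filter(Files::isRegularFile)
                .filter(p -> lastModified(p).isBefore(cutoff))
                .count();
        }
    }

    private static Instant lastModified(Path p) {
        try {
            return Files.getLastModifiedTime(p).toInstant();
        } catch (IOException e) {
            // The file was probably cleaned up between listing and checking,
            // which is the healthy case, so never count it as stale.
            return Instant.MAX;
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: temp directory to inspect; args[1]: age threshold in minutes.
        Path dir = Paths.get(args[0]);
        Duration maxAge = Duration.ofMinutes(Long.parseLong(args[1]));
        long stale = countOlderThan(dir, maxAge, Instant.now());
        System.out.println(stale + " file(s) older than " + maxAge.toMinutes() + " min in " + dir);
    }
}
```

Run it against the Collab temp directory and the Collab Search Service temp directory while a rebuild is in progress. A count that keeps growing suggests documents are no longer being handed off or cleaned up; what counts as "too old" depends on your document sizes and hardware.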

By changing ports and using tcpTrace I tried to get a feel for what messages were being sent over the wire. It quickly becomes apparent they’re using the TCP transport protocol, which explains why Oracle uses the “failover:tcp://SEARCHSERVICE-HOSTNAME:15275” format (see Oracle KB article – login required). The problem is that this means binary data is being sent over the wire, which makes the traffic hard to follow.
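
For anyone without tcpTrace handy, the "change the port and sit in the middle" trick can be approximated with a few lines of Java. This is a rough stand-in that I'm sketching for illustration, not tcpTrace itself: the class name and labels are made up, it does no cleanup beyond closing both sockets when either side stops, and it simply prints every chunk as hex.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical stand-in for a tracing proxy: listens on one port, forwards
// to the real service, and prints every chunk it relays as hex.
final class LoggingTcpRelay {

    static String toHex(byte[] buf, int len) {
        StringBuilder sb = new StringBuilder(len * 3);
        for (int i = 0; i < len; i++) {
            sb.append(String.format("%02x", buf[i] & 0xff));
            if (i < len - 1) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    // Copies bytes one way until the stream ends, logging each chunk.
    static void pump(String label, Socket from, Socket to) {
        byte[] buf = new byte[4096];
        try {
            InputStream in = from.getInputStream();
            OutputStream out = to.getOutputStream();
            int n;
            while ((n = in.read(buf)) != -1) {
                System.out.println(label + " " + n + " bytes: " + toHex(buf, n));
                out.write(buf, 0, n);
                out.flush();
            }
        } catch (IOException e) {
            System.out.println(label + " stopped: " + e.getMessage());
        } finally {
            // Closing both ends also unblocks the pump running the other way.
            closeQuietly(from);
            closeQuietly(to);
        }
    }

    static void closeQuietly(Socket s) {
        try {
            s.close();
        } catch (IOException ignored) {
            // Nothing useful to do if close fails.
        }
    }

    public static void main(String[] args) throws IOException {
        // args: <listenPort> <targetHost> <targetPort>
        int listenPort = Integer.parseInt(args[0]);
        String targetHost = args[1];
        int targetPort = Integer.parseInt(args[2]);
        try (ServerSocket listener = new ServerSocket(listenPort)) {
            while (true) {
                Socket client = listener.accept();
                Socket server = new Socket(targetHost, targetPort);
                new Thread(() -> pump("collab -> search", client, server)).start();
                new Thread(() -> pump("search -> collab", server, client)).start();
            }
        }
    }
}
```

Point Collab at the relay's listen port instead of the Collab Search Service port and let the relay forward to the real one. With the TCP transport, the output is mostly opaque bytes, which is exactly the problem described above.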

After trying countless things to get this to work like tweaking the connection string for longer timeouts, I decided to try and switch out the transport from TCP to HTTP so I could better monitor the traffic and see what was actually failing. This turned out to be incredibly complicated, but the high-level steps are listed here. You’ll see why this really isn’t for the faint of heart, so if you need help trying this out drop me a line.

  1. I first tried upgrading ActiveMQ within Collab and Search to 4.1.2 from 4.0.x without changing anything else, but that didn’t change the failure rates. Still, it might need to be done to get this all to work. AND you still have to fix and compile classes within ActiveMQ to fix bugs in THAT library set related to the HTTP transport:
    1. Fix bugs in activeMQ
      1. activemq-core-4.1.2.jar: need to recompile and update \org\apache\activemq\util\ByteSequence.class because of this bug.
      2. activemq-optional-4.1.2.jar: need to recompile and update \org\apache\activemq\transport\http\HttpTunnelServlet.class and \org\apache\activemq\transport\util\TextWireFormat.class due to this bug.
    2. Update collab.war
      1. Add optional activeMQ jar files (for HTTP transport code and dependencies)
      2. replace existing activemq.jar with activemq-core-4.1.2.jar
    3. Update searchservice.war
      1. Remove files: activeio-core-3.0-beta4.jar, activemq-console-4.0.2.jar, activemq-core-4.0.2.jar, commons-logging-1.1.jar, geronimo-jms_1.1_spec-1.0.jar
      2. Add optional activeMQ jar files and activemq-core-4.1.2.jar
  2. Once you’ve upgraded ActiveMQ on both ends, you can turn on the HTTP transport protocol. This involves configuration settings on both servers:
    1. Collaboration Server: change the connection URL for the Search Service from failover:tcp://wci-collabsearch:15275 to http://wci-collabsearch:15275
    2. Collaboration Search Service: here we want to make a few tweaks to files inside the \WEB-INF\lib\remoting-support.jar file inside of the searchservice.war file. We need to do this to update configuration settings that can’t be changed through the configuration manager:
      1. META-INF/plumtree/openlog – update the following line to use “ALUIRemotingCollabSearch” – makes debugging easier so that Search Service doesn’t log to PTSpy using the same name as CNS and Collab

        - applicationName=ALUIRemoting
      2. META-INF/lingo-server-jca-macro.xml

        - replace 2 instances of “serverBroker” with “serverBrokerSearch” (just to use a unique name from CNS, which already uses “serverBroker”. There likely is no real conflict here given the configuration we’re using.)

        - replace 2 instances of “tcp://localhost” with “”. Note that http://localhost doesn’t work because the transport binds on only the loopback address; need to use to bind on all interfaces.
      3. META-INF/lingo-server-listenercontainer-macro.xml

        - same as above; 3 instances of “serverBroker” and one instance of “tcp://localhost”
      4. META-INF/lingo-server-macro.xml

        - same as above; 2 instances of “serverBroker” and one instance of “tcp://localhost”
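
The string edits in step 2 are easy to get wrong by hand, so here is a minimal sketch of them as code. Everything in it is hypothetical scaffolding of mine (class name, method names, the sample fragment); it works on text you have already extracted from remoting-support.jar and does not touch the .jar or .war itself. The replacement bind URL is deliberately a parameter rather than a hard-coded value, so supply whatever address you settle on.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; nothing here ships with Collab or ActiveMQ.
final class RemotingConfigPatch {

    // Matches "serverBroker" only when it is not already followed by "Search",
    // so running the patch twice does not produce "serverBrokerSearchSearch".
    private static final Pattern BROKER_NAME = Pattern.compile("serverBroker(?!Search)");

    // Counts names still to be renamed; compare with the instance counts listed above.
    static int countUnpatchedBrokerNames(String text) {
        Matcher m = BROKER_NAME.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    static String renameBroker(String text) {
        return BROKER_NAME.matcher(text).replaceAll("serverBrokerSearch");
    }

    // Literal (non-regex) replacement; the new bind URL is supplied by the caller.
    static String replaceBindUrl(String text, String newBindUrl) {
        return text.replace("tcp://localhost", newBindUrl);
    }

    // Collab side of the change: failover:tcp://host:port becomes http://host:port.
    static String toHttpBrokerUrl(String brokerUrl) {
        String prefix = "failover:tcp://";
        if (!brokerUrl.startsWith(prefix)) {
            throw new IllegalArgumentException("Unexpected broker URL: " + brokerUrl);
        }
        return "http://" + brokerUrl.substring(prefix.length());
    }

    public static void main(String[] args) {
        // Made-up fragment standing in for one of the macro files.
        String sample = "<bean id=\"serverBroker\"/> <ref bean=\"serverBroker\"/>"
            + " <uri>tcp://localhost:15275</uri>";
        System.out.println(countUnpatchedBrokerNames(sample) + " broker name(s) to rename");
        System.out.println(renameBroker(sample));
        System.out.println(toHttpBrokerUrl("failover:tcp://wci-collabsearch:15275"));
    }
}
```

Checking the count before patching is a cheap sanity test: if a file doesn't contain the number of “serverBroker” instances listed above, you are probably looking at a different version of remoting-support.jar and should stop before repackaging anything.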

So what do we get for all those changes? The ability to monitor the traffic from Collab to Collab Search, as seen here:

In a side effect that I can only describe as “miraculous”, this change seems to prevent the Collab reindexing process from ever failing in the first place. Why? I honestly don’t know; I was just trying to switch to this transport layer to see if there were any error messages being transmitted between the servers. My guess is that there’s some sort of bug in the TCP transport layer for ActiveMQ, exacerbated by a bug in the Collab stack that doesn’t handle those failures gracefully. The ActiveMQ transport change doesn’t fix the Collab bug but avoids that issue entirely by avoiding the transport exceptions.

Good luck!
