Updating the Location of a Crawled Card in WebCenter Interaction

Much has been written on Content Crawlers and Cards in Plumtree’s Knowledge Directory. Chris Bucchere has done an excellent writeup on creating custom crawlers, and Ross Brodbeck has done the same for cards in the Knowledge Directory.  In fact, as I re-read those two articles, I realize this post addresses open issues in both – how to change the location of a card, and what the Location CRC values are within a card.

In the spirit of giving credit where credit is due, today’s post is based on an excellent tip I learned recently from Fabien Sanglier, who figured out this little gem long before I did, and who I believe has even posted code for it on his ALUI Toolbox project.

First, a word on crawlers:  basically, WCI’s Automation Server just calls several methods in a crawler to perform the following (I’m heavily paraphrasing here):

  1. Open the root path specified in the crawler’s SCI Page and query for “containers” (a.k.a., folders).
  2. Query that container for all “documents” (a.k.a., cards, which don’t necessarily have to be files).
  3. Recursively iterate through each container and query for the documents within each.
  4. For each document found, query for document signature and document fetch URL.
  5. If the document signature or path has changed, flag the card as changed and refresh it (which could be metadata, file content, or security).
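
To make that walk concrete, here is a minimal in-memory sketch of the recursion and the signature comparison. Everything in it (SourceContainer, SourceDocument, CrawlSketch, the knownSignatures dictionary) is a hypothetical stand-in I made up for illustration; it is not the real WCI crawler API, and the real Automation Server does considerably more than this.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for a repository document and folder.
// These are NOT the real WCI crawler interfaces.
public class SourceDocument
{
    public string Path;
    public string Signature; // e.g. a last-modified timestamp
}

public class SourceContainer
{
    public string Path;
    public List<SourceContainer> Children = new List<SourceContainer>();
    public List<SourceDocument> Documents = new List<SourceDocument>();
}

public static class CrawlSketch
{
    // knownSignatures maps a document path to the signature recorded on the
    // previous crawl. Returns the paths that would be flagged as new or changed.
    public static List<string> Crawl(SourceContainer root,
                                     IDictionary<string, string> knownSignatures)
    {
        var flagged = new List<string>();
        Visit(root, knownSignatures, flagged);
        return flagged;
    }

    private static void Visit(SourceContainer container,
                              IDictionary<string, string> known,
                              List<string> flagged)
    {
        // Steps 2 and 4: look at each document in this container and compare
        // its current signature with the one seen last time.
        foreach (SourceDocument doc in container.Documents)
        {
            string previous;
            bool seen = known.TryGetValue(doc.Path, out previous);
            if (!seen || previous != doc.Signature)
            {
                flagged.Add(doc.Path); // step 5: new or changed
            }
        }

        // Step 3: recurse into each child container.
        foreach (SourceContainer child in container.Children)
        {
            Visit(child, known, flagged);
        }
    }
}
```

Note that the lookup is keyed by path, so a document that moves looks brand new to this sketch even if its signature is identical, which is exactly the problem the rest of this post deals with.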

Later, the Document Refresh and Search Update jobs will also use that crawler code to keep track of whether documents have changed in the source repository (by checking the document signature), and whether the document has moved.  If the signature hasn’t changed, the card remains untouched.

Now, let’s say you need to change the path of an NT Crawler because you’re moving those documents elsewhere.  Normally, you’d just move the files and change your crawler’s root path.  The problem with this approach is that the crawler won’t be able to recognize these files as the same ones that are in the Knowledge Directory, because the path has changed.  Consequently, all cards will be deleted and recreated.  This may not be a problem, but if your Content Managers are like any other Content Manager since the Plumtree days, there will be a lot of portlets that link to these documents in their content.  These links will all be broken, because new cards mean new Object IDs, which are part of those URLs (even the “friendly” ones).

The (partial) solution?  Update the paths for the crawlers AND cards through the API, so that the next time the crawler runs, the portal isn’t aware of any changes and doesn’t mess with any of the already-imported cards because the signatures match up.

Here’s the rub, though: not only does the Automation Server check to see if a document’s SIGNATURE has changed (in an NT File Crawler, for example, the signature is just the “last-modified” date), but it also checks to make sure the document’s PATH hasn’t changed.  In other words, if a card has an internal path of \\oldfileshare\folder1\mydoc.doc, and you programmatically change the crawler AND the cards to use \\newfileshare\folder1\mydoc.doc, the cards will STILL get wiped out and crawled in as new.  This is because the portal stores a CRC of the old document path with the card, so that if the path changes, it knows it’s looking at a different document.
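
Here is a rough sketch of that two-part check as I understand it. The names (CardAction, CardMatchSketch, PathHash, Decide) are hypothetical, and PathHash is a plain FNV-1a hash standing in for the portal’s own CRC routine, which I’m not reproducing here; the point is only that the stored path hash is compared before the signature is.

```csharp
using System;

public enum CardAction { LeaveUntouched, Refresh, DeleteAndRecreate }

public static class CardMatchSketch
{
    // Stand-in hash (32-bit FNV-1a). The portal uses its own 64-bit CRC;
    // this is only here so the example runs on its own.
    public static uint PathHash(string path)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in path)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return hash;
        }
    }

    // storedPathHash and storedSignature are what was saved with the card on
    // the last crawl; currentPath and currentSignature come from the repository.
    public static CardAction Decide(uint storedPathHash, string storedSignature,
                                    string currentPath, string currentSignature)
    {
        // A different path hash means "different document", whatever the signature says.
        if (PathHash(currentPath) != storedPathHash)
        {
            return CardAction.DeleteAndRecreate;
        }
        // Same document, but its contents or metadata changed.
        if (currentSignature != storedSignature)
        {
            return CardAction.Refresh;
        }
        return CardAction.LeaveUntouched;
    }
}
```

With the hash still computed from \\oldfileshare\folder1\mydoc.doc, a crawl that finds the file at \\newfileshare\folder1\mydoc.doc lands in the first branch even when the signature is identical. Only when the stored hash is recomputed from the new path does the card fall through to LeaveUntouched.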

Unfortunately, there doesn’t seem to be a way to update this CRC value through the API, so you need to use a direct DB update to make the change.  Below is the code used to generate the CRC and the table where it needs to be updated.  In my next post, I’ll include a more comprehensive listing.

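// Generate the location CRC once; it is stored as two integer halves.
var crc = XPCRC.GenerateCRC64(strCardLocation);
int crca = crc.m_crcA;
int crcb = crc.m_crcB;

// Write both halves to the card's row in ptinternalcardinfo.
DbCommand updateComm = oConn.CreateCommand();
updateComm.CommandType = CommandType.Text;
updateComm.CommandText = "update ptinternalcardinfo set locationcrc_a = " + crca + ", locationcrc_b = " + crcb + " where cardid = " + card.GetObjectID();
updateComm.ExecuteNonQuery(); // the command has to be executed for the update to take effect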

