Archive for the ‘Content Crawlers’ Category

Update WebCenter Content Crawler and Job Owners

Tuesday, October 26th, 2010

As the portal evolved from Plumtree to ALUI (AquaLogic User Interaction) to WCI (WebCenter Interaction), one legacy feature was born with good intentions but, like a vestigial organ in human anatomy, has survived evolution with little to no value.

This feature is object “owners” for Content Crawlers and Automation Jobs.  I’m sure there was a grand intent at some point for “owners” to mean something, but I haven’t found it yet.

The “owner” is the portal user that is scheduled to run a job.  But if that user is later deleted, the portal doesn’t clean up after itself – and all jobs that the user created are “orphaned” and won’t run, showing an error in PTSpy like:

Sep 12, 2010 10:07:02 PM- *** Job Operation #1 of 1 with ClassID 38 and ObjectID 202 cannot be run, probably because the operation has been deleted.

The fix – which I will condone until I can figure out why jobs need “owners” in the first place – is to just make the owner of all jobs and crawlers the Administrator user. The Administrator user can’t be deleted, and since I haven’t found any problems with running a crawler as this user, you can just do a portal-wide update making this change with the following SQL:


Helpful SQL to determine if you’re being bitten by this “vestigial organ” after the break.


There’s a WCI App For That 4: CardMigrator

Tuesday, September 28th, 2010

In my last post, I talked about the need to update both a card’s location AND its CRC in the WebCenter Interaction database to migrate cards from one UNC path to another.  Today’s post is about an “App For That”, which is a utility I had written last year but essentially abandoned until Fabien Sanglier’s great tip about the CRC value needing to be changed.

The app is one of those “thrown together” .NET jobs where I was more focused on the need to update tens of thousands of cards for a client than on building a pretty and usable UI.  As such, there isn’t a whole lot of error checking, and I’m not comfortable sharing the whole code base here – mostly because I’m just embarrassed about how it was hacked together.  But, if you’ve got a need for something like this, drop me a line and hopefully I can help you out or just send you the code as long as you promise not to make fun of me :).

Using the app is pretty straightforward:

  1. After entering the connection strings for the API and the DB (since, as mentioned, we haven’t yet found an ALUI / WebCenter API to make the change to the CRC), you click “Load Crawlers”. 
  2. The crawler list shows up in the tree on the left, grouped by Content Source since you’re likely only updating cards based on the NTCWS, and not WWW or Collaboration Server crawlers. 
  3. Clicking on a crawler shows you all the cards associated with that crawler, as well as a bunch of useful metadata. 
  4. From there, you can do a search and replace on the UNC paths for all the cards.  The update process uses the API and Database methods to update the cards and the crawler, so the next time the crawler or Search Update jobs run, no cards are updated since everything matches up – assuming, of course, you’ve already moved the physical files to the new location! 
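The search and replace in step 4 can be sketched roughly as follows. This is not the CardMigrator code: `UncPathRewriter` and `Rewrite` are hypothetical names, and the sketch assumes UNC paths should be compared case-insensitively and only rewritten when the old root matches a whole path segment.

```csharp
using System;

// Hypothetical helper for illustration; not part of CardMigrator or the WCI API.
public static class UncPathRewriter
{
    // Returns location with oldRoot swapped for newRoot, or location unchanged
    // when it does not live under oldRoot.
    public static string Rewrite(string location, string oldRoot, string newRoot)
    {
        if (location == null) throw new ArgumentNullException(nameof(location));
        if (string.IsNullOrEmpty(oldRoot)) throw new ArgumentException("oldRoot is required", nameof(oldRoot));
        if (newRoot == null) throw new ArgumentNullException(nameof(newRoot));

        // Assumption: share and folder names are treated as case-insensitive.
        if (!location.StartsWith(oldRoot, StringComparison.OrdinalIgnoreCase))
            return location;

        // Require a segment boundary so \\oldfileshare does not match \\oldfileshare2.
        bool atBoundary = location.Length == oldRoot.Length
            || location[oldRoot.Length] == '\\'
            || oldRoot.EndsWith("\\", StringComparison.Ordinal);
        if (!atBoundary)
            return location;

        return newRoot + location.Substring(oldRoot.Length);
    }
}
```

For example, `UncPathRewriter.Rewrite(@"\\oldfileshare\folder1\mydoc.doc", @"\\oldfileshare", @"\\newfileshare")` returns `\\newfileshare\folder1\mydoc.doc`, while a path under `\\oldfileshare2` is left alone.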

Some relevant code is after the break; again, drop me a line if you’re looking for more.


Updating the Location of a Crawled Card in WebCenter Interaction

Friday, September 24th, 2010

Much has been written on Content Crawlers and Cards in Plumtree’s Knowledge Directory: Chris Bucchere has done an excellent writeup on creating custom crawlers, and Ross Brodbeck has done the same for cards in the Knowledge Directory.  In fact, as I re-read those two articles, I realize this post addresses open issues in both articles – how to change the location of a card, and what the Location CRC values are within a card.

In the spirit of giving credit where credit is due, today’s post is based on an excellent tip I learned recently from Fabien Sanglier, who figured out this little gem long before I did, and I believe had even posted code on his ALUI Toolbox project.

First, a word on crawlers:  basically, WCI’s Automation Server just calls several methods in a crawler to perform the following (I’m heavily paraphrasing here):

  1. Open the root path specified in the crawler’s SCI Page and query for “containers” (a.k.a., folders).
  2. Query that container for all “documents” (a.k.a., cards, which don’t necessarily have to be files).
  3. Recursively iterate through each container and query for the documents within each.
  4. For each document found, query for document signature and document fetch URL.
  5. If the document signature or path has changed, flag the card as changed and refresh it (which could mean its metadata, file content, or security).
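As a rough illustration of the steps above, here is a minimal in-memory sketch of the walk and the change check. The types `CrawlContainer`, `CrawlDocument`, and `CrawlWalker` are hypothetical stand-ins invented for this example; they are not the real WCI crawler interfaces, and the real Automation Server does considerably more.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a "document" (card source); not a WCI type.
public sealed class CrawlDocument
{
    public string Location { get; }   // path or fetch URL
    public string Signature { get; }  // e.g. a last-modified date
    public CrawlDocument(string location, string signature)
    {
        Location = location;
        Signature = signature;
    }
}

// Hypothetical stand-in for a "container" (folder); not a WCI type.
public sealed class CrawlContainer
{
    public List<CrawlContainer> Children { get; } = new List<CrawlContainer>();
    public List<CrawlDocument> Documents { get; } = new List<CrawlDocument>();
}

public static class CrawlWalker
{
    // Steps 1 to 4: depth-first walk yielding every document under root.
    public static IEnumerable<CrawlDocument> Walk(CrawlContainer root)
    {
        foreach (CrawlDocument doc in root.Documents)
            yield return doc;
        foreach (CrawlContainer child in root.Children)
            foreach (CrawlDocument doc in Walk(child))
                yield return doc;
    }

    // Step 5: a document counts as changed when its location is unknown
    // (new or moved) or its signature differs from the one recorded earlier.
    public static List<string> FindChanged(CrawlContainer root,
                                           IDictionary<string, string> knownSignatures)
    {
        var changed = new List<string>();
        foreach (CrawlDocument doc in Walk(root))
        {
            if (!knownSignatures.TryGetValue(doc.Location, out string known)
                || known != doc.Signature)
            {
                changed.Add(doc.Location);
            }
        }
        return changed;
    }
}
```

Here `knownSignatures` plays the role of what the portal already has on record for each card, keyed by location; anything returned by `FindChanged` would be refreshed or imported.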

Later, the Document Refresh and Search Update jobs will also use that crawler code to keep track of whether documents have changed in the source repository (by checking the document signature), and whether the document has moved.  If neither the signature nor the location has changed, the card remains untouched.

Now, let’s say you need to change the path of an NT Crawler because you’re moving those documents elsewhere.  Normally, you’d just move the files and change your crawler’s root path.  The problem with this approach is that the crawler won’t be able to recognize these files as the same ones that are in the Knowledge Directory, because the path has changed.  Consequently, all cards will be deleted and recreated.  This may not be a problem, but if your Content Managers are like any other Content Manager since the Plumtree days, there will be a lot of portlets that link to these documents in their content.  These links will all be broken, because new cards mean new Object IDs, which are part of those URLs (even the “friendly” ones).

The (partial) solution?  Update the paths for the crawlers AND cards through the API, so that the next time the crawler runs, the portal isn’t aware of any changes and doesn’t mess with any of the already-imported cards because the signatures match up.

Here’s the rub, though: not only does the Automation Server check whether a document’s SIGNATURE has changed (in an NT File Crawler, for example, the signature is just the “last-modified” date), it also checks whether the document’s PATH has changed.  In other words, if a card has an internal path of \\oldfileshare\folder1\mydoc.doc, and you programmatically change the crawler AND the cards to use \\newfileshare\folder1\mydoc.doc, the cards will STILL get wiped out and crawled in as new.  This is because the portal stores a CRC of the document path with each card, so that if the path no longer matches the stored CRC, it knows it’s looking at a different document.
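To make the two-part check concrete, here is a small sketch of the comparison as described above. Everything in it is an assumption for illustration: `CardMatch` is a hypothetical class, and `StandInChecksum` is a 64-bit FNV-1a hash split into two ints, used only as a placeholder for the portal’s own CRC routine, which it does not reproduce.

```csharp
using System;
using System.Text;

// Hypothetical illustration; not portal code.
public static class CardMatch
{
    // Placeholder checksum: 64-bit FNV-1a split into two 32-bit halves.
    // The portal computes its location CRC differently; only the idea of
    // "two ints derived from the path" is mirrored here.
    public static (int A, int B) StandInChecksum(string location)
    {
        ulong hash = 14695981039346656037UL;          // FNV offset basis
        foreach (byte b in Encoding.UTF8.GetBytes(location))
        {
            hash ^= b;
            hash = unchecked(hash * 1099511628211UL); // FNV prime
        }
        return (unchecked((int)(hash >> 32)), unchecked((int)hash));
    }

    // A card is treated as the same document only when BOTH the signature
    // and the checksum of the current location match what was stored.
    public static bool IsSameCard(string storedSignature, (int A, int B) storedCrc,
                                  string currentSignature, string currentLocation)
    {
        return storedSignature == currentSignature
            && storedCrc == StandInChecksum(currentLocation);
    }
}
```

With a checksum stored for \\oldfileshare\folder1\mydoc.doc, `IsSameCard` returns false for \\newfileshare\folder1\mydoc.doc even when the signature is identical; it only returns true again once the stored checksum is recomputed from the new path, which is the point of the database update described next.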

Unfortunately, there doesn’t seem to be a way to update this CRC value through the API, so you need to use a direct DB update to make the change.  Below is the code used to generate the CRC and the table where it needs to be updated.  In my next post, I’ll include a more comprehensive listing.

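// Generate the CRC once and reuse both halves instead of computing it twice.
var crc = XPCRC.GenerateCRC64(strCardLocation);
int crca = crc.m_crcA;
int crcb = crc.m_crcB;

// Both CRC halves and the card ID are numeric, so the concatenated SQL below carries no free text.
DbCommand updateComm = oConn.CreateCommand();
updateComm.CommandType = CommandType.Text;
updateComm.CommandText = "update ptinternalcardinfo set locationcrc_a = " + crca + ", locationcrc_b = " + crcb + " where cardid = " + card.GetObjectID();
// Assumes oConn is already open; without this call the update is never sent.
updateComm.ExecuteNonQuery();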