WebCenter Interaction Search Server can be pretty finicky. It used to be rock-solid in the Plumtree days, but stability has declined a bit since Clustering was added in the BEA Aqualogic days. I don’t even recommend clustering any more if an index is small and manageable enough – I’ve just spent way too much time wiping out and rebuilding indexes due to problems with the cluster sync.
Over time, the database and the search index get out of synch, which is completely normal. In some cases, a custom utility like Integryst’s SearchFixer is needed to repair the index without purging it, but in most cases, a simple Search Repair will do. To schedule a Search Repair, go to Select Utility: Search Server Manager, change the repair date to some time in the past, and click Finish:
The Search Repair process runs when you run the Search Update Job. Just make sure in the log you see the line “Search Update Agent is repairing the directories…”:
Why would you NOT see this? For years I’ve been doing this “schedule search repair to run in the past, kick off Search Update job” procedure, and it always seemed like I had to run the Search Update job twice to get the search repair to actually work. It turns out, though, that you don’t have to run the job twice – you just need to wait for a minute after you schedule the search repair to run in the past.
Why? Because when you set the repair to run in the past and click Finish, the date doesn’t STAY in the past – instead, the date gets set to “Now plus 1 minute”. So if you kick off the Search Update job within a minute of changing the search repair date, the repair is still scheduled to run a couple of seconds in the future.
Note that this little date-related nuisance uses the same logic that Search Checkpoint scheduling uses – dates in the past are actually set to “now plus one”. This is significant because it changes the way you should plan schedules. For example, if you want an Automation Server job to run now, and then continue to run every Saturday at midnight, you’d just go into the job and schedule it for LAST Saturday at midnight, repeating every 1 week. Automation Server will see that the job is past-due, run it now, and schedule the next run time for next Saturday at midnight. With Search Repair or Checkpoint manager, if you do this, the date will actually be set to “now plus 1”, and the job will run now. But the next job will be scheduled for 7 days from now, not 7 days from the original “Saturday at midnight” schedule.
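To make the difference concrete, here is a minimal Python sketch of the two behaviors as described above. It is not the portal's actual code: the function names are made up for illustration, and the logic is only my reading of what the two schedulers appear to do with a date in the past.

```python
from datetime import datetime, timedelta


def automation_style(scheduled, interval, now):
    """Past-due job runs now; the next run stays anchored to the original schedule."""
    if scheduled > now:
        return scheduled, scheduled + interval
    next_run = scheduled
    while next_run <= now:
        next_run += interval
    return now, next_run


def repair_style(scheduled, interval, now):
    """A past date is replaced with "now plus 1 minute"; repeats count from there."""
    first_run = scheduled if scheduled > now else now + timedelta(minutes=1)
    return first_run, first_run + interval


# Example: it is Wednesday morning, and the schedule is set to LAST Saturday at midnight.
now = datetime(2010, 8, 11, 10, 0)
last_saturday = datetime(2010, 8, 7, 0, 0)
weekly = timedelta(weeks=1)

print(automation_style(last_saturday, weekly, now))  # next run lands on Saturday at midnight
print(repair_style(last_saturday, weekly, now))      # next run lands on Wednesday at 10:01
```

The second result is the trap: the weekly repeat drifts to whatever time you happened to click Finish.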
Bottom line: for heavy jobs like search repair or checkpoint management, you should schedule them to run in the future at a specific (weekend or off-hours) date rather than some time in the past.
Some of you may remember Randy Pausch’s Last Lecture as an inspirational and courageous last lecture in the face of a terminal disease. I remember him as one of my professors at University of Virginia. It was the best course I ever had – CS 305 I think – called Usability something or other.
Our first assignment was to find “unusable doors”. No kidding, that was it: scour the campus, sit outside a door, and count failure rates (pushing instead of pulling, pushing the wrong side, etc.). What an eye-opening experience! We learned to observe using the things engineers like us created and quantify failure rates. We were shocked at the high failure rates in something as ridiculously simple as A DOOR. After all, Pausch said, “doors have been around for 5,000 years, and today’s engineers have yet to master them”.
[Side note: while we quantified door failure rates and submitted our reports, Randy went with a little more "soft" approach to grading, quantifying reports as "The Good, The Bad, and The Ugly". My report - creatively mentioning "the flames of Satan being visible in the door's reflection" - earned "The Ugly" distinction.]
In addition to quantifying failure rates, we learned to experiment with designs before, during, and after implementing them. In the intervening years, I’ve decided to write a book on the topic. It is going to be a masterpiece called “The Ultimate Solution to Usability in Everyday Tools, Software, and Life”. But I’ve never gotten past one page. One sentence, in fact. The entire content of my book is going to be a single page, with a single sentence, with three words:
USE IT ONCE.
Recently I was trying to sell some gear on EBay. I had to verify my identity by calling a phone number. Ridiculous form of identity validation, but fine – at least they can KIND OF trace a seller to a phone number – so I played along.
So I called. And for 14 excruciating minutes I was greeted with this:
Yes, those aren’t artifacts of the recording – that was a full-on ear rape for 14 minutes of carnival music fading in and out.
Which brings me to my point: EBay has inexcusably failed the litmus test for usability: quantify your results and use it once. If any one of those support representatives had called their own support line, they would have realized that ear-raping the very clients they depend on is not a very good business practice. And making them wait on hold for 14 minutes to use their service doesn’t really cut it in this “need-it-now” world…
So, I implore any and all of you portal system admins: it’s all too easy to focus on a specific aspect of a system like keeping things running (I find myself in this position more often than I’m comfortable with – not knowing how the system is used but just “keeping it running”). But, if we do something as simple as using it once, we may learn a thing or two about why our end users are so salty when they finally pick up the phone and call to complain.
If you’re using WebCenter Analytics, you’ve no doubt seen this issue before:
The Analytics Context could not be created. This is typically due to a configuration problem. Review the Analytics UI log for more information.
While there are many causes for this and many fixes such as re-scripting the security database, sometimes the simplest solution is overlooked: startup order.
When Analytics needs to be (re)started, the services need to be restarted in the proper order:
WSAPI Server. The API Server provides SOAP access to the portal objects, such as users.
LDAP Directory Service. The LDAP Directory Service connects to the API Server to surface Plumtree users and groups via LDAP.
Analytics UI. This is the service that ultimately provides all the fancy reporting, and can’t work without the other two already set up, since it needs to check credentials (which introduces its own set of problems):
As a side note, the Analytics Collector doesn’t require the API or LDAP service. It simply accepts inbound events such as searches and logins via UDP from the portal and records them to the database. It’s a good thing that the services are separate, since even if the UI isn’t working, in most cases you can be reasonably confident events are still being recorded and not lost forever.
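If you like to see the startup order written down as dependencies, here is a small Python sketch. The dependency map is my own restatement of the three services above, not something read from the product, and the labels are just display names rather than real Windows service names.

```python
from graphlib import TopologicalSorter

# Each service maps to the services that must already be running.
# The Analytics Collector is left out because it depends on neither.
DEPENDS_ON = {
    "WSAPI Server": set(),
    "LDAP Directory Service": {"WSAPI Server"},
    "Analytics UI": {"WSAPI Server", "LDAP Directory Service"},
}


def startup_order(depends_on):
    """Return the services in an order where every dependency comes first."""
    return list(TopologicalSorter(depends_on).static_order())


for step, service in enumerate(startup_order(DEPENDS_ON), start=1):
    print(f"{step}. start {service}")
```

This only prints the order; actually restarting the services is still up to you and whatever tooling you use.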
Another Rock Star in the WebCenter Interaction consulting industry, Bill Benac, wrote a blog post years ago, describing a problem with the WebCenter Interaction 10gR3 installers. I hadn’t worried about it for a long time until it bit me in the ass – after dozens of successful installs and upgrades of the WCI portal, I had never seen the problem he reported. The problem as he described is that sometimes a portal installer chokes and displays some error like:
Serious errors occurred during your installation. Click OK and then click through to the end of installation to complete installation and then look at log for WebCenter Interaction in …
Recently, the same error bit me during an ALUI upgrade, and I saw pretty much the same error in the portal, Collaboration Server, and Analytics. The errors seemed benign so I just ignored them until I realized that the WebCenter Analytics installer hadn’t created the Analytics Collector Service.
It turns out – and I have no idea why I’d never come across this issue with other installs and upgrades – that the WCI installers look for free memory on the host machine. In some (unknown and unusual) circumstances, it can’t query the Windows OS for free memory, so it defaults to 0. But 0GB of free RAM is less than what it needs, so the installer chokes. In Collab and the Portal, the error is at the end of the installation process, so it seems pretty benign, but for Analytics, it gets thrown before the services are created, so you’re boned unless you fix it.
As for fixing it, check out Bill’s Blog Post, but the gist is that you need to set a fixed amount of Virtual Memory to avoid an error like…
If you had asked me last month if you should install Windows Updates, I’d have said, “without hesitation, it’s a Best Practice to install Windows Updates as soon as possible; I’ve never seen one break portal functionality – whether it was in the Plumtree days, ALUI days, or lately with WebCenter”.
This month, the answer is: “without hesitation, it’s a Best Practice to install Windows Updates as soon as possible, but make sure to keep track of those updates and keep an eye out for problems when you’re done”. Generally, I still think they’re safe and don’t warrant a full regression test once you’re done, but for the first time, I’ve come across a Windows Update that breaks a piece of the WCI portal – specifically, portlet requests to SSL-protected Remote Servers.
Fortunately, Oracle’s support center came through on this one, and clearly documents the problem in KB article 1131443.1: “SSL Portlet Communication Fails After Installing Microsoft Recommended Security Update KB968389 [ID 1131443.1]”. In summary, there is a certain combination of hotfixes that causes SSL connections from the portal to the remote tier to fail, as documented in the KB article and reproduced after the break.
The thing is, the KB article talks about one “real” Microsoft hotfix [KB968389] interacting with two other “unsupported” hotfixes [KB973667 and KB942636]. It talks about removing the two unsupported fixes, but on the system I was experiencing the problems on, those two weren’t actually installed. But I did see the one hotfix in there, and once I uninstalled that one (and rebooted), the problem went away.
My best guess at this point is that those two hotfixes from Microsoft (unsupported ones that “are intended to be installed only for customers experiencing this problem”) eventually got rolled into an official, supported hotfix with a different number since the Oracle article was published in June 2010. And Oracle will eventually update the above KB article listing that “official” hotfix number as well.
Ah, security. Here we go again. My thesis in this post is that we all occasionally mistake “complexity” for “security” when choosing passwords – or, as administrators, setting password policy. An IT administrator who checks every box in the password policy configuration may not be doing much more to secure users’ accounts than his peer who sets a password to “12345” to “test things out” – and forgets to change it later. Similarly, an admin who configures passwords to expire every two weeks may be less secure than a more pragmatic one who sets a time limit of 3 or 6 months.
Countless essays, papers, statistical analyses, and blog posts have discussed the topic of passwords (a remarkably rich subject), so hopefully I’m not just adding to the noise by saying: All too often, I see people forget about the “Threat Matrix” (not related to, well, anything by the same name).
The “threat matrix” is really a multi-dimensional graph of vulnerabilities, responses, and new vulnerabilities caused by those responses. But for the sake of this post, let’s look at two of the dimensions:
In what ways can a password be circumvented, and
How can you counter those threats in the most effective way?
A password can be beaten with a random string attack, dictionary attack, network sniffer, or simply a bad guy dropping by your office after-hours and rifling through your drawers. A common mistake is thinking that common forms of thwarting some attacks necessarily make ALL attacks less likely. Almost by definition, decreasing the odds of one attack type increases the odds of another.
So, to the admins out there setting security policy: consider that the security benefits to increasing password length and complexity requirements do NOT rise linearly with increased length and complexity. In fact, they drop off pretty quickly.
A password that has a requirement of “10 characters, at least one lower-case and one upper-case, one number, one special character, and one ancient greek symbol that doesn’t appear on your keyboard” is NOT a more secure password. Because, by the time the frustrated user has tried 47 different memorable-but-impossible-to-remember passwords, s/he’s gonna have to write that damned thing down – and we all know THAT isn’t secure.
Full l33tspeak is not a secure password strategy. If every one of your passwords is the l33tspeak version of the username (alidbuser/@l1dbu$3r, contentdb/c0nt3ntdb), it’s not secure.
Dictionary attacks against a web site are impractical, and permanently locking accounts as a way to thwart them after 3 failed login attempts is ridiculous. At the very least, if you’re going to lock accounts, have them auto-unlock after 10 minutes. This makes the effort to even try hundreds of passwords impractical, let alone the millions or billions that would be required for a full dictionary attack.
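A quick back-of-the-envelope calculation shows why the auto-unlock is enough. This Python sketch assumes an attacker gets exactly 3 guesses per 10-minute lockout window against a single account; the one-million-word dictionary is just an illustrative size, not a measured figure.

```python
def attempts_per_day(attempts_before_lock, lock_minutes):
    """Guesses an attacker can make per day against one account."""
    windows_per_day = (24 * 60) / lock_minutes
    return attempts_before_lock * windows_per_day


def days_to_try(candidates, attempts_before_lock, lock_minutes):
    """Days needed to work through a list of candidate passwords."""
    return candidates / attempts_per_day(attempts_before_lock, lock_minutes)


per_day = attempts_per_day(3, 10)
print(f"{per_day:.0f} guesses per day")
print(f"{days_to_try(1_000_000, 3, 10) / 365:.1f} years for a million-word dictionary")
```

Under those assumptions a single account can be hit with only a few hundred guesses a day, and a million-entry list takes years, without ever locking a legitimate user out for more than 10 minutes.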
I think of all the blog posts I’ve written, this may have taken the longest. I’ve written, re-written, and trimmed pages and pages of text to basically complain about amazingly complex password rules that some clients have in place without even knowing WHY (“because they’re more secure” is not the correct answer).
As I’ve continually pruned this post so as not to completely bore you, I realize that the Threat Matrix is an important concept that all IT people should consider in all aspects of daily IT work. There are plenty of real-world scenarios where the matrix of threats and responses is not fully understood, and hopefully we can shed some light on these in future posts.
It’s not without some controversy (OK, “spirited discussion”), but I’ve strongly recommended the use of host files to aid environment portability. If you’re a believer in this “alias” approach, you’ll find that for some components, it isn’t very obvious how to set up those aliases. This isn’t quite a host file hack, but serves the same purpose: when you migrate the database from one environment to the other, you want to avoid having to change as many settings as possible.
One of these settings is the ALUI Automation Server: in “Select Utility: Automation Service”, you get a list of servers running the Automation Service, and can set which administrative folders are associated to which Job (aka Automation) Servers. If you migrate the portal database between environments, you might have one entry show up for “PRODPORTAL3” (in prod) and another for “PORTALDEV9” (in dev). But then in the dev environment you have to re-register every one of the folders that was associated with the prod server.
What if you could just create an alias that worked in both environments? Fortunately, you can, and the tweak is easy: Just edit %PT_HOME%\settings\configuration.xml in both environments, and change the value below to be the same thing. Then, when the automation server in either environment starts up, it’ll look for jobs registered with that same name:
Government work can be a challenge with all the rules, regulations, and procedures that come with it. But there’s one thing I have to continually remind myself when dealing with that “way too much paperwork” thing: whether I’m administering a government web site, ALUI portal, or any other web application, security can and MUST be taken seriously at all times.
So, consider this a friendly reminder – especially if you’re exposing your portal on the Internet: stay vigilant, and take all threats seriously. About 18 months ago, I got an alert in the middle of the night that we were out of drive space on a portal server at one of my semi-government clients. No big deal; it happens all the time. Only this time it was different. Overnight, our logs had exploded from roughly 20MB/day to 2GB/day: something was seriously wrong. The logs were so big they were hard to even open, but when I did finally crack them open, here’s a snippet of what I found:
Basically, there were GIGABYTES of these requests – someone was scanning our servers, alternating in different object IDs for different spaces, looking for incorrectly secured communities or other portal objects. They were basically just scanning different activity spaces, making all kinds of semi-random requests with different IDs a couple times a second.
It turned out that these particular baddies weren’t that sophisticated: they were making no effort to conceal their source IPs through some sort of distributed attack, and their algorithm clearly didn’t demonstrate a deep knowledge of how portal URLs are constructed. And honestly, we were lucky to even find this attack in the first place because at the time we didn’t regularly audit the logs, and only caught it because of that benign disk space warning.
In the end, we blocked the entire subnet (from China, a notorious hacker hangout), and the attacks stopped. We should have reported the attempted breach, and I certainly would if it happened again, but I’m sharing this story with a single moral: no matter how “little” you think your site may be or how you think “no one cares about my little corner of the internet”, the bad guys are out there, and they don’t discriminate when they’re looking for victims.
So, take a minute today to check your security settings one more time, and keep an eye on those log files for anything suspicious!
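If you want a starting point for that log check, here is a rough Python sketch that counts requests per client IP so a single noisy source stands out. It assumes each log line contains the client's IPv4 address somewhere in the text; the sample lines and addresses below are made up, and your web server's real log format will differ.

```python
import re
from collections import Counter

# Matches the first thing on a line that looks like an IPv4 address.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def top_clients(log_lines, limit=5):
    """Return the busiest client IPs as (ip, request_count) pairs."""
    counts = Counter()
    for line in log_lines:
        match = IPV4.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts.most_common(limit)


# Made-up sample lines, using documentation-only IP ranges.
SAMPLE_LOG = [
    "203.0.113.7 GET /portal/server.pt?objID=201",
    "203.0.113.7 GET /portal/server.pt?objID=202",
    "203.0.113.7 GET /portal/server.pt?objID=203",
    "198.51.100.4 GET /portal/server.pt",
]

for ip, hits in top_clients(SAMPLE_LOG):
    print(f"{ip}: {hits} requests")
```

Run something like this on a schedule and you stand a chance of noticing a scan before the disk fills up.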
The Plumtree Server stack has had a long history, forming a decent patchwork of usable applications, but never quite getting to the point where every part of the stack is consistently configured. When it became ALUI (AquaLogic User Interaction), there was a movement towards putting configuration settings in one place – the Configuration Manager – but unfortunately now that Oracle is holding the reins and the future of the stack is in question, it looks like we’ll never have that utopian vision of a single, centralized way of configuring all applications the same way.
Case in point: configuring the memory parameters for Collaboration Server. While Publisher utilizes a config file for memory settings, Collaboration Server passes memory parameters via the service startup path. So, if you’ve got a decently large Collaboration install, you might find that you’re running relatively low on memory:
To up the amount of RAM available to Collaboration Server, you need to edit the registry (and yeah, back it up first!). The key you need to change is in HKLM\SYSTEM\CurrentControlSet\Services\ptcollaborationserver, and it’s called “ImagePath”:
Change the “-Xmx” value to something larger, restart Collab, and you’ll have more RAM breathing room:
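If you’d rather not hand-edit the string, the change itself is a simple substitution. This Python sketch only rewrites the text of an ImagePath value; it does not touch the registry. The sample command line is hypothetical (your actual ImagePath will look different), and the new heap size is an arbitrary example.

```python
import re


def set_max_heap(image_path, new_size):
    """Return image_path with its -Xmx setting replaced by new_size (e.g. "1024m")."""
    updated, count = re.subn(r"-Xmx\d+[kKmMgG]?", f"-Xmx{new_size}", image_path)
    if count == 0:
        raise ValueError("no -Xmx option found in ImagePath")
    return updated


# Hypothetical ImagePath value, for illustration only.
SAMPLE_IMAGE_PATH = r'"C:\example\java.exe" -Xms256m -Xmx512m -jar collab.jar'

print(set_max_heap(SAMPLE_IMAGE_PATH, "1024m"))
```

Paste the result back into the ImagePath value and restart the service as described above.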
When configuring a WebCenter Interaction portal, it’s highly recommended to use host files on your machines to provide aliases for the various services.
For example, instead of referencing Publisher’s Remote Server as http://PORTALPROD6.site.org:7087/ptcs/, edit the hosts file at C:\Windows\System32\drivers\etc\hosts, and add a line like this:
10.5.38.12 wci-publisher #IP Address for Publisher in this environment
… then set your Remote Server to http://wci-publisher:7087/ptcs/.
I’m always surprised how many times the knee-jerk reaction to this suggestion is that this is a poor “hack”, or something worse like this:
“Host files??? Host files on local servers need to be avoid and you should use DNS in AD for the Portal servers. Host files, again, are an antiquated and unmanageable configuration in this day and age and, in my opinion, should only be used when testing configurations—not for Production systems. I haven’t seen host files used locally on servers in a decade…is that how you are configuring this portal system? If so, I would highly recommend you try to use the AD DNS instead.”
Yes, that’s an actual response from an IT guy who prefers telling others what idiots they are rather than actually listening to WHY this approach is being used. In all fairness, most knee-jerk reactions are based in the reality that host files are more difficult to maintain on many servers rather than DNS entries on a single server. But hopefully, if you’re reading this blog, you’ve got an open mind, and will agree with this approach once you see the list of benefits below.
Benefits of using host files in your portal environments:
Service Mobility. Take the NT Crawler Web Service, for example. When you crawl documents into the portal, the name of the server is included in the document open URL. Now suppose the NTCWS is giving you all sorts of grief and you decide to move it to another server. If you use host files, you can just install the NTCWS somewhere else and change the IP address that the wci-ntcws alias points to. This way, the portal has no idea the service is being provided by another physical system. If you used a machine name, all documents would get crawled in as new the next time you ran the crawler, because the card locations will have changed.
Maintainability. This one’s a pretty weak argument, but is based on the fact that most of the time, the Portal Admin team doesn’t have access to create DNS entries and has to submit service requests to get that done. By bringing “DNS-type services” into host files, the portal team can more easily maintain the environment by shifting around services without having to submit “all that paperwork” for a DNS entry (your mileage may vary with this argument).
Environment Migration. Here’s the clincher! Most of us have a production and a development environment, and occasionally a test environment as well. Normally, code is developed in dev and pushed to test, then to prod, but content is created in prod, and periodically migrated back to test and dev, so those environments are reasonably in synch for testing. This content migration is typically done by back-filling the entire production database (and migrating files in the document repository, etc.). The problem is, all kinds of URLs (Remote Servers, Search, Automation server names, etc.) are stored in this database, so if you’re using server names in these URLs, your dev/test environments will now have Remote Servers that point to the production machines, and you need to go through and update all of these URLs to get your dev environment working again! If, however, you use host files, then you can skip this painful step: your Publisher server URL (http://wci-publisher:7087/ptcs/) can be the same in both environments, but the host files in dev point to different machines than the ones in production. Cool, huh?
Disaster Recovery. This is essentially the same as the “Environment Migration” benefit: When you have a replicated off-site Disaster Recovery environment, by definition your two databases are kept in synch in real-time (or possibly on a daily schedule of some sort). If a disaster occurs and you lose your primary environment, you’re going to want that DR site up as soon as possible, and not have to go through changing all those URLs to get the new environment running with new machine names. Of course, unlike “Environment Migration” (where your dev, test, and prod environments typically share the same DNS server), this argument is also slightly weaker. Since the DR site will likely have its own DNS server, you could conceivably just use different DNS entries at the two different sites and all will work fine.
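To illustrate the migration benefit, here is a small Python sketch that parses hosts-file text and shows the same alias resolving to a different machine in each environment. The production IP is the one from the example earlier in this post; the dev IP is made up, and the parser is a simplified reading of the hosts format (IP address first, then one or more names, with # starting a comment).

```python
def parse_hosts(text):
    """Map each host name in hosts-file text to its IP address."""
    aliases = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            aliases[name] = ip
    return aliases


PROD_HOSTS = """
10.5.38.12   wci-publisher   # Publisher in production
"""

DEV_HOSTS = """
192.0.2.12   wci-publisher   # Publisher in dev (hypothetical address)
"""

# The URL stored in the portal database is identical in both environments.
REMOTE_SERVER_URL = "http://wci-publisher:7087/ptcs/"

for env, hosts in (("prod", PROD_HOSTS), ("dev", DEV_HOSTS)):
    print(env, REMOTE_SERVER_URL, "->", parse_hosts(hosts)["wci-publisher"])
```

The database never changes; only the hosts file on each machine decides where the alias lands.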
So that’s it – hopefully you’re convinced that host files are the way to go for configuring ALUI / WCI portals; if so, stay tuned for helpful tips on how to set this up for various servers. While Remote Servers are a no-brainer, configuring things like Automation Server and Search can be a little trickier.