Archive for the ‘Best Practice’ Category

Use Firefox 3D View to diagnose CSS issues

Sunday, January 5th, 2014

It’s been a while since our last post as I’ve been busy with my other blog, homeautomationguru.com, but this little gem was too neat to pass up (thanks Aseem for the tip!).

When diagnosing complex CSS or HTML issues with multiple layers, there’s a nifty little 3D view built into Firefox that allows you to rotate around all the various layers, inspecting the elements that may be causing you problems.

Simply hit F12 to bring up the debugger pane (which, incidentally, also opens the dev tools in IE and Chrome), then click the “3D View” button:
[Screenshot: firefox-3d-view – the 3D View button in the Firefox developer tools]

Turn on Response Time Monitoring in IIS, Apache, or WebLogic

Thursday, May 17th, 2012

You’ve probably got monitors set up for your various portal servers and components, and are using those to track up-time of your servers. And you may even use an awesome tool like New Relic to track response times in addition to up-time.

But, if you don’t have or need anything fancy (or even if you do, for that matter), one of the most common tweaks I recommend to customers prior to a Health Check is to turn on Response Time monitoring. By default, web and application servers like IIS and WebLogic don’t track how long they take to serve up a page, but it’s easy to turn that on for later analysis.

In IIS, you just turn on the “time-taken” field in the W3C logs:
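In recent IIS versions that’s the “Time Taken (time-taken)” checkbox under Logging > Select Fields in IIS Manager (or the logExtFileFlags attribute, if you script your configuration); once it’s on, the W3C log picks up a time-taken column, something like this trimmed illustration (not a real log):

#Fields: date time c-ip cs-method cs-uri-stem sc-status time-taken
2012-02-14 01:11:51 10.10.12.26 GET /portal/server.pt 200 2340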

In Apache or other Java app servers, use a LogFormat line like this:

LogFormat "%{X-Forwarded-For}i %l %u %t %p \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\" %h %I %O %D" combined-withcookie

It’s the %D that logs the time taken in microseconds – see the Apache mod_log_config documentation for details.
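One gotcha: the LogFormat line only defines the format – you still have to reference it by name in a CustomLog directive, along these lines (adjust the path for your layout):

CustomLog logs/access_log combined-withcookie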

Either way, you should get a log line that looks like this:

- - - [14/Feb/2012:01:11:51 -0400] 80 "GET /portal/server.pt HTTP/1.1" 200 843 "http://portal/server.pt?space=Login" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.1)" "" 10.10.12.26 710 1133 17099

… where one of the fields is the time taken to process and serve the request – milliseconds for IIS’s time-taken, microseconds for Apache’s %D. This is hugely valuable information for identifying which pages, communities, or content types are slowing you down.
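If you want a quick way to chew through one of these Apache logs, here’s a minimal sketch – plain Python, assuming the combined-withcookie format above where %D is the last field on each line – that ranks request paths by average response time:

import re
import sys
from collections import defaultdict

# Pull the request path out of the quoted %r field ("GET /portal/server.pt HTTP/1.1")
REQUEST = re.compile(r'"(?:GET|POST|HEAD|PUT|DELETE) (\S+)')

totals = defaultdict(lambda: [0, 0])   # path -> [hits, total microseconds]

with open(sys.argv[1]) as log:         # usage: python slowpages.py access_log
    for line in log:
        try:
            micros = int(line.rsplit(None, 1)[-1])   # %D is the last field
        except (ValueError, IndexError):
            continue                                 # skip malformed lines
        match = REQUEST.search(line)
        if not match:
            continue
        path = match.group(1).split("?", 1)[0]       # drop the query string
        totals[path][0] += 1
        totals[path][1] += micros

slowest = sorted(totals.items(), key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
for path, (hits, total) in slowest[:20]:
    print(f"{total / hits / 1000:10.1f} ms avg  {hits:7d} hits  {path}")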

WCI Search Repair Scheduling Nuances

Monday, October 10th, 2011

WebCenter Interaction Search Server can be pretty finicky. It used to be rock-solid in the Plumtree days, but stability has declined a bit since Clustering was added in the BEA AquaLogic days. I don’t even recommend clustering any more if an index is small and manageable enough – I’ve just spent way too much time wiping out and rebuilding indexes due to problems with the cluster sync.

Over time, the database and the search index get out of synch, which is completely normal. In some cases, a custom utility like Integryst’s SearchFixer is needed to repair the index without purging it, but in most cases, a simple Search Repair will do. To schedule a Search Repair, go to Select Utility: Search Server Manager, change the repair date to some time in the past, and click Finish:

The Search Repair process runs when you run the Search Update Job. Just make sure in the log you see the line “Search Update Agent is repairing the directories…”:

Why would you NOT see this? For years I’ve been doing this “schedule search repair to run in the past, kick off Search Update job” procedure, and it always seemed like I had to run the Search Update job twice to get the search repair to actually work. It turns out, though, that you don’t have to run the job twice – you just need to wait for a minute after you schedule the search repair to run in the past.

Why? Because when you set the repair to run in the past and click Finish, the date doesn’t STAY in the past – instead, the date gets set to “Now plus 1 minute”. So if you kick off the Search Update job within a minute of changing the search repair date, the repair is still scheduled to run a couple of seconds in the future.

Note that this little date-related nuisance uses the same logic that Search Checkpoint scheduling uses – dates in the past are actually set to “now plus one”. This is significant because it changes the way you should plan schedules. For example, if you want an Automation Server job to run now, and then continue to run every Saturday at midnight, you’d just go into the job and schedule it for LAST Saturday at midnight, repeating every 1 week. Automation Server will see that the job is past-due, run it now, and schedule the next run time for next Saturday at midnight. With Search Repair or Checkpoint manager, if you do this, the date will actually be set to “now plus 1”, and the job will run now. But the next job will be scheduled for 7 days from now, not 7 days from the original “Saturday at midnight” schedule.
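To make the difference concrete, here’s a rough sketch – plain Python that just models the behavior described above, obviously not product code – of how the two schedulers treat a start date in the past:

from datetime import timedelta

def automation_server_runs(anchor, interval, now):
    # Automation Server: a past-due job fires immediately, but the schedule keeps
    # ticking from the original anchor (e.g. every Saturday at midnight).
    next_run = anchor
    while next_run <= now:
        next_run += interval
    return now, next_run                    # runs now, then stays on the anchor cadence

def search_repair_runs(anchor, interval, now):
    # Search Repair / Checkpoint: a past date is rewritten to "now plus one minute",
    # and the following run is computed from that new date, not the original anchor.
    effective = anchor if anchor > now else now + timedelta(minutes=1)
    return effective, effective + interval  # runs in ~1 minute, then N days from *now*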

Bottom line: for heavy jobs like search repair or checkpoint management, you should schedule them to run in the future at a specific (weekend or off-hours) date rather than some time in the past.

Wall of Shame Rant: eBay

Saturday, June 18th, 2011

Some of you may remember Randy Pausch’s Last Lecture as an inspirational and courageous address in the face of a terminal disease. I remember him as one of my professors at the University of Virginia. It was the best course I ever had – CS 305, I think – called Usability something-or-other.

Our first assignment was to find “unusable doors”. No kidding, that was it: scour the campus, sit outside a door, and count failure rates (pushing instead of pulling, pushing the wrong side, etc.). What an eye-opening experience! We learned to observe people using the things engineers like us created, and to quantify failure rates. We were shocked at the high failure rates in something as ridiculously simple as A DOOR. After all, Pausch said, “doors have been around for 5,000 years, and today’s engineers have yet to master them”.

[Side note: while we quantified door failure rates and submitted our reports, Randy went with a little more “soft” approach to grading, classifying reports as “The Good, The Bad, and The Ugly”. My report – creatively mentioning “the flames of Satan being visible in the door’s reflection” – earned “The Ugly” distinction.]

In addition to quantifying failure rates, we learned to experiment with designs before, during, and after implementing them. In the intervening years, I’ve decided to write a book on the topic. It is going to be a masterpiece called “The Ultimate Solution to Usability in Everyday Tools, Software, and Life”. But I’ve never gotten past one page. One sentence, in fact. The entire content of my book is going to be a single page, with a single sentence, with three words:

USE IT ONCE.

Recently I was trying to sell some gear on eBay. I had to verify my identity by calling a phone number. It’s a ridiculous form of identity validation, but fine – at least they can KIND OF trace a seller to a phone number – so I played along.

So I called. And for 14 excruciating minutes I was greeted with this:

[Audio clip: “EBay FAIL” – a recording of the hold line]

Yes, those aren’t artifacts of the recording – that was a full-on ear rape for 14 minutes of carnival music fading in and out.

Which brings me to my point: eBay has inexcusably failed the litmus test for usability: quantify your results and use it once. If any one of those support representatives had called their own support line, they would have realized that ear-raping the very clients they depend on is not a very good business practice. And making them wait on hold for 14 minutes to use their service doesn’t really cut it in this “need-it-now” world…

So, I implore any and all of you portal system admins: it’s all too easy to focus on a specific aspect of a system, like keeping things running (I find myself in this position more often than I’m comfortable with – not knowing how the system is used, just “keeping it running”). But if we do something as simple as using it once, we may learn a thing or two about why our end users are so salty when they finally pick up the phone and call to complain.

WCI Analytics Startup Order

Saturday, May 7th, 2011

If you’re using WebCenter Analytics, you’ve no doubt seen this issue before:

The Analytics Context could not be created.  This is typically due to a configuration problem.  Review the Analytics UI log for more information.

While there are many causes for this and many fixes such as re-scripting the security database, sometimes the simplest solution is overlooked: startup order.

When Analytics needs to be (re)started, the services need to be restarted in the proper order:

  1. WSAPI Server.  The API Server provides SOAP access to the portal objects, such as users.
  2. LDAP Directory Service.  The LDAP Directory Service connects to the API Server to surface Plumtree users and groups via LDAP.
  3. Analytics UI.  This is the service that ultimately provides all the fancy reporting, and it can’t work without the other two already running, since it needs to check credentials (which introduces its own set of problems). A quick sketch of scripting this restart order follows below.
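If you’d rather script the restart than click through services.msc three times, the order looks roughly like this – a minimal sketch in Python, where the service names are placeholders, so substitute whatever your host actually calls them:

import subprocess

# Placeholder service names in the startup order described above -- check
# services.msc on your Analytics host for the exact names:
ORDERED_SERVICES = [
    "WebCenter Interaction API Service",   # 1. WSAPI Server
    "WebCenter Interaction Directory",     # 2. LDAP Directory Service
    "WebCenter Analytics UI",              # 3. Analytics UI
]

for name in reversed(ORDERED_SERVICES):    # stop in reverse order; ignore "not started" errors
    subprocess.run(["net", "stop", name], check=False)

for name in ORDERED_SERVICES:              # start in order; fail fast if a dependency won't come up
    subprocess.run(["net", "start", name], check=True)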

As a side note, the Analytics Collector doesn’t require the API or LDAP service.  It simply accepts inbound events such as searches and logins from the portal via UDP and records them to the database.  It’s a good thing the services are separate: even if the UI isn’t working, in most cases you can be reasonably confident events are still being recorded and not lost forever.


Redux: WCI 10gR3 Installer Errors

Wednesday, November 3rd, 2010

Another Rock Star in the WebCenter Interaction consulting industry, Bill Benac, wrote a blog post years ago describing a problem with the WebCenter Interaction 10gR3 installers.  I hadn’t worried about it for a long time until it bit me in the ass – after dozens of successful installs and upgrades of the WCI portal, I had never seen the problem he reported.  The problem, as he described it, is that sometimes a portal installer chokes and displays an error like:

Serious errors occurred during your installation.  Click OK and then click through to the end of installation to complete installation and then look at log for WebCenter Interaction in …

Recently, the same error bit me during an ALUI upgrade: I saw pretty much the same message in the portal, Collaboration Server, and Analytics installers.  The errors seemed benign, so I just ignored them until I realized that the WebCenter Analytics installer hadn’t created the Analytics Collector Service.

It turns out – and I have no idea why I’d never come across this issue with other installs and upgrades – that the WCI installers check for free memory on the host machine.  In some (unknown and unusual) circumstances, the installer can’t query the Windows OS for free memory, so the value defaults to 0.  But 0GB of free RAM is less than what it needs, so the installer chokes.  In Collab and the Portal, the error comes at the end of the installation process, so it’s pretty benign, but in Analytics it gets thrown before the services are created, so you’re boned unless you fix it.

As for fixing it, check out Bill’s Blog Post, but the gist is that you need to set a fixed amount of Virtual Memory to avoid an error like… (more…)

SSL Portlets can’t be accessed in WebCenter Interaction

Sunday, October 10th, 2010

If you had asked me last month if you should install Windows Updates, I’d have said, “without hesitation, it’s a Best Practice to install Windows Updates as soon as possible; I’ve never seen one break portal functionality – whether it was in the Plumtree days, ALUI days, or lately with WebCenter”. 

This month, the answer is: “without hesitation, it’s a Best Practice to install Windows Updates as soon as possible, but make sure to keep track of those updates and keep an eye out for problems when you’re done”.  Generally, I still think they’re safe and don’t warrant a full regression test once you’re done, but for the first time, I’ve come across a Windows Update that breaks a piece of the WCI portal – specifically, portlet requests to SSL-protected Remote Servers.

Fortunately, Oracle’s support center came through on this one, and clearly documents the problem in KB article 1131443.1: “SSL Portlet Communication Fails After Installing Microsoft Recommended Security Update KB968389 [ID 1131443.1]”.  In summary, there is a certain combination of hotfixes that causes SSL connections from the portal to the remote tier to fail, as documented in the KB article and reproduced after the break.

The thing is, the KB article talks about one “real” Microsoft hotfix [KB968389] interacting with two other “unsupported” hotfixes [KB973667 and KB942636].  It talks about removing the two unsupported fixes, but on the system where I was experiencing the problem, those two weren’t actually installed.  I did see the one hotfix in there, though, and once I uninstalled it (and rebooted), the problem went away.

My best guess at this point is that those two hotfixes from Microsoft (unsupported ones that “are intended to be installed only for customers experiencing this problem”) eventually got rolled into an official, supported hotfix with a different number since the Oracle article was published in June 2010.  And Oracle will eventually update the above KB article listing that “official” hotfix number as well.

(more…)

Some musings on passwords

Thursday, August 19th, 2010

Ah, security.  Here we go again.  My thesis in this post is that we all occasionally mistake “complexity” for “security” when choosing passwords – or, as administrators, setting password policy.  An IT administrator who checks every box in the password policy configuration may not be doing much more to secure users’ accounts than his peer who sets a password to “12345” to “test things out” – and forgets to change it later.  Similarly, an admin who configures passwords to expire every two weeks may be less secure than a more pragmatic one who sets a time limit of 3 or 6 months.

Countless essays, papers, statistical analyses, and blog posts have discussed the topic of passwords (a remarkably rich subject), so hopefully I’m not just adding to the noise by saying: All too often, I see people forget about the “Threat Matrix” (not related to, well, anything by the same name).

The “threat matrix” is really a multi-dimensional graph of vulnerabilities, responses, and new vulnerabilities caused by those responses.  But for the sake of this post, let’s look at two of the dimensions:

  1. In what ways can a password be circumvented, and
  2. How can you counter those threats in the most effective way?

A password can be beaten with a random-string attack, a dictionary attack, a network sniffer, or simply a bad guy dropping by your office after-hours and rifling through your drawers.  A common mistake is assuming that the usual ways of thwarting some attacks necessarily make ALL attacks less likely.  Almost by definition, decreasing the odds of one attack type increases the odds of another.

So, to the admins out there setting security policy: consider that the security benefits of increasing password length and complexity requirements do NOT rise linearly with increased length and complexity.  In fact, they drop off pretty quickly.

  • A password that has a requirement of “10 characters, at least one lower-case and one upper-case letter, one number, one special character, and one ancient Greek symbol that doesn’t appear on your keyboard” is NOT a more secure password.  Because, by the time the frustrated user has tried 47 different memorable-but-impossible-to-remember passwords, s/he’s gonna have to write that damned thing down – and we all know THAT isn’t secure.
  • Full l33tspeak is not a secure password strategy.  If every one of your passwords is the l33tspeak version of the username (alidbuser/@l1dbu$3r, contentdb/c0nt3ntdb), it’s not secure.
  • Dictionary attacks against a web site are impractical, and permanently locking accounts after 3 failed login attempts as a way to thwart them is ridiculous.  At the very least, if you’re going to lock accounts, have them auto-unlock after 10 minutes.  That still makes it impractical to try even hundreds of passwords, let alone the millions or billions a full dictionary attack would require – the quick math below shows why.
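To put some rough numbers behind that last point (a back-of-the-envelope sketch, nothing more):

# With a 10-minute auto-unlock and 3 attempts per lockout window, an online
# guesser tops out at a few hundred tries per day against any one account:
attempts_per_lockout = 3
lockouts_per_day = 24 * 60 // 10        # 144 ten-minute windows in a day
guesses_per_day = attempts_per_lockout * lockouts_per_day
print(guesses_per_day)                  # 432 guesses/day
print(guesses_per_day * 365)            # ~158,000/year -- nowhere near a full dictionary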

I think of all the blog posts I’ve written, this may have taken the longest.  I’ve written, re-written, and trimmed pages and pages of text to basically complain about amazingly complex password rules that some clients have in place without even knowing WHY (“because they’re more secure” is not the correct answer). 

As I’ve continually pruned this post so as not to completely bore you, I’ve realized that the Threat Matrix is an important concept that all IT people should consider in all aspects of daily IT work.  There are plenty of real-world scenarios where the matrix of threats and responses is not fully understood, and hopefully we can shed light on some of these in future posts. (more…)

Changing the Server Name for Automation Server

Thursday, July 29th, 2010

It’s not without some controversy (OK, “spirited discussion”), but I’ve strongly recommended the use of host files to aid environment portability.  If you’re a believer in this “alias” approach, you’ll find that for some components, it isn’t very obvious how to set up those aliases.  This isn’t quite a host file hack, but it serves the same purpose: when you migrate the database from one environment to the other, you want to have to change as few settings as possible.

One of these settings is the ALUI Automation Server: in “Select Utility: Automation Service”, you get a list of servers running the Automation Service, and can set which administrative folders are associated with which Job (aka Automation) Servers.  If you migrate the portal database between environments, you might have one entry show up for “PRODPORTAL3” (in prod) and another for “PORTALDEV9” (in dev).  But then in the dev environment you have to re-register every one of the folders that was associated with the prod server.

What if you could just create an alias that worked in both environments?  Fortunately, you can, and the tweak is easy:  Just edit %PT_HOME%\settings\configuration.xml in both environments, and change the value below to be the same thing.  Then, when the automation server in either environment starts up, it’ll look for jobs registered with that same name:

<component name="automation:server" type="http://www.plumtree.com/config/component/types/automation">
  <setting name="automation-server:server-name">
    <value xsi:type="xsd:string">WCI-AUTOMATION</value>
  </setting>
</component>

Oh, and if you’re a “UI” kind of person, you can achieve the same result by changing the name through the Configuration Manager:

Security Reminder: Stay Vigilant!

Tuesday, July 13th, 2010

Government work can be a challenge with all the rules, regulations, and procedures that come with it.  But there’s one thing I have to continually remind myself of when dealing with all that paperwork: whether I’m administering a government web site, an ALUI portal, or any other web application, security can and MUST be taken seriously at all times.

So, consider this a friendly reminder – especially if you’re exposing your portal on the Internet: stay vigilant, and take all threats seriously.  About 18 months ago, I got an alert in the middle of the night that we were out of drive space on a portal server at one of my semi-government clients.  No big deal; it happens all the time.  Only this time it was different.  Overnight, our logs had exploded from roughly 20MB/day to 2GB/day: something was seriously wrong.  The logs were so big they were hard to even open, but when I did finally crack them open, here’s a snippet of what I found:

Basically, there were GIGABYTES of these requests – someone was scanning our servers, alternating different object IDs for different spaces, looking for incorrectly secured communities or other portal objects. They were cycling through the various activity spaces, making all kinds of semi-random requests with different IDs a couple of times a second.

It turned out that these particular baddies weren’t that sophisticated: they made no effort to conceal their source IPs through some sort of distributed attack, and their algorithm clearly didn’t demonstrate a deep knowledge of how portal URLs are constructed.  And honestly, we were lucky to even find this attack in the first place: at the time we didn’t regularly audit the logs, and we only caught it because of that benign disk-space warning.

In the end, we blocked the entire subnet (from China, a notorious hacker hangout), and the attacks stopped.  We should have reported the attempted breach, and I certainly would if it happened again, but I’m sharing this story with a single moral: no matter how “little” you think your site may be, or how much you think “no one cares about my little corner of the internet”, the bad guys are out there, and they don’t discriminate when they’re looking for victims.

So, take a minute today to check your security settings one more time, and keep an eye on those log files for anything suspicious!
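And if “keep an eye on the logs” sounds vague, even a tiny sketch like this helps – plain Python, with the field position assuming the Apache combined-withcookie format from the response-time post above, so adjust it for your own log layout – and a scanner like the one in this story will stand out immediately:

import sys
from collections import Counter

hits = Counter()
with open(sys.argv[1]) as log:          # usage: python top_talkers.py access_log
    for line in log:
        fields = line.split()
        if len(fields) >= 4:
            hits[fields[-4]] += 1       # %h (client IP) is the 4th field from the end

for ip, count in hits.most_common(10):  # a scanner will dwarf the legitimate traffic
    print(f"{count:10d}  {ip}")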