Archive for the ‘Logs’ Category

Validate WCI Search Server performance

Wednesday, June 27th, 2012

WCI Search is a fickle beast. In my opinion, it started as one of the most stable pieces of the Plumtree stack and has been losing steam since the AquaLogic days with Clustered Search. Nowadays, it seems even a slight breeze will bring the damn thing down, and it either needs to be restarted or rebuilt.

But, if you’re getting reports that “search is slow”, how do you quantify that? Sure, something like Splunk would be useful for event correlation here, but we can build a pretty clear picture with what we’ve already got. And, I’m not talking about WebCenter Analytics – although that is possible too by querying the ASFACT_SEARCH tables.

Instead, let’s try to get the information from the log files. If you look at an entry for an actual search query in the search log files at \ptsearchserver\10.3.0\logs\logfile-trace-*, you’ll see a line that looks like this:

<request client="" duration="0.015" type="query">

So although there’s no timestamp on that line, if we can isolate those lines over a period of time and parse out the “duration” value, we can get a sense of whether the search server is slowing down or whether we need to look elsewhere.

I did this using my old friend UnixUtils, running a simple command to spin through all the search log files and pull out just the lines containing type="query":

cat * | grep "type=\"query\"" > ..\queryresults.txt

The “duration” value can then be parsed out of those lines, and we can produce a nice purdy chart for the client that says “indeed, it seems searches that used to be taking 100ms are now running around 1s, so we should dig into this”:
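Once you’ve got the query lines collected, pulling out the duration values is a quick scripting job. Here’s a minimal sketch in Python (the filename queryresults.txt matches the grep command above; the sample lines mirror the request-line format shown earlier):

```python
# Minimal sketch: extract the duration attribute (in seconds) from each
# type="query" line and summarize, so slowdowns stand out at a glance.
import re
import statistics

def parse_durations(lines):
    """Pull the duration attribute out of each search query log line."""
    pattern = re.compile(r'duration="([\d.]+)"')
    durations = []
    for line in lines:
        match = pattern.search(line)
        if match:
            durations.append(float(match.group(1)))
    return durations

# Example lines in the format shown above:
sample = [
    '<request client="" duration="0.015" type="query">',
    '<request client="" duration="1.203" type="query">',
]
durations = parse_durations(sample)
print(f"queries: {len(durations)}, mean: {statistics.mean(durations):.3f}s")
```

In practice you’d feed it the contents of queryresults.txt (or one file per day) and compare the averages across days to build that chart.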

There’s an App For That 10: LogAnalyzer

Monday, June 11th, 2012

A while back I suggested that you turn on response time monitoring in your IIS or Apache instances. But what good are log files if you can’t make sense of them?

If you’ve got the budget, I totally recommend Splunk for your data processing needs, but today’s App For That is a home-brew log parsing system. It processes server logs and breaks down response times by time interval (such as 1 hour) and request type (such as portal gateway or imageserver requests), calculating the number of hits and the average response time for each interval.

It’s a pretty simple .NET application that can process a large number of log files:

It also allows you to export that data for any particular filter to a .tsv so you can produce charts like this and really tease out some significant information:
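The app itself is .NET, but the core aggregation it performs can be sketched in a few lines of Python. The record format and classification rules below are hypothetical, just to illustrate the bucketing idea:

```python
# Sketch of the core aggregation: bucket hits by (hour, request type),
# tracking the hit count and average response time per bucket.
from collections import defaultdict

def classify(path):
    # Hypothetical classification rules for portal vs. imageserver traffic.
    if path.startswith("/portal/"):
        return "gateway"
    if path.startswith("/imageserver/"):
        return "imageserver"
    return "other"

def aggregate(records):
    """records: iterable of (hour_string, url_path, response_ms) tuples."""
    buckets = defaultdict(lambda: [0, 0.0])  # (hour, type) -> [hits, total_ms]
    for hour, path, ms in records:
        b = buckets[(hour, classify(path))]
        b[0] += 1
        b[1] += ms
    # Collapse each bucket to (hit count, average response time in ms).
    return {k: (hits, total / hits) for k, (hits, total) in buckets.items()}

records = [
    ("2012-06-11 09:00", "/portal/server.pt", 120.0),
    ("2012-06-11 09:00", "/portal/server.pt", 180.0),
    ("2012-06-11 09:00", "/imageserver/plumtree/common/img.gif", 15.0),
]
print(aggregate(records))
```

Exporting those (interval, type) → (hits, average) tuples to a .tsv is what makes the charting step trivial.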

We’ll be open-sourcing it soon, so stay tuned – or drop us a line and we’ll share it with you now.

Cool Tools 23: Splunk

Monday, June 4th, 2012

Chances are you’re drowning in log data, and aren’t really getting the most out of the information you’ve already got readily stored away on your portal servers. There really is a treasure trove of log information already there, but unless you’ve detected a problem with your infrastructure, you likely don’t even look at that data.

Splunk is touted as “the Engine for Machine Data” and is an incredibly powerful tool for analyzing logs and other activity within your infrastructure in real-time. It has a powerful search capability, the ability to display dashboards, and reporting capabilities that can alert you to problems before they become outages.

It works by indexing data from virtually any source that you have – files, event logs, email, etc. – and allowing you to query or produce all kinds of amazing information.

But don’t take my word for it – check out their site and excellent YouTube channel for all kinds of great videos explaining just how powerful a tool this is.

Turn on Response Time Monitoring in IIS, Apache, or WebLogic

Thursday, May 17th, 2012

You’ve probably got monitors set up for your various portal servers and components, and are using those to track up-time of your servers. And you may even use an awesome tool like New Relic to track response times in addition to up-time.

But, if you don’t have or need anything fancy (or even if you do for that matter), one of the most common tweaks I recommend to customers prior to a Health Check is to turn on Response Time monitoring. By default, application servers like IIS or WebLogic don’t track how long they take to serve up a page, but it’s easy to turn that on for later analysis.

In IIS, you just turn on the “time-taken” header in the logs:

In Apache or other Java App Servers, use a line like this:

LogFormat "%{X-Forwarded-For}i %l %u %t %p \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\" %h %I %O %D" combined-withcookie

It’s the %D that logs the time taken in microseconds – see the Apache log documentation for details.

Either way, you should get a log line that looks like this:

- - - [14/Feb/2012:01:11:51 -0400] 80 "GET /portal/ HTTP/1.1" 200 843 "http://portal/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.1)" "" 710 1133 17099

… where one of the items is milliseconds (or, in Apache’s case, microseconds) that were taken to process and serve the request. This is hugely valuable information for identifying which pages, communities, or content types are slowing you down.
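With the LogFormat above, %D is the final field on each line, so pulling the time-taken value out is trivial. A minimal sketch (the sample line below is a shortened version of the one shown above):

```python
# Minimal sketch: with %D as the last field in the LogFormat, the request
# time in microseconds can be pulled off the end of each log line.
def time_taken_us(log_line):
    """Return the %D value (microseconds) from the end of a log line."""
    return int(log_line.rsplit(" ", 1)[-1])

line = ('- - - [14/Feb/2012:01:11:51 -0400] 80 "GET /portal/ HTTP/1.1" 200 843 '
        '"http://portal/" "Mozilla/4.0 (compatible; MSIE 7.0)" "" 710 1133 17099')
print(time_taken_us(line) / 1000.0, "ms")
```

From there you can group by URL to find the slow pages. For IIS, the same idea applies, except the time-taken field is in milliseconds and its position depends on which W3C fields you’ve enabled.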

Prevent Analytics From Logging Gigabytes of Data

Wednesday, May 25th, 2011

This one comes courtesy of Fabien Sanglier, a WCI guru of epic proportions (no, he didn’t make me say that…).

At our client site, we noticed that Analytics was logging gigabytes’ worth of data to the weblogic.out file – a catch-all file for logging reported via the standard out pipe. Note that IIS doesn’t record logs on this stream, so this is a WebLogic-only problem. This line was recorded over and over (one for each hit):

OPENUSAGE(INFO ): ==> Event sent to: //UsageTracking@openusage/ wci-analyticscollector| port=31314/ BYTEARR

To prevent these events from being logged, he suggests updating %PT_HOME%/settings/configuration.xml to reduce logging and turn off console logging:

<component name="analytics-communication:config" type="">
  <setting name="logging:level">
    <value xsi:type="xsd:string">WARN</value>
  </setting>
  <setting name="console-logging:enabled">
    <value xsi:type="xsd:boolean">false</value>
  </setting>
</component>



Security Reminder: Stay Vigilant!

Tuesday, July 13th, 2010

Government work can be a challenge with all the rules, regulations, and procedures that come with it. But there’s one thing I have to continually remind myself of when dealing with all that “way too much paperwork”: whether I’m administering a government web site, an ALUI portal, or any other web application, security can and MUST be taken seriously at all times.

So, consider this a friendly reminder – especially if you’re exposing your portal on the Internet: stay vigilant, and take all threats seriously. About 18 months ago, I got an alert in the middle of the night that we were out of drive space on a portal server at one of my semi-government clients. No big deal; it happens all the time. Only this time it was different. Overnight, our logs had exploded from roughly 20MB/day to 2GB/day: something was seriously wrong. The logs were so big they were hard to even open, but when I did finally crack them open, here’s a snippet of what I found:

Basically, there were GIGABYTES of these requests – someone was scanning our servers, rotating through different object IDs for different activity spaces, looking for incorrectly secured communities or other portal objects. They were making all kinds of semi-random requests with different IDs a couple of times a second.
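A scan like this is easy to spot if you look for it: a single source IP generating orders of magnitude more requests than everyone else. Here’s a hedged sketch of that kind of check, assuming a log format with the client IP as the first token on each line (the IPs and paths below are made up):

```python
# Sketch: count requests per source IP and flag anything far above the
# norm -- the signature of an unsophisticated scanner like this one.
from collections import Counter

def top_talkers(log_lines, threshold):
    """Return (ip, count) pairs for IPs at or above the request threshold."""
    counts = Counter(line.split(" ", 1)[0] for line in log_lines if line.strip())
    return [(ip, n) for ip, n in counts.most_common() if n >= threshold]

# Simulated log: one scanner hammering object IDs, one normal visitor.
lines = ["10.0.0.5 GET /portal/server.pt?space=Opener&id=%d" % i for i in range(500)]
lines += ["192.168.1.20 GET /portal/server.pt"]
print(top_talkers(lines, threshold=100))
```

Run something like this on a schedule and you won’t need a lucky disk space alert to catch the next one.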

It turned out that these particular baddies weren’t that sophisticated: they were making no effort to conceal their source IPs through some sort of distributed attack, and their algorithm clearly didn’t demonstrate a deep knowledge of how portal URLs are constructed.  And honestly, we were lucky for even finding this attack in the first place because at the time we didn’t regularly audit the logs, and only caught it because of that benign disk space warning.

In the end, we blocked the entire subnet (from China, a notorious hacker hangout), and the attacks stopped. We should have reported the attempted breach, and I certainly would if it happened again, but I’m sharing this story with a single moral: no matter how “little” you think your site may be, or how much you think “no one cares about my little corner of the internet”, the bad guys are out there, and they don’t discriminate when they’re looking for victims.

So, take a minute today to check your security settings one more time, and keep an eye on those log files for anything suspicious!