Apache Log Parsing
March 2nd Note: See Log Parsing III for the conclusion to this topic.
***
Regular expressions (regexes) are great for manipulating character strings. I’ve used them for years, but I haven’t gone beyond “beginner” to “intermediate” proficiency, and I admire scripters who’ve mastered complex regular expressions.
***
After developing IPhone Cafe, I wanted to examine traffic patterns using the web server access logs provided by Yahoo, which runs on Apache. Anyone familiar with Apache knows its logs pack lots of visitor information into each line. Examining traffic patterns means parsing, or “unpacking”, those lines. This is a perfect job for regular expressions.
***
Chapter 20 of Perl Cookbook, by Tom Christiansen and Nathan Torkington, presents a one-line regular expression to extract values from an access log. Perl Cookbook is a great reference loaded with nifty scripts. My copy is dog-eared because I routinely reach for it to find scripting solutions. However, I’ve never had success with the one-line regex approach for parsing web logs. Must be a mental block on my part.
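For comparison, here is a minimal sketch of what a single-pattern parse of a Combined Log Format line can look like. It is not the expression from Perl Cookbook; the pattern, the sample line (including the 192.0.2.10 address and the example.com referer) and the field names are assumptions made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample line in Apache Combined Log Format (not from a real log).
my $line = '192.0.2.10 - - [02/Mar/2008:10:15:32 -0800] '
         . '"GET /index.html HTTP/1.1" 200 5120 '
         . '"http://www.example.com/" "Mozilla/5.0 (iPhone; U; CPU like Mac OS X)"';

# One pattern for the whole line; the /x flag allows whitespace and comments.
my $combined = qr{
    ^(\S+)\s+                      # host
    \S+\s+\S+\s+                   # ident and authuser, skipped
    \[([^\]]+)\]\s+                # datetime
    "(\S+)\s+(\S+)\s+(\S+)"\s+     # method, url, protocol
    (\d{3})\s+                     # status
    (\d+|-)\s+                     # bytes, or a dash when nothing was sent
    "([^"]*)"\s+                   # referer
    "([^"]*)"                      # client, the user agent string
}x;

my @fields = qw(host datetime method url protocol status bytes referer client);
my %rec;

# In list context the match returns the nine captures, or nothing on failure.
if (my @values = $line =~ $combined) {
    @rec{@fields} = @values;
    print "$rec{host} requested $rec{url} ($rec{status})\n";
}
```

When a pattern like this fails, it fails as a whole, which gives no hint about which element caused the mismatch.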
***
Searching on Google, I found a page titled Parsing Apache Access Logs using PHP, which takes a different approach: a multi-line regex. Instead of a single one-line expression, the author of the PHP page uses multiple expressions to parse the log line from beginning to end. My ISP uses Apache’s Combined Log Format, which contains the following elements:
host, datetime, method, url, protocol, status, bytes, referer, client
The multi-line regex approach uses separate regular expressions to match each element, starting with the first element, ‘host’, and moving through the line until reaching the last element, ‘client’. As each element is matched, the “post-match”, i.e. everything to the right of the match, is stored and used in the next match.
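The step-by-step idea can be sketched roughly as follows. This is not the linked subroutine; `parse_log_line_sketch`, its patterns and the sample line are hypothetical, and the sketch assumes Perl 5.10 or later for `${^POSTMATCH}`.

```perl
use strict;
use warnings;

# Hypothetical helper, not the author's parseLogLine: match one element at a
# time and carry the post-match (the unparsed remainder) into the next step.
sub parse_log_line_sketch {
    my ($line, $rec) = @_;

    # Each step pairs element names with a pattern anchored at the start
    # of whatever text is still unparsed.
    my @steps = (
        [ [qw(host)],                qr/^(\S+)\s+\S+\s+\S+\s+/ ],  # also skips ident and authuser
        [ [qw(datetime)],            qr/^\[([^\]]+)\]\s+/ ],
        [ [qw(method url protocol)], qr/^"(\S+)\s+(\S+)\s+(\S+)"\s+/ ],
        [ [qw(status)],              qr/^(\d{3})\s+/ ],
        [ [qw(bytes)],               qr/^(\d+|-)\s+/ ],
        [ [qw(referer)],             qr/^"([^"]*)"\s+/ ],
        [ [qw(client)],              qr/^"([^"]*)"/ ],
    );

    my $rest = $line;
    for my $step (@steps) {
        my ($names, $pattern) = @$step;

        # Give up as soon as one element fails to match.
        my @values = $rest =~ /$pattern/p
            or return 0;

        @{$rec}{@$names} = @values;   # store the captured element(s)
        $rest = ${^POSTMATCH};        # everything to the right of the match
    }
    return 1;
}

# Hypothetical sample line (not from a real log).
my $line = '192.0.2.10 - - [02/Mar/2008:10:15:32 -0800] '
         . '"GET /index.html HTTP/1.1" 200 5120 '
         . '"http://www.example.com/" "Mozilla/5.0 (iPhone; U; CPU like Mac OS X)"';

my %rec;
if (parse_log_line_sketch($line, \%rec)) {
    print "IP Address = $rec{host}\n";
}
```

Because every element has its own small pattern, a `warn` naming the failing element could be placed just before the early return, which is the debugging advantage of this style.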
***
The multi-line approach isn’t nearly as elegant as the one-line regex, i.e. it’s a hack; however, for me, it’s easier to work with and debug, since it breaks one big “pattern match” into a number of smaller matches that can be tested individually.
The code sample in the link below isn’t pretty, but it works for me. It’s a Perl subroutine whose first parameter contains a single line from the web server access log. The second parameter is for output and is a reference to a hash defined in the main script [more on references]. Here’s a link to the code:
Here’s a usage example:
my %rec;
&parseLogLine($thisLine, \%rec);
print "IP Address = " . $rec{'host'} . "\n";
The parsing subroutine is part of a Perl script which downloads log files from the ISP, parses them, and produces summary reports showing visits to IPhone Cafe. The summary report contains totals by robots, desktop browsers, IPhone Safari browsers, etc.
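To give a feel for how such totals might be computed, here is a small sketch that classifies the ‘client’ element and counts each category. The `classify_client` function, its matching rules and the shortened client strings are illustrative assumptions, not the rules used in the actual report.

```perl
use strict;
use warnings;

# Hypothetical classifier: the substrings tested here are illustrative
# assumptions, not the rules used by the author's report.
sub classify_client {
    my ($client) = @_;
    return 'robot'         if $client =~ /bot|crawler|spider|slurp/i;
    return 'iphone_safari' if $client =~ /iPhone/ && $client =~ /Safari/;
    return 'desktop'       if $client =~ /Mozilla|Opera/;
    return 'other';
}

# Shortened, made-up client strings standing in for parsed log records.
my @clients = (
    'Mozilla/5.0 (iPhone; U; CPU like Mac OS X) Mobile Safari',
    'Mozilla/5.0 (compatible; Googlebot/2.1)',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/2.0',
    'Mozilla/5.0 (compatible; Yahoo! Slurp)',
);

# Count one visit per client string under its category.
my %totals;
$totals{ classify_client($_) }++ for @clients;

# Print the categories in alphabetical order with their counts.
printf "%-14s %d\n", $_, $totals{$_} for sort keys %totals;
```

The robot test runs first because many crawlers also identify themselves as Mozilla.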
***
Visit Log Parsing II for an example showing how web access log information can be used.
- Tony