Friday, May 25, 2007

Analyze Apache Raw Access Logs in FoxPro

I've been increasingly interested in examining the access logs on my server. For the past 6 or 7 years, I've used some of the cool little tools provided by cPanel in their Web/FTP Statistics folder (like Webalizer and Awstats). While nice to look at, I've always felt a bit handicapped by their interface and general content. I have my own ideas on what I'd like to see and how I want to see it.

In this and subsequent articles, I'll build an Apache Raw Access Log analysis application that will (a) download the access logs once or twice a day, (b) dump the data into a database, (c) integrate this data with other information (perhaps some demographics info, error logs, site information, dates and times of blog posts, etc.), and (d) in the end, build a suite of reports and filters that can be used to identify things like abuse, spam, and positive traffic.

My hope is to do a good enough job with this that, in the end, it could end up in the Fox Open Source community if there's even a remote bit of interest. This should evolve in such a way that any access log from any server would be supported, but in the initial stages, I'll be using my Raw Access Logs from Apache. For those interested, here are the details of my server configuration:

Operating system: Linux
Apache version: 1.3.37 (Unix)
cPanel Build: 10.9.0-STABLE 9966
cPanel Pro: 1.0 (RC36)

The Log File



I started by looking at the raw files created by Awstats on my server (/tmp/awstats/). There is a ton of information in there, already parsed and used by the various free analysis tools. But what I really need is a raw file that contains a line for each 'hit' to my server (including 200, 404, and 500 responses). The Raw Access Logs are perfect (in cPanel, you can manage them by opening the Raw Log Manager). They contain a line for each hit, some referral information, HTTP status codes, etc.

The only problem with the file is its format (NCSA combined/XLF/ELF)! A single line looks like:

206.17.xxx.xxx - - [23/May/2007:07:59:21 -0500] "GET /Styles/color.css HTTP/1.1" 304 - "http://www.rabbitstewlounge.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; InfoPath.1)"
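Before getting to the FoxPro code, it helps to see how the fields of a combined-format line break apart. Here's a quick Python sketch (just a sanity check on my end, not part of the FoxPro project) that pulls out the same fields with a single regular expression; the group names mirror the columns of the table created below:

```python
import re

# One regex for an NCSA combined-format line. Each named group matches
# a field the FoxPro code below extracts with STREXTRACT/GETWORDNUM.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<datetime>[^\]]+) (?P<gmt_offset>[+-]\d{4})\] '
    r'"(?P<method>\S+) (?P<location>\S+) (?P<http_version>[^"]*)" '
    r'(?P<http_code>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('206.17.xxx.xxx - - [23/May/2007:07:59:21 -0500] '
        '"GET /Styles/color.css HTTP/1.1" 304 - '
        '"http://www.rabbitstewlounge.com/" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')

hit = LOG_PATTERN.match(line).groupdict()
print(hit['ip'], hit['http_code'], hit['bytes'])  # → 206.17.xxx.xxx 304 -
```

Note that the bytes field can be a literal "-" (as in the 304 above), which is why it gets special handling later on.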

Fortunately, FoxPro can handle this sort of thing quite easily. With a log file containing potentially thousands of lines each day, you can see how reading this file without parsing would be nightmarish (not to mention highly inefficient). So, this is where I started. I downloaded today's access file from my site and wrote the following code. As this project matures, I'll be sure to clean this up (hopefully with some readers' suggestions!) and provide the source in zip format. For now, follow along:

Get Started


First, download your raw access file. Load cPanel and click "Raw Access Logs". Select the raw access log for your domain and unzip it. In this example, I've also changed its name from "rabbitstewlounge.com" to "raw_log.txt".
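If you'd rather script the unzip-and-rename step than do it by hand, here's a hedged Python sketch. The archive name is just my domain's (yours will differ), and the first block only fabricates a stand-in file so the example runs on its own:

```python
import gzip
import os
import shutil

# Stand-in for the archive cPanel's Raw Log Manager hands you
# (hypothetical name -- yours will match your own domain).
with gzip.open("rabbitstewlounge.com.gz", "wb") as f:
    f.write(b"sample log line\n")

# Unzip the download and rename it to the name the parsing code expects.
with gzip.open("rabbitstewlounge.com.gz", "rb") as src, \
        open("raw_log.txt", "wb") as dst:
    shutil.copyfileobj(src, dst)
os.remove("rabbitstewlounge.com.gz")
```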

The Basic Code


I left out some environmental settings, such as SET SAFETY and SET MEMOWIDTH, for now. Eventually this will be part of a project that sets all these things later. Also, as this code is merely used to test the theory, I haven't included any real error checking. I'd be interested in hearing some ideas on how to tweak this for better speed and reliability.

I did consider writing this using low-level file functions (FOPEN, FGETS, FREAD, etc.), but in the end I felt that the format was too wacky and I was better off relying on STREXTRACT and SUBSTR instead.


CREATE DATABASE C:\webstats\webstats
OPEN DATABASE C:\webstats\webstats
CREATE TABLE C:\webstats\webstats.dbf (;
    stat_id integer AUTOINC,;
    ip_address character(15),;
    date_time datetime,;
    gmt_offset character(5),;
    access_method character(4),;
    location varchar(200),;
    http_version character(10),;
    http_code character(3),;
    bytes integer,;
    referrer varchar(200),;
    agent varchar(200),;
    source_file varchar(60))


lcData = FILETOSTR("C:\webstats\raw_log.txt")

FOR n = 1 TO MEMLINES(lcData)

    lcLine = MLINE(lcData,n)

    *-- IP address is first.
    lcIP = ALLTRIM(SUBSTR(lcLine,1,AT("- -",lcLine)-1))

    *-- Date, time, and GMT offset. Use STREXTRACT to get the data between
    *-- the square brackets. Then break the datetime string into separate
    *-- lines so MLINE can pull out the pieces of a proper datetime.
    lcDateTimeGMT = STRTRAN(STREXTRACT(lcLine,"[","]")," ",CHR(13))
    lcDT = STRTRAN(STRTRAN(MLINE(lcDateTimeGMT,1),"/",CHR(13)),":",CHR(13))
    IF !EMPTY(lcDT)
        ltDatetime = DATETIME(;
            VAL(MLINE(lcDT,3)),;
            get_monthnum(MLINE(lcDT,2)),;
            VAL(MLINE(lcDT,1)),;
            VAL(MLINE(lcDT,4)),;
            VAL(MLINE(lcDT,5)),;
            VAL(MLINE(lcDT,6)))
    ELSE
        ltDatetime = {}
    ENDIF

    lcGMTOffset = MLINE(lcDateTimeGMT,2)

    *-- Request data. Just to 'switch-it-up' a bit, I'm using GETWORDNUM
    *-- here because I need 3 chunks of data within the STREXTRACT, all
    *-- conveniently separated by a space.
    lcRequest = STREXTRACT(lcLine,'"','"')
    lcMethod = GETWORDNUM(lcRequest,1)
    lcLocation = GETWORDNUM(lcRequest,2)
    lcHTTPVer = GETWORDNUM(lcRequest,3)

    *-- HTTP code and bytes returned.
    lcHTTPCode = SUBSTR(lcLine, AT(lcRequest,lcLine)+LEN(lcRequest)+2,3)
    lnBytes = VAL(SUBSTR(lcLine,AT(" " + lcHTTPCode + ;
        " ",lcLine)+LEN(lcHTTPCode)+2,;
        AT('"',lcLine,3) - (AT(" " + lcHTTPCode + " ",lcLine)+;
        LEN(lcHTTPCode)+2)))

    *-- Referrer.
    lcReferer = STREXTRACT(lcLine,'"','"',3)

    *-- User agent.
    lcAgent = STREXTRACT(lcLine,'"','"',5)

    INSERT INTO webstats (ip_address,date_time,gmt_offset,access_method,;
        location,http_version,http_code,bytes,referrer,agent,source_file) ;
        VALUES (lcIP,ltDatetime,lcGMTOffset,lcMethod,lcLocation,;
        lcHTTPVer,lcHTTPCode,lnBytes,lcReferer,lcAgent,"raw_log.txt")

NEXT


FUNCTION get_monthnum
    LPARAMETERS tcAbbr

    *-- Map a three-letter month abbreviation to its number (1-12).
    DIMENSION aMonthList[12]
    aMonthList[1] = "JAN"
    aMonthList[2] = "FEB"
    aMonthList[3] = "MAR"
    aMonthList[4] = "APR"
    aMonthList[5] = "MAY"
    aMonthList[6] = "JUN"
    aMonthList[7] = "JUL"
    aMonthList[8] = "AUG"
    aMonthList[9] = "SEP"
    aMonthList[10] = "OCT"
    aMonthList[11] = "NOV"
    aMonthList[12] = "DEC"

    *-- ASCAN returns the element number of the match, which conveniently
    *-- is the month number (0 if the abbreviation isn't recognized).
    RETURN ASCAN(aMonthList,LEFT(UPPER(ALLTRIM(tcAbbr)),3))
ENDFUNC
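As a sanity check on get_monthnum and the DATETIME() assembly above, the same conversion is a one-liner with Python's strptime (shown here only to verify the logic, not as part of the project):

```python
from datetime import datetime

# The FoxPro code splits "23/May/2007:07:59:21" on "/" and ":" and maps
# "May" to 5 via get_monthnum; strptime's "%d/%b/%Y:%H:%M:%S" does the same.
stamp = datetime.strptime("23/May/2007:07:59:21", "%d/%b/%Y:%H:%M:%S")
print(stamp)  # → 2007-05-23 07:59:21
```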


In the next entry, I'll see about downloading this file automatically (it might take some ingenuity) and building in some logic to do so periodically. After that, I'll write some algorithms to identify potential abuse situations, spam, and increased traffic, setting up email notifications in the process. Stay tuned!

2 comments:

Garrett said...

It's always interesting to see how different people attack the same problem. You normalized the string delimiters: I coded around them.

Tod McKenna said...

Nice approach Garrett! Thank you for the link -- I also find it interesting to see how a problem can be solved from various angles. Since my last post, I've revised my code a bit more, and will be posting an update soon...