Tuesday, June 05, 2007

Sorting IP Addresses in VFP

In my series, "Business Intelligence Through Web Analysis", I talk about using web stats to build a web analytics solution in FoxPro. I have created a webstats table, and have populated it with some data from my access logs. I have a requirement to be able to sort my webstats table by IP address (among other things).

The format of an IP address doesn't lend itself to proper indexing (it is a 32-bit numeric address written as four numbers (octets) separated by periods). To accommodate my requirement, my first thought was to break the IP address into 4 separate numeric fields when importing. This would be good normalization. After all, the rule is that atomic data is better than concatenated data. But rules are meant to be broken. And, more importantly, does it make sense to atomize an IP address? An IP address is not really 4 separate numbers, the entire address is meaningful, even though each piece holds some significance.

So, I decided to keep the IP address intact.

Still needing a good way to sort and group, I thought about converting the IP address to its binary representation. MySQL has a function, INET_ATON (Adress TO Number), that does exactly that. FoxPro's CREATEBINARY function seems to do the trick at first. I can pass in the complete IP address like so:


But if you see these outputs, you'll realize that they're not great for indexing (which is the only reason I want to convert them). CREATEBINARY creates strings of varying length, which is not what I want for indexing.

Back to the drawing board. I could represent each octet as 0 and 1 like this:

lnOct = 122
lcBin = ''
DO WHILE lnOct > 0
lcBin = STR(lnOct%2, 1) + lcBin
lnOct = INT(lnOct/2)
? PADL(lcBin,8,'0')

But this seems terribly inefficient (I admit, no benchmarking was done, but this code is overkill!).

With all this in mind, I think I have a viable solution. It doesn't require a custom function, nor does it require storing data in separate fields. I know I can easily extract each octet of the IP address by using GETWORDNUM. And, knowing a little something about IP addresses, each octet has 2^8 (or 256, values 0 to 255) possible combinations. Therefore, an IP Address can be represented in a numeric form like so:

? VAL(GETWORDNUM(ip_address,1,'.')) * 256^3 + ;
VAL(GETWORDNUM(ip_address,2,'.')) * 256^2 + ;
VAL(GETWORDNUM(ip_address,3,'.')) * 256 + ;

So my index for webstats.dbf would be:

SELECT webstats
VAL(GETWORDNUM(ip_address,1,'.')) * 256^3 + ;
VAL(GETWORDNUM(ip_address,2,'.')) * 256^2 + ;
VAL(GETWORDNUM(ip_address,3,'.')) * 256 + ;
,8) TAG IP

Your comments and feedback are welcome. Is there a better solution?

No comments: