15 Jun 2005
I’ve developed two implementations of a system to track hits on websites. The first attempt used to live at counter.jokke.dk, but has been taken offline as of September 2006. It was very limited in what statistics it recorded. The second never made it past beta testing, where it revealed great inefficiency in the data processing, and was only used by a few selected testers.
Where To From 404?
The above problems got me thinking about how this should really be done to provide useful statistics to those without access to Apache’s log files. So far I’ve been saving all the data in MySQL and then extracting the statistics from there for further processing. This can get somewhat inefficient, as noted above, when dealing with hundreds of thousands of records while most of the time you are interested only in statistics from the last week.
While building the second implementation, which I will call counter2 here, I tried to locate the culprit of said performance hit. It turned out to be PHP and the way it deals with huge arrays [hundreds if not thousands of items, each an array in itself]. PHP may be good at many things, but intensive calculations involving a lot of memory allocation are apparently not among them. How can we get around this limitation?
A New Approach
I’ve tried several log file analysis tools and the ones written in C are, not surprisingly, very fast. They do, however, require access to Apache’s log files. This is where counter3 enters.
By making counter3 middleware, so to speak, between the individual hits and the statistics generator, we will not only gain a dramatic speed increase but also benefit from the work of the smart people who built the analysis tool. Though it can be great fun, there is rarely a need to reinvent the wheel.
Not having direct access to Apache’s own log files obviously limits what information can be had. Apache notes, for instance, how big the requested file was and what the response code was. These, and statistics on files other than HTML, cannot be handled by counter3 [some info on the size of a document can be had with document.body.innerHTML.length, but I’m not sure how useful it would be given that images are not included].
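As a rough illustration of the middleware idea, here is a minimal sketch in Python of how a recorded hit could be written out as a line in Apache's Combined Log Format, so that an existing log analysis tool could read it. The post does not say which format counter3 would emit, so the choice of format is an assumption, as are the function name format_hit, its parameters, and the fixed GET method and HTTP/1.1 version. Since the response code and file size are not known to the counter, the sketch assumes a status of 200 and logs "-" for the size.

```python
from datetime import datetime, timezone


def format_hit(ip, path, referrer, user_agent, when):
    """Return one hit as an Apache Combined Log Format line (hypothetical helper)."""
    # Apache's timestamp layout: day/month/year:hour:minute:second zone.
    # %b is locale-dependent; this assumes the default C/English locale.
    timestamp = when.strftime("%d/%b/%Y:%H:%M:%S %z")

    def quoted(value):
        # Empty fields are logged as "-"; backslashes and quotes are escaped.
        if not value:
            return '"-"'
        return '"%s"' % value.replace("\\", "\\\\").replace('"', '\\"')

    # Assumption: every hit is treated as a GET over HTTP/1.1.
    request = quoted("GET %s HTTP/1.1" % path)

    # Assumption: status is unknown, so 200 is used; size is unknown, so "-".
    return "%s - - [%s] %s 200 - %s %s" % (
        ip, timestamp, request, quoted(referrer), quoted(user_agent))


if __name__ == "__main__":
    when = datetime(2005, 6, 15, 12, 0, 0, tzinfo=timezone.utc)
    print(format_hit("192.0.2.1", "/index.html", "", "Mozilla/5.0", when))
```

Appending such lines to a file would give the analysis tool something that looks like an ordinary access log, at the cost of the placeholder status and size fields carrying no real information.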
Development of counter3 has yet to begin, but I believe that the method outlined above is close to the optimal way of doing statistics on a webpage without access to the log files. Suggestions, comments and the like are more than welcome.
September 2006 Update: With the recent announcement of Google Analytics, the incentive to build a new version of my counter is limited. The thoughts above are still interesting, but I have no plans to start development.