A. Web Logs from Itinfo

The itinfo web pages have the MIT counter url in them, but that counter truncates anything after the '?' in the url, and that is where the identifying page information is.  So we can't use the counter info to get topic info.  Happily, Itinfo.mit.edu runs its own Apache web server, and thus generates httpd log files.  We scissor these apart into monthly chunks and run them through a web log analyzer to get hits by url.  The hits-by-url data is fed to a topics-from-urls spreadsheet engine that assigns a topic keyword to each url and then totals the hits per keyword.  The urls we look at are only from users with an 18.*.*.* address -- I don't want to have to weed out the lurkers from overseas and the search engines, which generate huge numbers of hits.  By keeping the data set campus-only we keep year-to-year comparisons as meaningful as possible.
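
The tagging step itself is just a substring-keyword lookup plus a tally.  Here is a minimal sketch of the idea in Python, assuming a two-column url,hits CSV; the keyword table and file name below are illustrative only -- the real mapping lives in the topics-from-urls spreadsheet engine.

    import csv
    from collections import defaultdict

    # Illustrative url-fragment -> topic keyword table; the real mapping
    # lives in the topics-from-urls spreadsheet engine.
    TOPIC_KEYWORDS = {
        "/backup/": "backup",
        "/email/": "email",
        "/network/": "network",
    }

    def topic_for_url(url):
        """Return the first matching topic keyword, or 'other'."""
        for fragment, topic in TOPIC_KEYWORDS.items():
            if fragment in url:
                return topic
        return "other"

    def hits_per_topic(hits_by_url_csv):
        """Total hits per topic from a url,hits CSV (assumed layout)."""
        totals = defaultdict(int)
        with open(hits_by_url_csv, newline="") as f:
            for row in csv.reader(f):
                if len(row) < 2 or not row[1].strip().isdigit():
                    continue  # skip header or malformed rows
                totals[topic_for_url(row[0])] += int(row[1])
        return dict(totals)

    print(hits_per_topic("itinfo-2006-07-hits-by-url.csv"))  # illustrative file name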

  1. Use SecureCRT to telnet to itinfo.mit.edu; log in as 'root'.
  2. cd /var/log/httpd
  3. grep Jul/2006 access_log > itinfo-2006-07.txt    # (this is by way of example obviously)
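
Step 3 is just grep, but if we ever want to do the monthly split and the 18.* address filter in one pass (the filter is otherwise applied later in WebLog Expert), here is a minimal sketch in Python; the file names and month string are the ones from the example above.

    def extract_month(access_log, month_tag, out_file, campus_only=True):
        """Copy the lines for one month (e.g. 'Jul/2006') into out_file.

        Apache common/combined log lines begin with the client address,
        so a startswith('18.') check keeps campus (MITnet) traffic only.
        """
        with open(access_log, errors="replace") as src, open(out_file, "w") as dst:
            for line in src:
                if month_tag not in line:
                    continue
                if campus_only and not line.startswith("18."):
                    continue
                dst.write(line)

    extract_month("/var/log/httpd/access_log", "Jul/2006", "itinfo-2006-07.txt")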

Copies of these log files are just left on the server, since disk space there doesn't seem to be an issue.  (Log files are running about 100 MB each right now.  Compression should probably be applied to the older ones; see the sketch below.)
Transfer these to the PC where the web log analyzer is, using SecureFX.  The analyzer we currently use is WebLog Expert.
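
For the compression note above, a minimal sketch of gzipping the older monthly chunk files in place on the server; the directory, file-name pattern, and age cutoff are illustrative assumptions.

    import gzip
    import os
    import shutil
    import time

    def compress_old_chunks(log_dir="/var/log/httpd", max_age_days=90):
        """gzip monthly chunk files older than the cutoff, then drop the originals."""
        cutoff = time.time() - max_age_days * 86400
        for name in os.listdir(log_dir):
            if not (name.startswith("itinfo-") and name.endswith(".txt")):
                continue
            path = os.path.join(log_dir, name)
            if os.path.getmtime(path) > cutoff:
                continue
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)

    compress_old_chunks()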

WebLog Expert is set up to look at log files in this directory:

  • c:\projects\dashboard\publishing\itinfo-web-logs\fy2007

The log file analyzing script has some important settings:

  • look at addresses in 18.*.*.* only.
  • show 10000 pages in the "most popular" section of the report, and sort them from most to least.  (10000 may be the max allowed)
  • export the report in CSV format, for ready inclusion in the topic-tagging spreadsheet engine.