I would like us to gather and publish our HTTP logs somewhere so that they can be fetched and reused from a central place (for specific stats on my side, but others may find them of interest as well). For the time being, this is particularly useful for the www.m.o and releases.m.o web sites.

Proposal: a data.mageia.org website that also provides rsync access (under data.mageia.org/logs/httpd, for instance).

Still, access should be kept limited as long as:
* we have not updated our privacy policy regarding IP addresses,
* and/or we have not replaced IP addresses with more global (country, location) markers.

Bruno Mahe (bmahe) previously told me about the possibility of using Flume to collect Apache logs directly into a Hadoop instance (and to allow computing on the logs afterwards). That could be nice, but as a second step.
Assignee: sysadmin-bugs => mageia
CC: (none) => bruno.mahe
Blocks: (none) => 2330
Keywords: (none) => Atelier
Some progress on this. I now have a set of scripts to extract www.m.o download data and releases.m.o ping data, so we have an idea of the trends in downloads and installed desktop systems out there.

Those scripts do a daily (N-1) log extraction and either remove the IP address or resolve it to a country/city level, so the resulting logs are safe to be made public.

I intend to release those filtered logs under the Open Database License (http://opendatacommons.org/licenses/odbl/), plus a readme:
- full log, compressed archive,
- sample log data,
- readme describing data extraction and log fields.

That will provide support for further digest logs, graphics and reports on this data.

So I need this:
- /var/www/vhosts/data.mageia.org/ mirroring svn.mageia.org/svn/web/data (not created yet)
- data.mageia.org vhost, publishing /var/www/vhosts/data.mageia.org/public
- a cron entry calling /var/www/vhosts/data.mageia.org/cron/daily.php every day at 6:00
- the user calling this cron needs
  * ssh access to champagne.m.o, to read and grep the vhosts logs under /var/log/httpd
  * we may, of course, find another way to grab these log extracts, if they are on the same host, for instance - I'm just describing the current behaviour of my scripts.

It may still require some manual access to reformat/clean up logs built from cron (from daily extracts to monthly ones) and to add older logs built locally (history of the past few months, for instance).
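For illustration only, the IP filtering step described above could look roughly like the sketch below. It is not the actual script (which is called from daily.php); it is written in Python and assumes that the client address is the first space-separated field of each line, as in Apache's common and combined log formats. The names anonymize_line, anonymize_lines and the resolve callback are hypothetical, and the country/city lookup itself is left to whatever database is chosen.

```python
from typing import Callable, Iterable, Iterator, Optional


def anonymize_line(line: str,
                   resolve: Optional[Callable[[str], str]] = None) -> Optional[str]:
    """Replace the leading client address of one access log line.

    Assumption: the client address is the first space-separated field.
    `resolve` is a hypothetical callback mapping an address to a coarse
    marker (a country code, for example). Without it, the address is
    replaced by "-", i.e. simply removed.
    """
    client, sep, rest = line.partition(" ")
    if not sep:
        # No field separator: the line is malformed, so drop it rather
        # than risk publishing a raw address.
        return None
    marker = resolve(client) if resolve is not None else "-"
    return f"{marker} {rest}"


def anonymize_lines(lines: Iterable[str],
                    resolve: Optional[Callable[[str], str]] = None) -> Iterator[str]:
    """Yield anonymized lines, silently skipping malformed ones."""
    for line in lines:
        cleaned = anonymize_line(line, resolve)
        if cleaned is not None:
            yield cleaned
```

With such a helper, the daily extract would be passed through anonymize_lines before being written under the public directory, so that no raw address ever reaches the published files.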
CC: (none) => filip.komar
CC: (none) => marja11
Assignee: mageia => sysadmin-bugs