Bug 4034 - Central repository for logs
Summary: Central repository for logs
Status: NEW
Alias: None
Product: Infrastructure
Classification: Unclassified
Component: Others
Version: unspecified
Hardware: All
OS: Linux
Priority: Normal
Severity: normal
Target Milestone: ---
Assignee: Sysadmin Team
QA Contact:
URL:
Whiteboard:
Keywords: Atelier
Depends on:
Blocks: 2330
Reported: 2012-01-05 17:44 CET by Romain d'Alverny
Modified: 2015-04-21 19:07 CEST
CC List: 4 users

See Also:
Source RPM:
CVE:
Status comment:


Attachments

Description Romain d'Alverny 2012-01-05 17:44:33 CET
I would like us to gather and publish our http logs somewhere so they can be fetched and reused from a central place (for specific stats on my part, but others may find them of interest as well).

This is particularly useful for www.m.o and releases.m.o web sites for the time being.

Proposal: a data.mageia.org website, also providing rsync access (under data.mageia.org/logs/httpd, for instance).
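
For instance, once published, the logs could be pulled with something like the following (a sketch only: the rsync module name and directory layout are assumptions, nothing is set up yet):

  # hypothetical read-only fetch of the published httpd logs
  rsync -av rsync://data.mageia.org/logs/httpd/ ./mageia-httpd-logs/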

Still, this should be kept under limited access as long as:
 * we have not updated our privacy policy regarding IP addresses,
 * and/or we have not replaced IP addresses with more global markers (country, location).

Bruno Mahe (bmahe) previously told me about the possibility of using Flume to collect Apache logs directly into a Hadoop instance (and to allow computing on the logs afterwards). That could be nice, but as a second step.
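
For reference, a Flume agent along those lines might look like the sketch below (agent, source and sink names, the log path and the HDFS URL are all made-up placeholders; this only illustrates the idea, it is not a proposed setup):

  # hypothetical Flume agent tailing an Apache access log into HDFS
  agent.sources  = apache
  agent.channels = mem
  agent.sinks    = hadoop

  agent.sources.apache.type     = exec
  agent.sources.apache.command  = tail -F /var/log/httpd/access_log
  agent.sources.apache.channels = mem

  agent.channels.mem.type     = memory
  agent.channels.mem.capacity = 10000

  agent.sinks.hadoop.type      = hdfs
  agent.sinks.hadoop.hdfs.path = hdfs://namenode.example/logs/httpd
  agent.sinks.hadoop.channel   = mem
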
Romain d'Alverny 2012-01-05 17:45:10 CET

Assignee: sysadmin-bugs => mageia

Bruno Mahe 2012-04-24 00:24:26 CEST

CC: (none) => bruno.mahe

Romain d'Alverny 2012-05-24 23:12:42 CEST

Blocks: (none) => 2330

Romain d'Alverny 2012-07-04 13:57:46 CEST

Keywords: (none) => Atelier

Comment 1 Romain d'Alverny 2012-07-30 01:51:28 CEST
Some progress on this. I now have a set of scripts to extract www.m.o downloads data and releases.m.o pings data, so we have an idea of the trends in downloads and installed desktop systems out there.

Those scripts do a daily (N-1) log extraction, and either remove the IP address or resolve it to a country/city level, so the resulting logs are safe to make public.
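
As a rough illustration of that filtering step (a sketch only, not the actual scripts: the log format, the GeoIP database path and the output format are assumptions):

  # Sketch: drop or coarsen the client IP of a combined-format access log line.
  # Assumes the IP is the first whitespace-separated field; the legacy MaxMind
  # GeoIP city database path below is hypothetical.
  import re

  try:
      import GeoIP  # legacy MaxMind bindings, optional
      _geo = GeoIP.open("/usr/share/GeoIP/GeoIPCity.dat", GeoIP.GEOIP_STANDARD)
  except ImportError:
      _geo = None

  _LINE = re.compile(r"^(\S+)(\s.*)$")

  def anonymize(line):
      m = _LINE.match(line)
      if not m:
          return line
      ip, rest = m.groups()
      if _geo is not None:
          rec = _geo.record_by_addr(ip)
          if rec:
              # keep only country/city level information
              return "%s/%s%s" % (rec.get("country_code") or "-",
                                  rec.get("city") or "-", rest)
      return "-" + rest  # no lookup available: drop the IP entirely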

I intend to release those filtered logs under the Open Database License (http://opendatacommons.org/licenses/odbl/) + a readme:
 - full log, compressed archive,
 - sample log data,
 - readme describing data extraction and log fields.


That will provide support for further log digests, graphics and reports on this data.

So I need this:
 - /var/www/vhosts/data.mageia.org/ mirroring svn.mageia.org/svn/web/data (not created yet)
 - data.mageia.org vhost, publishing /var/www/vhosts/data.mageia.org/public
 - a cron entry calling /var/www/vhosts/data.mageia.org/cron/daily.php every day at 6:00 (a sketch of the cron and vhost parts follows after this list)
 - the user calling this cron needs* ssh access to champagne.m.o, to read and grep the vhost logs under /var/log/httpd

* We may, of course, find another way to grab these log extracts, if they are on the same host, for instance; I'm just describing the current behaviour of my scripts.
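
As for the cron and vhost parts, a minimal sketch of what is being asked for (the dedicated user name and the PHP binary path are placeholders, nothing is decided here):

  # /etc/cron.d/data-mageia-org -- hypothetical system crontab entry,
  # runs the daily extraction at 6:00 as a dedicated user
  0 6 * * *  datauser  /usr/bin/php /var/www/vhosts/data.mageia.org/cron/daily.php

and, for the vhost, something as simple as:

  # hypothetical Apache vhost publishing the public/ directory
  <VirtualHost *:80>
      ServerName   data.mageia.org
      DocumentRoot /var/www/vhosts/data.mageia.org/public
  </VirtualHost>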

It may still require some manual access to reformat/clean up the logs built from cron (from daily extracts to monthly ones) and to add older logs built locally (the history of the past few months, for instance).
Filip Komar 2012-08-24 21:18:11 CEST

CC: (none) => filip.komar

Marja Van Waes 2015-04-21 19:07:59 CEST

CC: (none) => marja11
Assignee: mageia => sysadmin-bugs

