"The causal discovery in web logs" - problem and dataset


In our view an ideal causal problem might be one that has the following characteristics:

- It uses real (not simulated and not re-simulated) data;

- The real model has causal structure;

- The ground truth concerning the causal graph is beyond any doubt;

- It is possible to manipulate the variables and have an ultimate check of both ground truth and the proposed causal models.


What is given? Real data:

The anonymized logs of a web server.

For each day, the number of hits (requests) to each of the overall most-requested pages is available.


Input format:

A matrix of 512 days by 20 pages containing integers: the number of visits to each page during that day.


Download URL for the input data: http://www.phobos.ro/data/l1ronc_sorted.txt

and for the corresponding calendar dates: http://www.phobos.ro/data/l1ronc_dates.txt

Right-click on each link to save the file to your disk. Please note that each visit to the website covers much less than a full day, so we do not expect much temporal (delayed) influence of the variables on each other to be visible in the data.
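
As a rough illustration, here is a minimal sketch in Python of loading the two files (assuming whitespace-separated counts with one row per day, and one date per line; the actual format may differ):

    import numpy as np

    # 512 rows (days) x 20 columns (pages) of integer visit counts,
    # assuming whitespace-separated values with one row per day.
    counts = np.loadtxt("l1ronc_sorted.txt", dtype=int)
    print(counts.shape)   # expected: (512, 20)

    # The corresponding calendar dates, assuming one date per line.
    with open("l1ronc_dates.txt") as f:
        dates = [line.strip() for line in f if line.strip()]
    print(len(dates))     # expected: 512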


What is requested? The causal structure:

The pages of a website have links to other pages of the same website, and so on.

When a user goes from page A to page B by following a link on page A that points to page B, we consider that this user's visit to page B has been caused by the visit to page A. Any page can be an entry point (the first page visited in a user session), and any page can also be an exit point (the last page visited).

We request that the causal graph be reconstructed from the given data; for details, see the output format description below.
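
To make this causal semantics concrete, here is a toy sketch (purely illustrative, with a made-up 3-page site and made-up link probabilities) of how per-session link following produces the kind of daily counts given in the data matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    # Made-up example: P[u, v] is the probability that a visit to page u
    # causes a follow-up visit to page v (rows sum to at most 1).
    P = np.array([[0.0, 0.6, 0.1],   # page 0 links to pages 1 and 2
                  [0.0, 0.0, 0.5],   # page 1 links to page 2
                  [0.2, 0.0, 0.0]])  # page 2 links back to page 0

    def simulate_day(n_sessions=200, max_steps=10):
        """Aggregate hit counts for one day, i.e. one row of the data matrix."""
        counts = np.zeros(3, dtype=int)
        for _ in range(n_sessions):
            page = rng.integers(3)        # any page can be an entry point
            counts[page] += 1
            for _ in range(max_steps):
                # Draw the caused follow-up visit, if any; index 3 = exit point.
                nxt = rng.choice(4, p=np.append(P[page], 1.0 - P[page].sum()))
                if nxt == 3:
                    break
                counts[nxt] += 1          # this visit was caused by `page`
                page = nxt
        return counts

    print(simulate_day())   # daily hit counts for the 3 toy pages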


Output format:

A matrix of 20 by 20 numbers having at position (u,v) the probability that a visit to page 'u' causes a visit to page 'v'.

Thus, 1 means 100% causal implication (deterministic: every visit to page 'u' causes a visit to page 'v'), while 0 means no causal influence of the visits to page 'u' on the visits to page 'v'.
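
As a format illustration only (the numbers come from a naive placeholder, not from a causal method, and the file name "prediction.txt" is just an assumption), producing and saving such a matrix could look like this:

    import numpy as np

    counts = np.loadtxt("l1ronc_sorted.txt", dtype=int)   # 512 days x 20 pages

    # Naive placeholder: clipped pairwise correlations of the daily counts,
    # used here only to obtain a 20 x 20 matrix of numbers in [0, 1].
    prediction = np.clip(np.corrcoef(counts, rowvar=False), 0.0, 1.0)
    np.fill_diagonal(prediction, 0.0)   # assumption: no self-causation reported

    # One row per page 'u', 20 numbers per row: entry (u, v) is the predicted
    # probability that a visit to page 'u' causes a visit to page 'v'.
    np.savetxt("prediction.txt", prediction, fmt="%.4f")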


What do we know already? The ground truth:

For each page visit, we know in principle the so-called "referrer", the page the user visited immediately before the current page.

This information will NOT be available to the participants, but will permit an objective evaluation of the proposed causal graphs.
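
For completeness, here is a sketch of how a ground-truth matrix could be derived from such referrer records, assuming hypothetical per-visit pairs (referrer_page, visited_page), with referrer_page set to None for entry-point visits (again, this data is not given to participants):

    from collections import Counter

    def transition_probabilities(visits, n_pages=20):
        """visits: iterable of (referrer_page, visited_page) index pairs,
        with referrer_page = None for entry-point visits."""
        caused = Counter()   # (u, v) -> visits to v referred by page u
        total = Counter()    # u -> total visits to page u
        for referrer, page in visits:
            total[page] += 1
            if referrer is not None:
                caused[(referrer, page)] += 1
        matrix = [[0.0] * n_pages for _ in range(n_pages)]
        for (u, v), c in caused.items():
            if total[u]:
                # fraction of visits to u followed by a referred visit to v
                matrix[u][v] = c / total[u]
        return matrix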


Is this problem useful?

One might object that having the referrer and IP information obviates any need to apply causal methods to infer causal dependencies between pages.

But there is a huge trend these days towards privacy awareness. Knowledgeable users do all they can to protect their anonymity by simple means, such as disabling cookies, instructing their browsers not to report the previously visited page, and so on.

On the server side, there is a similar trend, as companies face lawsuits and wish they had anonymized the details of the visits.

In other words, storing IP and referrer information (even when it is available) is considered inadvisable.

This means that future web logs will most probably lack this information. Online advertisers and content owners will still face the same problem they have today: inferring the model in order to know what to change to maximize their revenue. It is not unthinkable that new rules on storing such privacy-related data could suddenly make this everybody's problem.



Evaluation:

As we do have the ground truth, we will compute the correlation between the submitted arc strengths and the transition probabilities measured on a hold-out set of the same size.

The contestant achieving the highest correlation will win.
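
As an illustration of the criterion (the exact protocol is up to the organizers), the score could be computed roughly as follows, where "ground_truth.txt" stands for the organizers' hold-out measurements and is not available to participants:

    import numpy as np

    prediction = np.loadtxt("prediction.txt")       # submitted 20 x 20 matrix
    ground_truth = np.loadtxt("ground_truth.txt")   # measured transition probabilities

    # Pearson correlation between the two matrices, compared entry by entry.
    score = np.corrcoef(prediction.ravel(), ground_truth.ravel())[0, 1]
    print("correlation score:", score)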