Passive Analytics

Passive Analytics (PA) is a system for analysing website usage.

Page Views

PA records every single Page View as a database row.

Page Views vs. Hits

PA is integrated into WebGUI's Asset content handler, which means that only resources served as WebGUI assets by mod_perl (as opposed to static content served by mod_proxy) are recorded as page views. It is important to understand that when a user views a web page, their browser performs many HTTP requests to download all of the components of the page: first the HTML for the page itself, and then all of the referenced content (images, JavaScript files, etc.). PA deliberately attempts to record only a single row per page request, corresponding to the asset located at the requested page URL.

Page View Data

Each row corresponding to a Page View contains the following information:

  • userId of the user who viewed the page
  • assetId of the page
  • sessionId of the current Session
  • timeStamp - the precise time the page view occurred
  • url (including any query parameters, e.g. /home?op=makePrintable)

You can export this table in CSV format by clicking on the Passive Analytics icon in the Admin Console and clicking on the "Export raw logs" link. One potential use of this raw page view data would be to count the number of page views per user.
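As a rough illustration of the per-user count mentioned above, here is a minimal Python sketch. It assumes the exported CSV has a header row whose column names match the fields listed above (userId, assetId, sessionId, timeStamp, url); the actual export layout may differ, and the sample rows are made up.

```python
import csv
import io
from collections import Counter


def page_views_per_user(csv_file):
    """Count page view rows per userId in an exported raw log.

    Assumes a header row containing a "userId" column (an assumption,
    not a documented export format).
    """
    reader = csv.DictReader(csv_file)
    return Counter(row["userId"] for row in reader)


# Hypothetical sample data standing in for the exported file.
sample = io.StringIO(
    "userId,assetId,sessionId,timeStamp,url\n"
    "u1,a1,s1,1000,/home\n"
    "u2,a1,s2,1005,/home\n"
    "u1,a1,s1,1010,/home?op=makePrintable\n"
)

counts = page_views_per_user(sample)
for user_id, views in counts.most_common():
    print(user_id, views)
```

For a real export you would pass an open file handle instead of the in-memory sample.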

Page Views vs. Apache Logs vs. Google Analytics

Web server access logs are the most common way that web usage data is collected and analysed. The WRE is configured to record per-site Apache access logs, and ships with the AWStats program for analysing this data. Another common way of analysing web usage is to include the Google Analytics JavaScript code on your website, which causes information about each page view to be sent to Google's servers, for later analysis through Google's fancy external reporting website. Both of these methods are useful and feature-rich. The goal of PA is not to compete with these approaches or reproduce any of their sophisticated features.

What PA has that these other methods do not is access to WebGUI-specific information at the time that Page Views are being recorded, namely: userId, assetId and sessionId. This allows you to analyse usage data in reference to user-specific data that these other methods do not have access to. By extending PA, you can arrange to have additional site-specific information recorded in the Page View table, such as what groups the user was a member of at the time the page was viewed. This allows for powerful usage analysis that is highly relevant to your website.

Disabled by Default

WebGUI ships with PA disabled by default, for the following reasons:

  • If your site receives a lot of hits, the PA hits table will grow to be very large.
  • PA adds an extra database write on every page view. This could potentially slow down your site.
  • Collecting detailed site usage data may be an invasion of your users' privacy. At a minimum you should check if you are required by law to inform users what data is being collected and how it will be used.

Time Spent

The most common use case for PA is analysing how long users spend on different parts of your site.

To start with though, all you know is that a user viewed page p1 at time t1, and then page p2 at time t2. If the time delta between these two page views (dt = t2 - t1) is 5 seconds, it's probably quite reasonable to conclude that the user spent 5 seconds looking at p1. If you were summing up the total time spent on p1, you would add dt to your running total. But what if dt is 5 minutes, or 5 hours? In this case it's much more likely that the user did something else between viewing p1 and p2, such as making themselves a cup of coffee, browsing Facebook, rebooting their computer, etc.

This is an inherent problem with website usage analysis - you can never really be sure how long someone spent viewing a page.

PA deals with this problem by letting you specify a cut-off threshold for the delta between two page views. For instance, if you set this threshold to 60 seconds, then any dt values less than 60 would be added to the running tally for p1, whereas anything greater than 60 would be ignored.

When performing analysis, PA first scans through all rows in the raw page view log and builds a second, temporary table called the delta log, which contains the following information:

  • userId (as per page view table)
  • assetId (as per page view table)
  • timeStamp (as per page view table)
  • url (as per page view table)
  • delta - the amount of time between this page view and the next page view

This table is smaller than the raw page view log, because any rows where the delta is bigger than the cut-off threshold are skipped over.
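The delta log construction described above can be sketched in Python as follows. This is only an illustration, not PA's actual implementation: it assumes that deltas are computed between consecutive page views of the same userId, that timeStamp is a number of seconds, that a delta exactly equal to the threshold is kept, and that a user's final page view (which has no following view) produces no row. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def build_delta_log(page_views, threshold):
    """Build delta rows from raw page views, skipping deltas above threshold.

    Assumption: views are paired per userId, in timeStamp order.
    """
    by_user = defaultdict(list)
    for view in page_views:
        by_user[view["userId"]].append(view)

    delta_log = []
    for views in by_user.values():
        views.sort(key=lambda v: v["timeStamp"])
        # Pair each view with the one that follows it for the same user.
        for current, following in zip(views, views[1:]):
            delta = following["timeStamp"] - current["timeStamp"]
            if delta > threshold:
                continue  # user probably did something else in between
            delta_log.append({
                "userId": current["userId"],
                "assetId": current["assetId"],
                "timeStamp": current["timeStamp"],
                "url": current["url"],
                "delta": delta,
            })
    return delta_log


# Hypothetical raw page views.
raw_log = [
    {"userId": "u1", "assetId": "a1", "timeStamp": 0, "url": "/home"},
    {"userId": "u1", "assetId": "a2", "timeStamp": 5, "url": "/docs"},
    {"userId": "u2", "assetId": "a1", "timeStamp": 10, "url": "/home"},
    {"userId": "u2", "assetId": "a3", "timeStamp": 40, "url": "/about"},
    {"userId": "u1", "assetId": "a1", "timeStamp": 305, "url": "/home"},
]

delta_log = build_delta_log(raw_log, threshold=60)
```

With a threshold of 60 seconds, the 300 second gap after u1's view of /docs is skipped, which is why the delta log ends up shorter than the raw log.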

Buckets

To analyse how long users spent on different parts of your website, you need to tell PA how to aggregate pages into categories (or buckets as PA calls them). To do this, you use the PA admin console interface to build up a list of regular expressions, one per bucket. PA scans through each row in the delta log, and tries to match the associated url against each of your bucket regular expressions in turn. The first one to match causes the row to be assigned to that bucket. If none of the regular expressions match, the entry is assigned to the "Other" bucket (which serves as a last resort catch-all).
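The first-match-wins bucket assignment can be sketched in Python like this. The bucket names and patterns are made-up examples, the use of an unanchored search (rather than a full match) is an assumption, and summing delta per bucket is one plausible way to aggregate, not necessarily what PA's bucket data contains.

```python
import re
from collections import defaultdict

# Hypothetical bucket list; order matters because the first match wins.
BUCKETS = [
    ("API docs", re.compile(r"^/docs/api")),
    ("Docs", re.compile(r"^/docs")),
    ("Home", re.compile(r"^/home")),
]


def assign_bucket(url, buckets):
    """Return the name of the first bucket whose pattern matches the url."""
    for name, pattern in buckets:
        if pattern.search(url):
            return name
    return "Other"  # last resort catch-all


def time_per_bucket(delta_log, buckets):
    """Sum the delta values of all delta log rows assigned to each bucket."""
    totals = defaultdict(int)
    for row in delta_log:
        totals[assign_bucket(row["url"], buckets)] += row["delta"]
    return dict(totals)


# Hypothetical delta log rows (only the fields used here).
rows = [
    {"url": "/docs/api/session", "delta": 20},
    {"url": "/docs/intro", "delta": 15},
    {"url": "/home?op=makePrintable", "delta": 5},
    {"url": "/about", "delta": 30},
    {"url": "/docs/api/asset", "delta": 10},
]

totals = time_per_bucket(rows, BUCKETS)
```

Because matching stops at the first hit, the more specific "API docs" pattern has to be listed before the general "Docs" pattern; reversing them would send every /docs/api page to "Docs".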

Analysing Data

Since PA has to scan through potentially millions of records, an asynchronous reporting system is used. First, visit the PA admin console section and enter your preferred cut-off threshold and bucket list. Then click "begin analysis". PA will start analysing your data, according to the settings you have defined. At this point, the PA admin console tells you that analysis is currently in progress.

When the PA workflow finishes analysing your data, it sends an email out to the user who clicked the "begin analysis" link. At this point, returning to the PA admin console page displays a message saying that analysis is finished, and provides links to download both the delta log and the bucket data as CSV files.

Since both the delta log and the bucket data are generated from the passive log (the raw page view log), they can both be deleted without loss of information. The PA admin console section lets you choose whether the delta log is automatically deleted when analysis finishes (to help conserve disk space). Each time analysis is run, the previous delta log and bucket data tables are cleared.

The passive log is never deleted, which means that later you can modify your cut-off threshold setting and/or bucket list and re-run the analysis.

Analysis Performance

PA was designed to handle large amounts of data. The speed of PA analysis will depend on the number of page views in your database and the number and complexity of the regular expressions you use. PA has been benchmarked processing 5 million records with a set of 20 regular expressions in 40 minutes (roughly 2,000 records per second). As with all benchmarks, you should take this with a grain of salt.

Future Extensions

It would be nice if PA displayed some graphs and statistics via the web interface rather than requiring you to download the CSV files and perform your analysis offline.

It's probably reasonable to assume that all page views involve at least a small amount of viewing time on the user's part, even if only for a second or two. For that reason, it would be nice if the PA admin console allowed you to specify a minimum value of dt to be used in cases where a page view is going to be skipped.
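To make the minimum-dt idea concrete, here is a small Python sketch of how such a setting might behave. This feature does not exist in PA as described above; the function name, the parameters, and the choice to substitute the minimum only when dt exceeds the threshold are all hypothetical.

```python
def effective_delta(delta, threshold, minimum):
    """Return delta if it is within the threshold, else the minimum value.

    Hypothetical behaviour: instead of skipping an over-threshold page
    view, credit it with a small fixed viewing time.
    """
    return delta if delta <= threshold else minimum


# Made-up dt values in seconds; 300 exceeds the 60 second threshold.
deltas = [5, 300, 30]

total_skipping = sum(d for d in deltas if d <= 60)
total_with_minimum = sum(effective_delta(d, 60, 2) for d in deltas)
```

In this sample the over-threshold view contributes 2 seconds instead of nothing, so the running total rises from 35 to 37 seconds.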

    © 2018 Plain Black Corporation | All Rights Reserved