Web Analytics using MongoDB
Logging with MongoDB:
The most basic requirement of web analytics is to log visits to the different pages of a web application. In this section we look at how to implement a logger module that records user requests to a web app in a MongoDB collection. For each request, the logger records the following:
1. The page being visited
2. The IP address of the user
3. The time of visit
4. The user agent string of the browser
5. The query parameters (if required)
6. The time taken to generate a response, in milliseconds
We can implement request logging by creating a collection in MongoDB and inserting the HTTP request data into it. A capped collection is well suited to this: it is a collection for which we specify a maximum size, and it always stays within that size.
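The fields listed above can be gathered into a single document per request. The following is a minimal sketch, not a definitive logger: the request object, its property names (path, ip, userAgent, query) and the collection name access_log are all assumptions made for illustration.

```javascript
// Build one log document from an incoming request.
// `req` is a hypothetical request object; its property names are assumptions.
function buildLogEntry(req, startedAt, finishedAt) {
  return {
    page: req.path,                          // 1. the page being visited
    ip: req.ip,                              // 2. the IP address of the user
    viewedAt: startedAt,                     // 3. the time of the visit
    userAgent: req.userAgent,                // 4. the browser's user agent string
    query: req.query || {},                  // 5. the query parameters, if any
    responseTimeMs: finishedAt - startedAt   // 6. time taken, in milliseconds
  };
}

// Example request that took 42 ms to answer.
const entry = buildLogEntry(
  { path: "/index", ip: "203.0.113.7", userAgent: "ExampleBrowser/1.0", query: { ref: "home" } },
  new Date("2024-01-01T10:00:00.000Z"),
  new Date("2024-01-01T10:00:00.042Z")
);
console.log(entry.responseTimeMs); // 42

// In the mongo shell, the entry could then be stored with:
//   db.access_log.insertOne(entry)
```

Keeping the document flat, with one field per logged item, makes it straightforward to group by page or filter by time later on.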
Capped Collections:
A capped collection is just like any other collection in MongoDB, except that we specify its size in bytes and it maintains this size by itself. When the collection would grow larger than the specified size, it automatically replaces the oldest documents with new ones. Unlike regular collections, which are created implicitly, a capped collection is created explicitly by calling createCollection(). A second parameter has to be passed to this method, specifying that the collection is capped and giving its size in bytes.
For example, in the “gfg” database we can use the createCollection() method to create a new capped collection named Student that holds at most 4 documents.
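A sketch of the call for the Student collection might look like this. The limit of 4 documents comes from the text; the byte size of 10000 is an assumed value, since the text does not give one. The guard around db is only there so the snippet can also be loaded outside the mongo shell.

```javascript
// Options for the capped collection.
// max: 4 is the document limit from the text.
// size is the byte limit; 10000 is an assumed value for illustration.
const studentOptions = { capped: true, size: 10000, max: 4 };

// `db` only exists inside the mongo shell (after `use gfg`),
// so the call is skipped when the snippet runs anywhere else.
if (typeof db !== "undefined") {
  db.createCollection("Student", studentOptions);
}
```

Once either limit is reached, whether the byte size or the document count, the oldest documents are replaced as new ones arrive.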
- Features:
1. Sorting in natural order:
Another notable feature of a capped collection is that it preserves natural ordering. Natural ordering is the database’s native way of ordering the documents in a collection. When we query a capped collection without sorting on a particular field, we get the documents in the order in which they were inserted. In a regular collection this is not guaranteed, because as documents are updated their sizes change and they may be moved around to fit into the collection. A capped collection, on the other hand, guarantees that documents are returned in insertion order.
We can update documents in a capped collection the same way we update documents in a regular collection. But there is a catch: the document being updated is not allowed to grow in size (otherwise the capped collection could not guarantee natural ordering). Also, we cannot delete individual documents from a capped collection. We can, however, use drop() to delete the collection entirely.
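The behaviour described above can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions: the class name CappedModel is hypothetical, the cap is counted in documents rather than bytes, and document size is approximated by the length of its JSON text. It is not how MongoDB implements capped collections.

```javascript
// A toy in-memory model of a capped collection (illustration only).
class CappedModel {
  constructor(maxDocs) {
    this.maxDocs = maxDocs;
    this.docs = [];
  }

  // Insert a document, evicting the oldest one once the cap is exceeded.
  insert(doc) {
    this.docs.push(doc);
    if (this.docs.length > this.maxDocs) {
      this.docs.shift();
    }
  }

  // Return a copy of the documents in natural (insertion) order.
  find() {
    return this.docs.slice();
  }

  // Replace the document with the given id, refusing updates that grow it.
  update(id, newDoc) {
    const i = this.docs.findIndex((d) => d.id === id);
    if (i === -1) return false;
    if (JSON.stringify(newDoc).length > JSON.stringify(this.docs[i]).length) {
      throw new Error("update would grow the document");
    }
    this.docs[i] = newDoc;
    return true;
  }
}

// Insert six documents into a model capped at four.
const students = new CappedModel(4);
for (let id = 1; id <= 6; id++) {
  students.insert({ id: id, name: "s" + id });
}
console.log(students.find().map((d) => d.id)); // [ 3, 4, 5, 6 ]
```

The two oldest documents have been dropped, and the remaining ones come back in the order they were inserted.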
- Convert a regular collection to a capped one
We can also turn a regular collection into a capped collection by using the following command:
> db.runCommand({'convertToCapped': 'r_coll', size: 1000000})
{ "ok" : 1 }
Extracting analytics data with MapReduce:
Generally, the log contains raw data about page visits, but we need to extract some meaningful information out of it. For example, it might be useful to know how many times a page has been viewed over a certain time period, or what the average response time for a page is. We can do so by applying MapReduce to the log.
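As a rough sketch of the two examples just mentioned, the functions below count views per page and compute the average response time. The field names page and responseTimeMs, the collection name access_log and the helper runLocally are assumptions for illustration; runLocally is only a small local stand-in so the functions can be tried without a server.

```javascript
// map: runs once per log document, with `this` bound to that document.
function mapPage() {
  emit(this.page, { count: 1, totalMs: this.responseTimeMs });
}

// reduce: merge all values emitted for one page into a value of the same shape.
function reducePage(key, values) {
  const out = { count: 0, totalMs: 0 };
  values.forEach(function (v) {
    out.count += v.count;
    out.totalMs += v.totalMs;
  });
  return out;
}

// finalize: derive the average response time from the reduced totals.
function finalizePage(key, value) {
  value.avgMs = value.totalMs / value.count;
  return value;
}

// Local stand-in for the server-side job (hypothetical helper).
function runLocally(docs) {
  const groups = new Map();
  // MongoDB supplies `emit` on the server; here we define our own.
  globalThis.emit = function (key, value) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach((doc) => mapPage.call(doc));
  const results = {};
  groups.forEach((values, key) => {
    results[key] = finalizePage(key, reducePage(key, values));
  });
  return results;
}

const sampleLog = [
  { page: "/index", responseTimeMs: 40 },
  { page: "/index", responseTimeMs: 60 },
  { page: "/about", responseTimeMs: 30 }
];
const stats = runLocally(sampleLog);
console.log(stats["/index"]); // { count: 2, totalMs: 100, avgMs: 50 }

// In the mongo shell, the same functions could be passed to:
//   db.access_log.mapReduce(mapPage, reducePage,
//     { out: { inline: 1 }, finalize: finalizePage })
```

The reduce step keeps a running total rather than an average because reduce may be applied to partial results, and averages of averages would be wrong. To restrict the job to a certain time period, a query option filtering on the visit time could be added to the mapReduce call.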
It is generally not a good idea to calculate analytics with MapReduce in real time, especially if we are running a website that enjoys heavy traffic. The log will be very large and constantly growing, so running MapReduce on it takes time, and how long depends on several factors, including the size of the log. If we ran the page view calculation job every time the analytics page was requested, the page would take a long time to load. Rather, we should run processes in the background that execute the MapReduce jobs and store the results in a collection, and have the analytics page simply read from that collection.
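The split between a background job and a read-only analytics page could be sketched as follows. The function names and the collection names access_log and page_views are assumptions; the out option tells mapReduce to write its results to a named collection. The database handle is passed in as a parameter so the functions can be exercised with any object that offers the same methods.

```javascript
// map: emit one view for the page of each log document.
function mapViews() {
  emit(this.page, 1);
}

// reduce: add up the view counts emitted for one page.
function reduceViews(key, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Background job: run MapReduce over the log and store the results
// in a separate collection (collection names are assumptions).
function refreshPageViews(database) {
  return database.access_log.mapReduce(mapViews, reduceViews, { out: "page_views" });
}

// Analytics page: read only the precomputed results, never the raw log.
function readPageViews(database) {
  return database.page_views.find().toArray();
}
```

In the mongo shell, refreshPageViews(db) would be called from a script that runs periodically in the background, while the analytics page calls only readPageViews(db), so its load time does not depend on the size of the log.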