Real-time analytics using MongoDB
MongoDB is a top database choice for application development. Developers choose it for its flexible data model and the inherent scalability of a NoSQL database, features that let development teams iterate and pivot quickly and efficiently. MongoDB was not originally built for high-performance analytics, yet analytics is now a vital part of modern data applications. Developers have therefore devised ingenious solutions for running real-time analytical queries on data stored in MongoDB, using in-house solutions or third-party products.
Consider a concrete example: counting page views. Using MongoDB's upsert and $inc features, we can solve this problem efficiently. When an app server renders a page, it can send one or more updates to the database to update statistics.
We can do this efficiently for a few reasons. First, we send a single message to the server for the update. The message is an "upsert": if the object exists, we increment its counters; if it does not, the object is created. Second, we do not wait for a response; we simply send the operation and immediately return to the work at hand. Since the data is simply page counters, we do not need to wait and see whether the operation completes (we wouldn't report such an error to our web site's users anyway). Third, the special $inc operator lets us efficiently update an existing object without a much more expensive query/modify/update sequence.
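The pattern above can be sketched in Python. This is a minimal sketch, not production code: the collection and counter names are hypothetical, and the pymongo calls are shown in comments since they assume a running server.

```python
# Fire-and-forget page-view counter, built as plain operation documents.

def page_view_update(page_url):
    """Build the filter and update documents for a page-view upsert."""
    filt = {"_id": page_url}
    # $inc increments an existing counter, or sets it to 1 when the
    # upsert inserts a brand-new document.
    update = {"$inc": {"views": 1}}
    return filt, update

filt, update = page_view_update("/products/widget")

# With pymongo and a running server, the call would look like:
#   from pymongo import MongoClient, WriteConcern
#   coll = MongoClient().stats.get_collection(
#       "page_views", write_concern=WriteConcern(w=0))  # don't wait for an ack
#   coll.update_one(filt, update, upsert=True)
```

Using write concern w=0 matches the "don't wait for a response" point: the driver sends the operation and returns immediately.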
There are two main methods to perform analytics using MongoDB:
1. Replicating a MongoDB database into a SQL database:
Replicating data into a SQL database lets users keep MongoDB as their production database while analyzing the data in relational form. SQL can then be used on this relational version of the MongoDB data, making it easy to access and maintain the data and to combine data from multiple tables to perform insightful analysis.
SQL brings a lot of convenience when working with lengthy aggregations and complex joins. However, data replication is not as easy as it sounds: it requires an ETL job, which can be complicated because it moves data from a NoSQL environment into a SQL one.
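A toy version of that ETL step can be sketched as follows. Here sqlite3 stands in for the target SQL database, and the document fields ("user", "page", "views") are invented for illustration; a real job would read from a MongoDB cursor and handle nested and missing fields.

```python
# Flatten MongoDB-style documents into a relational table.
import sqlite3

docs = [  # documents as they might come from a MongoDB cursor
    {"_id": 1, "user": "alice", "page": "/home", "views": 3},
    {"_id": 2, "user": "bob", "page": "/about", "views": 1},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_views (id INTEGER, user TEXT, page TEXT, views INTEGER)"
)
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?, ?)",
    [(d["_id"], d["user"], d["page"], d["views"]) for d in docs],
)

# Once replicated, plain SQL aggregations become easy:
total = conn.execute("SELECT SUM(views) FROM page_views").fetchone()[0]
```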
2. Data Virtualization:
Data
Virtualization is a method that can be used for MongoDB real time analytics.
This method is the ideal solution to counter the limitations or replicating
databases. Various tools provide an interactive and user friendly interface.
These tools can be connected with MongoDB with ease and allow the users to
query or manipulate their data stored in MongoDB. Users can now develop
visualizations and perform real time analysis in just a few clicks making use of
smart and easy to use dashboards and customer facing reports. The advantage
here is that it doesn’t require any additional hardware or tedious ETL jobs to
analyze data.
One such tool is Apache Spark. MongoDB supports this popular framework, which is loved by data scientists, engineers, and analysts. MongoDB provides powerful large-scale analytics features that let users perform analytics within the platform, turning data into visualizations, along with a parallel query execution engine to boost performance. MongoDB also offers a SQL-based BI connector that lets users explore their MongoDB data with business intelligence tools such as Microsoft Power BI.
Advantages of MongoDB for real-time analytics:
1. Ad-hoc Querying:
MongoDB supports ad-hoc querying. It is very flexible and supports many different kinds of data.
2. Powerful Analytics:
MongoDB
supports real time analytics with a wide variety of data. It allows performing
analytics on secondary data, and even on text searches. It has strong
integrations with aggregation frameworks and the MapReduce paradigm.
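To make the aggregation framework concrete, here is a sketch of a pipeline as it would be passed to the driver. The collection and field names ("page", "views") are assumptions carried over from the page-counter example, not a real schema.

```python
# An aggregation pipeline: filter, group, sort, and limit, all
# executed inside MongoDB rather than in application code.
pipeline = [
    {"$match": {"page": {"$regex": "^/products"}}},          # product pages only
    {"$group": {"_id": "$page", "total": {"$sum": "$views"}}},  # sum per page
    {"$sort": {"total": -1}},                                # busiest first
    {"$limit": 10},                                          # top ten
]
# With pymongo: results = list(db.page_views.aggregate(pipeline))
```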
3. Speed:
MongoDB
being a document-oriented database, allows us to query data quickly. Its rich
indexing capabilities allow it to perform way faster than a relational database.
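A quick sketch of those indexing capabilities, again with hypothetical field names: pymongo accepts a list of (field, direction) pairs for a compound index.

```python
# Compound index specification: page ascending, day descending.
# 1 and -1 are pymongo's ASCENDING and DESCENDING direction values.
index_spec = [("page", 1), ("day", -1)]
# With pymongo: db.page_views.create_index(index_spec)
```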
4. Easy Setup:
MongoDB can be set up easily on any system.
5. Scalability:
NoSQL
databases are built to scale. MongoDB’s sharding capability allows it to
distribute data across datasets, servers etc. This gives it an unlimited growth
capability and a higher production rate than a relational database.
6. Data Adaptability:
A
NoSQL system like MongoDB supports a wide variety of data such as text data,
geospatial data, etc. It provides an ultra-flexible data model making it easier
to incorporate data and making adjustments for better performance.
7. Real-Time:
With
MongoDB, user can analyze data of any structure within the database and get
real-time results without costly data warehouse loads.
Disadvantages of MongoDB for analytics:
1. No Support for Joins:
MongoDB does not support joins the way relational databases do. Joins can be implemented in application code (in Java, for example), but this makes queries more complex and can hurt performance. (The aggregation framework's $lookup stage does provide a limited left outer join.)
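Since such joins live in application code, here is what one looks like in practice: a client-side hash join over two result sets, as an app might do after two separate MongoDB queries. The data is purely illustrative.

```python
# Two result sets, as if fetched by separate find() calls.
users = [{"_id": 1, "name": "alice"}, {"_id": 2, "name": "bob"}]
orders = [{"user_id": 1, "item": "book"}, {"user_id": 1, "item": "pen"},
          {"user_id": 2, "item": "mug"}]

# Index one side by the join key, then probe it (a hash join, in effect).
by_id = {u["_id"]: u for u in users}
joined = [{"name": by_id[o["user_id"]]["name"], "item": o["item"]}
          for o in orders]
```

Every such join is code the application team must write, test, and keep in sync with the schema, which is the complexity cost the point above refers to.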
2. Memory Constraints:
MongoDB can use memory inefficiently: it stores field names with every key-value pair in every document, so the same keys are duplicated across documents.
3. No Referential Integrity:
MongoDB does not enforce referential integrity: the defined and validated relationships between different pieces of data. Referential integrity helps keep information consistent and adds another layer of validation beneath the programmatic one.
Using MongoDB with Relational Databases:
Relational databases have been around for decades, and programmers have built countless applications, web-based and otherwise, on top of them. If the problem domain is relational, then using an RDBMS is an obvious choice: real-world entities are mapped into tables, and the relationships among the entities are maintained using more tables. But some parts of the problem domain may not fit a relational data model well, and we may need a data store that supports a flexible schema. In such scenarios, we can use a document-oriented storage solution such as MongoDB. The application code will have separate modules for accessing and manipulating the data in the RDBMS and the data in MongoDB.
Potential use cases:
1. Storing results of aggregation queries:
The results of expensive aggregation queries (count, group by, and so on) can be stored in a MongoDB database. This lets the application fetch the result quickly from MongoDB without re-running the query, until the result becomes stale (at which point the query is performed and the result stored again). Since the schema of a MongoDB collection is flexible, we don't need to know anything about the structure of the result data beforehand: the rows returned by the aggregation query can simply be stored as BSON (Binary JSON) documents.
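A minimal sketch of this caching pattern follows. The cache collection name, the query key, and the 10-minute staleness window are all assumptions for illustration.

```python
# Cache expensive SQL aggregation results as schemaless documents,
# recomputing only when the cached copy goes stale.
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=10)  # assumed freshness window

def is_stale(cached_doc, now=None):
    """True if there is no cached result or it is older than STALE_AFTER."""
    now = now or datetime.now(timezone.utc)
    return cached_doc is None or now - cached_doc["computed_at"] > STALE_AFTER

def cache_doc(query_key, rows):
    # Rows from the SQL aggregation go straight in as dicts;
    # no fixed schema is needed for the result structure.
    return {"_id": query_key, "rows": rows,
            "computed_at": datetime.now(timezone.utc)}

doc = cache_doc("sales_by_region", [{"region": "EU", "total": 42}])
# With pymongo: db.agg_cache.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```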
2. Data Archiving:
As the volume of data grows, queries and other operations on a relational table take increasingly more time. One solution is to partition the data into two tables: an online table holding the working dataset, and an archival table holding the old data. The size of the online table stays more or less the same, while the archival table keeps growing. The drawback of this approach is that when the schema of the online table changes, we have to apply the same changes to the archival table, which is a very slow operation given the volume of data. Also, if we drop one or more columns from the online table, we have to drop them from the archival table too, losing old data that might have been valuable. To get around this problem, we can use a MongoDB collection as the archive.
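A sketch of that archiving step, with invented table and column names: each SQL row becomes a schemaless document, so later schema changes to the online table never break the archive.

```python
# Turn SQL rows into archive documents for a MongoDB collection.
from datetime import datetime, timezone

def to_archive_doc(row, columns):
    """Convert one SQL row (a tuple) into a document for the archive."""
    doc = dict(zip(columns, row))
    doc["archived_at"] = datetime.now(timezone.utc)  # when it was archived
    return doc

doc = to_archive_doc((7, "alice", "2019-01-05"), ["id", "user", "created"])
# With pymongo: db.orders_archive.insert_one(doc)
# ...then delete the row from the online SQL table.
```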
3. Logging:
We can use MongoDB for logging events in an application. A relational database would work for the same purpose, but insert operations on the log table add overhead that slows the application's responses. We could also try simple file-based logging, but then we would have to write our own regular-expression-powered log-parsing code to analyze the log data and extract information from it.
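Logging into MongoDB might look like the sketch below; the event fields and the capped-collection size are assumptions. Storing events as structured documents is what removes the need for regex parsing later.

```python
# Build structured log-event documents instead of flat log lines.
from datetime import datetime, timezone

def log_event(level, message, **extra):
    event = {"ts": datetime.now(timezone.utc), "level": level,
             "message": message, **extra}
    # With pymongo, events could go to a capped collection, which keeps
    # a fixed amount of recent log data:
    #   db.create_collection("events", capped=True, size=100_000_000)
    #   db.events.insert_one(event)
    return event

evt = log_event("INFO", "user signed in", user_id=42)
```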
4. Storing entity metadata:
The application you build maps the entities of the domain into tables. The entities could be physical, real-world objects (users, products, and so on), or something virtual (blog posts, categories, and tags). You determine what pieces of information to store for each of these entities, and then you design the database schema and define the table structures. Metadata that varies from entity to entity and does not fit these fixed table structures can instead be kept in a MongoDB collection, whose flexible schema accommodates it naturally.
Defining the relational model:
The relational data model provides conceptual tools for designing the schema of a relational database. The model describes the data, the relationships between the data, the data's semantics, and constraints on the data in the relational database.
The relational model expresses data and the relationships among data in the form of tables. It is popular for its simplicity and for its ability to hide low-level implementation details from database developers and users. The relational model expresses the database as a set of relations. Each relation has columns and rows, formally called attributes and tuples respectively. Each tuple in a relation represents a real-world entity or relationship.
1. The database is a set of related relations.
2. Each relation has a name indicating what type of tuples it holds. For example, a relation named student indicates that it contains student entities.
3. Each relation has a set of attributes representing different types of values.
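The terms above can be made concrete with a tiny example, using sqlite3 and a hypothetical student relation: the columns are the attributes, and each row is a tuple representing one student entity.

```python
# A "student" relation: attributes (columns) and tuples (rows).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER, name TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(1, "Asha"), (2, "Ravi")])

# PRAGMA table_info lists the columns; index 1 of each row is the name.
attributes = [col[1] for col in conn.execute("PRAGMA table_info(student)")]
tuples = conn.execute("SELECT * FROM student").fetchall()
```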