Monday, January 19, 2015

Big Data and the Analytics Data Store

To begin at the beginning

Hold this thought: If Data Warehousing was Tesco then Big Data would be the “try something different”.
Since the publication of the article Aligning Big Data, which laid out a draft view of the DW 3.0 Information Supply Framework and placed Big Data within a larger framework, I have been asked on a number of occasions to go into a little more detail with regard to the Analytics Data Store (ADS) component. This is an initial response to those requests.
To recap, the overall architecture consists of three major components: Data Sources, Core Data Warehousing, and Core Statistics.
Data Sources – This element covers all the current sources, varieties and volumes of data available which may be used to support the processes of ‘challenge identification’, ‘option definition’ and decision making, including statistical analysis and scenario generation.
Core Data Warehousing – This is a suggested evolution path of the DW 2.0 model. It faithfully extends the Inmon paradigm to not only include unstructured and complex data but also the information and outcomes derived from statistical analysis performed outside of the Core Data Warehousing landscape.
Core Statistics – This element covers the core body of statistical competence, especially, but not only, with regard to evolving data volumes, data velocity, data quality and data variety.
Fig.1 – 3 components of the Information Supply Framework  
This piece will focus on the Core Statistics segment and in particular the Analytics Data Store, which is specifically designed to support professional statistical analysis and at the same time to support the speculative use of data.
Fig.2 – Core Statistics – Analytics Data Store  

The Analytics Data Store

Daniel Keys Moran once stated that “You can have data without information, but you cannot have information without data.” We’ll deal with that nonsense at another time.
The Analytics Data Store is the reference data store collection for the entire Core Statistics segment.
The following is a high-level diagram of the Analytics Data Store together with some of its major optional features:
Fig.3 – Inside the Analytics Data Store
Operating System Platform – Typically the operating system platform will be a flavour of UNIX (Linux or some other flavour).
The standard UNIX distributions support file manipulation commands that can be run in parallel to map and reduce data in files that can, in theory, be on the order of zebibytes.
Additionally, the Hadoop Distributed File System (HDFS) can be overlaid on the UNIX platform, leveraging the underlying UNIX primitives to gain access to and control over the underlying devices, whether that device is a file, disk, cluster, node or anything else. Files held in HDFS cannot, however, be manipulated using regular UNIX primitives unless something like FUSE is used.
Hadoop – Hadoop is an open source software framework (an organised collection of code) for distributed storage and processing of data sets on clusters of commodity computer hardware. The modules in Hadoop are designed with the idea that hardware failures are commonplace and should be automatically handled by the software. This is not, however, unique to Hadoop, as there are UNIX distributions that also fulfil these functions, and then some. However, the attraction of open source software running on commodity hardware cannot be dismissed lightly.
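The map and reduce idea mentioned above can be illustrated with a few lines of Python. This is only a single-process sketch of the pattern, not Hadoop's API: the function names (map_words, shuffle, reduce_counts, word_count) and the sample lines are assumptions made for illustration, and a real cluster would spread the map and reduce steps across many nodes.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return reduce_counts(shuffle(pairs))


if __name__ == "__main__":
    # Hypothetical sample input; in practice the lines would come from files.
    print(word_count(["big data big store", "data store"]))
```

Because each line is mapped independently, the map step is the part that can be farmed out to many machines; only the shuffle needs to bring matching keys together.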
Relational DBMS – This is the database model that most people who know anything about databases are familiar with. RDBMS is based on the relational data model. The relational data model provides an uncomplicated view of data to all users by representing data in two-dimensional tables of rows and columns. These tables are called relational tables. A relational database is a collection of relational tables. RDBMS is the data manager for relational databases.
Relational DBMS users interact with the databases using Structured Query Language (SQL), the industry-standard relational database management language, typically with some vendor extensions.
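As a minimal sketch of the rows-and-columns view, and of SQL as the interface to it, the following uses Python's built-in sqlite3 module with an in-memory database. The sales table, its columns and the total_by_region function are assumptions made purely for illustration.

```python
import sqlite3


def total_by_region(rows):
    """Load (region, amount) rows into a relational table and aggregate them with SQL."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) AS total "
            "FROM sales GROUP BY region ORDER BY total DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical sample data.
    print(total_by_region([("north", 10.0), ("south", 5.0), ("north", 2.5)]))
```

The same SELECT statement would run largely unchanged against most relational products, which is much of the appeal of the model.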
Document DBMS – This is a class of database management system oriented towards the management of unstructured, semi-structured and complexly structured documents, primarily digital textual documents. Examples of what might be labeled Document-oriented DBMS include Documentum EDMS and MongoDB.
Graph DBMS – Also known as Semantic Data Model Databases (back in the day). According to Wikipedia “a graph database is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data.” One of the features of some of the early Graph DBMS (my first contact with this technology was at Unisys in the late eighties, with a product called InfoExec) was that the query languages allowed structured queries to be stated in more business-like terms.
Key-Value DBMS – One can either view this type of database as an innovative reuse of the design of simple programmatic ‘collections’ (trust Microsoft to be the only ones to name a simple thing with a simple name) used to structure data, then applied to the realm of database management, or as a mental aberration invented by bodgers and hackers. At the end of the day a Key-Value DBMS simply provides a simple means of storing in-memory ‘associative arrays’ on disk. If there is more to it than that then please let me know.
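To make the ‘associative array on disk’ point concrete, here is a small sketch using Python's standard shelve module, which persists a dictionary-like object to a file. The store name kv_store and the round_trip helper are hypothetical, and the sketch ignores anything a product might add on top, such as networking or replication.

```python
import os
import shelve
import tempfile


def round_trip(items):
    """Write a dict to an on-disk key-value store, reopen it, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "kv_store")  # hypothetical store name

        # First session: put each key/value pair, then close the store.
        with shelve.open(path) as store:
            for key, value in items.items():
                store[key] = value

        # Second session: the data is read back from disk, not from memory.
        with shelve.open(path) as store:
            return {key: store[key] for key in store}


if __name__ == "__main__":
    print(round_trip({"customer:1": {"name": "Ann"}, "customer:2": {"name": "Bob"}}))
```

Note that the only operations are put by key and get by key; there is no query language and no schema, which is both the attraction and the limitation.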
Object DBMS – An object-oriented database management system stores information in the form of objects, as used in object-oriented programming.
Object-relational databases are a hybrid of the object-oriented and relational approaches. I have found use for object-relational in operational applications, but never in MIS reporting, OLAP, Data Warehousing, Business Intelligence or Statistics. Does anyone have an alternative perspective?
Column Oriented DBMS – This refers to how data is stored. Typically we now view data as being stored in rows or records, but it’s not the only way of storing data.
A Column Oriented DBMS stores data by column rather than by row: the values of each ‘column’ are held together, hence the name.
Examples of this type of database implementation range from Apache HBase, a distributed NoSQL column-oriented store built on top of HDFS, to EXASOL, currently the world’s fastest in-memory database management system.
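The difference between row and column storage described above can be sketched in plain Python. This is an illustration of the layout only, not of how HBase or EXASOL work internally; the to_columns and column_total functions and the sample records are assumptions.

```python
def to_columns(rows):
    """Pivot a list of row records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_total(columns, name):
    """Aggregate one column without touching the values of any other column."""
    return sum(columns[name])


if __name__ == "__main__":
    # Row-oriented view: each record keeps all of its fields together.
    rows = [
        {"id": 1, "region": "north", "amount": 10},
        {"id": 2, "region": "south", "amount": 5},
    ]
    # Column-oriented view: each field's values are kept together instead.
    columns = to_columns(rows)
    print(columns)
    print(column_total(columns, "amount"))
```

An analytical query that sums one field over many records only needs to read that field's list, which is the usual argument for column storage in reporting and statistical workloads.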
As you see, the Analytics Data Store is fast becoming a super-fantastic mix of artefacts, gadgets and toys which should satisfy everyone, from the most experienced and knowledgeable statisticians, through the data ‘creatives’, the data scientists and the data users, to the most game-oriented of data plumbers and punters.
The ADS is above all about quality over quantity, the now over the mañana, and the ‘just do it’ over the ‘can we?’.
But, also remember these words from Colin Powell: “Experts often possess more data than judgment.” So, be forewarned and forearmed.

Using the Analytics Data Store

What are the applications that the Analytics Data Store might be used to support?
Here is a non-exhaustive list (first described in the mid-eighties) of the potential applications:
Interpretation – Inferring situation descriptions from the analysis of a variety of data.
Prediction – Inferring likely consequences based on situational data.
Diagnosis – Inferring deviations and malfunctions from observables – from data.
Design – Analysing data and configuring objects under constraints.
Planning – Designing actions based on data feedback and analysis.
Monitoring – Comparing observations to known plan vulnerabilities.
Debugging – Prescribing remedies for malfunctions based on the analysis of data.
Repair – Devising and executing a plan to administer a prescribed remedy.
Instruction – Diagnosing, debugging and repairing behavioural patterns captured in data.
Control – Interpreting, predicting, repairing and monitoring systems behaviour.
Given the availability and quality of data to support the activities listed above, the Analytics Data Store can provide a sound source of data for a wide range of statistical analysis, forensic and speculative activities.
The Analytics Data Store is developed iteratively to support the data needs of a range of activities, from mainstream statistical analysis and formal data mining to creative and eclectic exercises in speculative analytics and non-traditional data correlation. This will ensure that business value can be assessed sooner rather than later.
The Analytics Data Store is essentially agnostic about technology implementation, and it has a clear mission and clear business objectives within an overall Information Supply Framework.
The choice of technology products is based on best-fit criteria. The use of technology should not be driven by the old commercial approach of ‘solutions in search of problems’, which failed so miserably time after time, but by the question ‘what are the most appropriate artefacts, resources and technologies to use in approaching this problem or testing this hypothesis?’

That’s all folks

The description of the Analytics Data Store has been necessarily terse. But I hope that it gives a flavour of where the ADS fits into an overall Information Supply Framework that extends the enterprise Data Warehousing paradigm (DW 2.0) without disrupting ‘business as usual’ or destructively distorting the purpose, architecture and management principles of Data Warehousing.
The Analytics Data Store and the much broader DW 3.0 Information Supply Framework are also aligned with the much longer term objective of addressing the knowledge, information and data needs of organisations.
What follows is a highly synthesised view of the long term, which I will leave for now without further comment.
Fig.4 – The Iniciativa Knowledge Management Pyramid  
Thank you so much for reading.
