To enable the new breed of analysts and data scientists to deliver high-quality analytics, IT teams need to augment conventional data quality processes with modern measures aimed at improving the usability of data.
As data grows, so do data problems. Data governance and quality processes are therefore now a key focus in ensuring that analytics data is trustworthy.
A key challenge is to articulate what quality really means to a company. The commonly cited dimensions of data quality include accuracy, consistency, timeliness and conformity, but there are many different lists of dimensions, and even common terms carry different meanings. Relying solely on a particular list without first establishing what you are looking to accomplish is a common mistake, yet one that is easily avoided.
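One practical way to build that foundation is to express each dimension you care about as a measurable check against real data. The sketch below is illustrative only; the record fields, code lists and time window are hypothetical assumptions, not a prescription for any particular platform.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical customer records; field names are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "country": "GB",
     "updated": datetime.now(timezone.utc)},
    {"id": 2, "email": None, "country": "UK",
     "updated": datetime.now(timezone.utc) - timedelta(days=40)},
]

def completeness(rows, field):
    """Completeness: share of rows where the field is populated."""
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)

def conformity(rows, field, allowed):
    """Conformity: share of rows whose value comes from an agreed code list."""
    return sum(1 for r in rows if r.get(field) in allowed) / len(rows)

def timeliness(rows, field, max_age_days):
    """Timeliness: share of rows refreshed within the agreed window."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return sum(1 for r in rows if r.get(field) and r[field] >= cutoff) / len(rows)

print(f"completeness(email): {completeness(records, 'email'):.0%}")
print(f"conformity(country): {conformity(records, 'country', {'GB', 'FR', 'DE'}):.0%}")
print(f"timeliness(updated): {timeliness(records, 'updated', 30):.0%}")
```

Whichever dimensions you choose, defining them as concrete, repeatable measurements is what turns a list of terms into something a team can act on.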
The data quality challenge grows as data volume and variety explode in Spark and Hadoop clusters, with data accumulated from a wide range of sources such as transaction systems, sensors and devices, clickstreams and unstructured data sets.
A well-designed modern big data platform mitigates these risks by providing an extensible framework for data quality management and provenance that goes beyond the traditional lists of data quality dimensions.
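"Extensible" here means that new checks can be added without rewriting the pipeline. The skeleton below is a minimal sketch of that idea under our own naming assumptions; it is not a reference to any specific product.

```python
from abc import ABC, abstractmethod

class QualityRule(ABC):
    """Base class for pluggable data quality rules."""
    name: str

    @abstractmethod
    def evaluate(self, rows: list[dict]) -> float:
        """Return a score between 0.0 and 1.0 for the supplied rows."""

class NotNullRule(QualityRule):
    """Example rule: a field must be populated."""
    def __init__(self, field):
        self.field = field
        self.name = f"not_null:{field}"

    def evaluate(self, rows):
        return sum(1 for r in rows if r.get(self.field) is not None) / max(len(rows), 1)

# A registry lets new rules be plugged in without changing ingestion code.
RULES: list[QualityRule] = [NotNullRule("email"), NotNullRule("country")]

def score(rows):
    return {rule.name: rule.evaluate(rows) for rule in RULES}

print(score([{"email": "a@example.com", "country": None}]))
```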
Here are some areas to focus on:
Easy access: Good, clean data needs to be at the data scientist's fingertips. Once data lands in a data lake, a tool such as Azure Data Catalogue provides easy access to data sets, with qualitative descriptions of what each contains, where it is located and who has permission to access it.
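To illustrate the kind of metadata a catalogue holds (this is not the Azure Data Catalogue API; the entry fields, names and lake path below are hypothetical), a registration might capture description, location and permitted readers, so data scientists can discover data sets by searching descriptions rather than asking around.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    """Minimal catalogue record: what a data set contains, where it lives, who may read it."""
    name: str
    description: str
    location: str             # e.g. a data lake path
    readers: set[str] = field(default_factory=set)

catalogue: dict[str, CatalogueEntry] = {}

def register(entry: CatalogueEntry) -> None:
    catalogue[entry.name] = entry

def discover(keyword: str) -> list[CatalogueEntry]:
    """Find data sets whose description mentions the keyword."""
    return [e for e in catalogue.values() if keyword.lower() in e.description.lower()]

register(CatalogueEntry(
    name="clickstream_raw",
    description="Raw website clickstream events, landed hourly",
    location="adl://lake/raw/clickstream/",   # hypothetical path
    readers={"data-science"},
))
print([e.name for e in discover("clickstream")])
```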
Conformance to Reference Models: A reference data model provides a logical description of data assets such as customers, products and suppliers. Unlike a conventional relational model, it abstracts the attributes and properties of each asset and its relationships, allowing mappings among different physical representations in SQL, XML, JSON and other formats. When gathering, processing and disseminating asset data across a set of source and target systems, conforming to a reference model delivers a consistent representation and interpretation of the data.
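The sketch below shows the mapping idea for a customer asset: two physical representations, a JSON document and a SQL-style row, resolve to the same canonical logical record. The source field names are assumptions for illustration; in practice each mapping is agreed per source system.

```python
import json
from dataclasses import dataclass

@dataclass
class Customer:
    """Canonical (logical) customer in the reference model."""
    customer_id: str
    name: str
    country: str

def from_json(payload: str) -> Customer:
    # Hypothetical source field names for a JSON feed.
    doc = json.loads(payload)
    return Customer(doc["custId"], doc["fullName"], doc["countryCode"])

def from_sql_row(row: tuple) -> Customer:
    # e.g. SELECT customer_id, display_name, country FROM crm.customers
    customer_id, display_name, country = row
    return Customer(str(customer_id), display_name, country)

a = from_json('{"custId": "42", "fullName": "Ada Lovelace", "countryCode": "GB"}')
b = from_sql_row((42, "Ada Lovelace", "GB"))
assert a == b  # both physical representations resolve to the same logical asset
```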
Synchronisation: In big data environments, best practice is to replicate data sets between platforms rather than generate one-off extracts for individual users. Well-managed replication keeps all replicas synchronised, delivering consistency and uniformity of shared data and making it usable sooner.
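A simple way to verify that replicas have not drifted apart is to compare row counts and an order-independent content fingerprint. This is a sketch of such a drift check, not a replication tool; the record shapes are hypothetical.

```python
import hashlib

def fingerprint(rows: list[dict]) -> str:
    """Order-independent content hash of a replica, for a quick drift check."""
    digests = sorted(
        hashlib.sha256(repr(sorted(r.items())).encode()).hexdigest() for r in rows
    )
    return hashlib.sha256("".join(digests).encode()).hexdigest()

def in_sync(replica_a: list[dict], replica_b: list[dict]) -> bool:
    """Replicas agree if they have the same number of rows and the same content."""
    return len(replica_a) == len(replica_b) and fingerprint(replica_a) == fingerprint(replica_b)

primary = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
mirror = [{"id": 2, "value": 20}, {"id": 1, "value": 10}]
print(in_sync(primary, mirror))  # True: same content, different order
```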
Data Provenance: Big data analytics is more precise when data is accurately identifiable across its entire life cycle, from creation through ingestion, integration, processing and production.
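One lightweight way to make data identifiable across that life cycle is to append a lineage record each time a data set passes through a stage. The helper and stage names below are illustrative assumptions, not a standard API.

```python
from datetime import datetime, timezone

def with_provenance(dataset_id, stage, inputs, history=None):
    """Append a lineage record describing which stage produced which data set from which inputs."""
    history = list(history or [])
    history.append({
        "dataset": dataset_id,
        "stage": stage,        # e.g. creation, ingestion, integration, processing, production
        "inputs": inputs,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })
    return history

lineage = with_provenance("orders_raw", "ingestion", inputs=["pos_feed"])
lineage = with_provenance("orders_clean", "processing", inputs=["orders_raw"], history=lineage)
for record in lineage:
    print(record["stage"], "->", record["dataset"])
```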
Data usability is a key mantra of Elastacloud when designing big data platforms. We have seen time and again that an early focus on cleansing, cataloguing and curating data assets drives significant productivity gains and faster insights from data science teams. This is Lean Analytics in action.