Big data has many interpretations and definitions across industries, academia, organizations, and individuals. It's a very broad and evolving field, and many organizations are adopting big data in some shape or form to supplement their existing analysis and business tools. Big data systems are primarily used to derive meaningful value and uncover hidden patterns from data. They are also implemented to supplement traditional workloads, achieving economies of scale that lower costs. The three key sources of big data are people, organizations, and sensors.
Big data systems are characterized by a few key attributes, namely volume, velocity, variety, and value; additional characteristics are veracity, validity, volatility, and visualization:
Value: The ultimate objective of big data is to generate business value for the organization through the analysis carried out in a big data project.
Volume of data: Data volumes can scale, as business needs dictate, to gigabytes, terabytes, petabytes, exabytes, zettabytes, and beyond. Each business has unique volume needs; for example, an Enterprise Resource Planning (ERP) system could run into gigabytes of data, while Internet of Things (IoT) and machine sensor data could run into petabytes.
Velocity of data: This is the speed at which data arrives and is accessed: batch jobs, periodic loads, near-real-time and real-time feeds from web server logs, streaming data from live video and multimedia, IoT sensor readings, weather forecasts, and so on. Consider the number of SMS messages, Facebook status updates, or credit card swipes an organization handles every minute of every day.
Variety of data: Variety is one of the key ingredients of big data. Data can take many forms: structured data, such as a sales invoice with a statement date, sales amount, store ID, store address, and so on, which fits easily into traditional RDBMS systems; semi-structured data, such as web or server logs, machine sensor data, and mobile device data; and unstructured data, such as social media content (Twitter feeds and Facebook posts), photos, audio, video, MRI images, and so on. Structured data conforms to a metadata model with defined rules, for example, dates that follow a specific pattern, whereas unstructured and semi-structured data have no predefined metadata model. One of the goals of big data is to use technology to extract business meaning from unstructured data (see the parsing sketch after this list).
Veracity of data: This is the trustworthiness of the data; it should be free of bias and abnormalities. Veracity means ensuring the quality and accuracy of data gathered from different source systems, and running preprocessing quality checks so that the data stays clean and no dirty data accumulates (a simple check is sketched after this list).
Validity of data: Data should be correct and valid for its intended use. Even in traditional data analytics, data validity is crucial to a program's success.
Volatility of data: This is the shelf life of the data, that is, how long it remains valid for its intended use. Stale data will not generate the intended results for any project or program.
Visualization: Visualization through pictures appeals to the human eye far more than raw data presented as metrics or in an Excel sheet. Being able to visualize trends and patterns, from the input data systems or streams through to the final analysis, is an asset to any big data program.
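To make the variety point concrete, here is a minimal Python sketch, with purely hypothetical record layouts and field names, that reads a structured sales record and a semi-structured server-log event; unstructured data such as free text or media has no fixed fields to parse and needs dedicated processing:

```python
import csv
import io
import json

# Structured: a sales invoice row with a fixed, known set of fields (hypothetical layout).
structured = "2017-03-15,1299.50,STORE-042,12 Main Street"
reader = csv.DictReader(
    io.StringIO(structured),
    fieldnames=["statement_date", "sales_amount", "store_id", "store_address"],
)
invoice = next(reader)
print(invoice["store_id"], float(invoice["sales_amount"]))

# Semi-structured: a JSON log event; keys can vary from record to record.
semi_structured = '{"ts": "2017-03-15T10:22:05Z", "device": "sensor-7", "temp_c": 81.4}'
event = json.loads(semi_structured)
print(event.get("device"), event.get("temp_c"))

# Unstructured: a tweet, photo, or video has no field-by-field schema;
# deriving business meaning from it is one of the goals of big data.
unstructured = "Loving the new store layout on Main Street!"
print(len(unstructured.split()), "tokens of free text")
```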
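Similarly, the veracity, validity, and volatility checks above can be pictured as a small preprocessing filter; the thresholds and field names below are illustrative assumptions, not a prescribed implementation:

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=30)   # illustrative shelf-life (volatility) threshold
NOW = datetime(2017, 3, 15)    # fixed "current" time so the example is reproducible

readings = [
    {"device": "sensor-7", "temp_c": 81.4, "ts": "2017-03-14"},
    {"device": "sensor-9", "temp_c": None, "ts": "2017-03-14"},    # missing value
    {"device": "sensor-3", "temp_c": 5000.0, "ts": "2017-03-13"},  # abnormal value
    {"device": "sensor-2", "temp_c": 20.1, "ts": "2016-01-01"},    # stale record
]

def is_clean(reading):
    if reading["temp_c"] is None:                 # validity: required field present
        return False
    if not -50.0 <= reading["temp_c"] <= 150.0:   # veracity: plausible range
        return False
    age = NOW - datetime.strptime(reading["ts"], "%Y-%m-%d")
    return age <= MAX_AGE                         # volatility: within shelf life

clean = [r for r in readings if is_clean(r)]
print(f"kept {len(clean)} of {len(readings)} readings")  # kept 1 of 4 readings
```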
In a traditional data warehouse, the data is structured, like RDBMS data, and the schema is modeled before the data is loaded into the database, as sketched below.
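As a rough illustration of that schema-first (schema-on-write) approach, here is a minimal Python sketch using an in-memory SQLite database; the table and column names are made up for the example:

```python
import sqlite3

# Schema-on-write: the table structure is modeled before any data is loaded.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        statement_date TEXT NOT NULL,
        sales_amount   REAL NOT NULL,
        store_id       TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO sales VALUES (?, ?, ?)",
             ("2017-03-15", 1299.50, "STORE-042"))

# A row that does not match the modeled schema is rejected at load time.
try:
    conn.execute("INSERT INTO sales VALUES (?, ?, ?)",
                 ("2017-03-16", None, "STORE-042"))
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```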
Data handling in big data systems, from ingestion through persistence, computation, and analytics, is quite different from that in traditional data warehouse systems, because data volumes, velocities, and formats diverge widely from source-system ingestion through to persistence. These systems require high availability and scalability from staging through persistence and analytics.
Scalability is achieved by pooling the cluster's memory, compute, and disk resources. New machines can be added to the cluster to meet varying workload demands or growing data volumes as the business expands. High availability is particularly important for critical systems performing real-time analytics, for production systems, and for staging and edge systems holding real-time data. A highly available cluster remains fault tolerant even in the event of hardware or software failures, ensuring uninterrupted access to data and systems (a minimal replication sketch follows this paragraph).
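As a rough illustration of how replication keeps a cluster available, here is a toy Python sketch, not any particular framework's API, in which each data block is placed on several nodes so that a single node failure never removes the only copy:

```python
import itertools

nodes = ["node1", "node2", "node3", "node4"]
replication_factor = 3
blocks = ["block-A", "block-B", "block-C", "block-D"]

# Round-robin placement of replicas across the cluster (purely illustrative).
placement = {}
node_cycle = itertools.cycle(nodes)
for block in blocks:
    placement[block] = [next(node_cycle) for _ in range(replication_factor)]

def readable(block, failed_nodes):
    """A block stays readable as long as one replica lives on a healthy node."""
    return any(node not in failed_nodes for node in placement[block])

failed = {"node2"}  # simulate a hardware failure on one machine
print(all(readable(b, failed) for b in blocks))  # True: every block is still reachable
```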
Another prominent big data technology is in-memory computing, which encompasses both software and hardware advancements to handle the volume, velocity, and variety of data in big data systems. We will discuss this in detail.