A multicloud data lake is a centralized repository that stores raw, unprocessed data from multiple sources and clouds, giving businesses a single place from which to work with all of their data.
This approach allows for greater flexibility and scalability, as well as improved data governance and security. It's a cost-effective solution that enables you to store and manage large amounts of data from various sources, including on-premises systems, cloud storage services, and big data platforms.
To get started with a multicloud data lake, you'll need to choose a data lake management platform that can handle data from multiple sources and clouds. This platform should provide features such as data ingestion, processing, and storage, as well as security and governance capabilities.
Data ingestion is a critical component of a multicloud data lake, and it's essential to choose a platform that can handle high-volume data ingestion from various sources.
What Is a Platform?
A platform is simply a collection of integrated tools and services that work together to provide a complete solution for a specific problem. In the case of a cloud data platform, it's a software solution that enables enterprises to aggregate, store, process, and consume data in the cloud.
Cloud data platforms can be pre-integrated products or modular solutions built by integrating open-source and best-of-breed technologies. This flexibility is a major advantage, allowing organizations to choose the best approach for their needs.
Public cloud services like AWS, GCP, and Microsoft Azure provide scalable compute resources and proprietary cloud services for ingesting, processing, analyzing, and serving data. This is a cost-effective way for enterprises to manage their data.
SaaS solutions from independent vendors like Snowflake and ChaosSearch can integrate with public cloud storage to deliver strong cloud data platform capabilities, including data processing, data warehouse, and data lake functionality.
Key Components
A multicloud data lake reference architecture is made up of several key components, each playing a crucial role in storing, processing, and governing your data.
Data ingestion and storage are essential components, allowing you to bring in data from various sources and store it in a centralized location. Pipelines for stream-based ingestion, built on technologies such as Spark Structured Streaming, enable this process.
Data processing and continuous data engineering are also vital components, supporting concurrent writes and reads through transactional capabilities such as Qubole ACID. Airflow and a scheduler orchestrate data pipelines (a short sketch follows at the end of this section), while processing engines such as Spark, Hive, and Presto handle the processing itself.
Data access and consumption components provide users with various tools to interact with the data lake. Native applications like Jupyter notebooks and Workbench, along with connectivity options like REST APIs, SDKs, and drivers, make it easy to access and consume data.
Data governance components covering discoverability, security, and compliance are crucial for managing data within the data lake. Qubole's cost explorer helps with financial governance, Apache Ranger controls data access, and Qubole ACID handles data updates and deletes.
Infrastructure and operations components, such as automated cluster lifecycle management, workload-aware auto-scaling, and intelligent spot management, ensure the smooth operation of the data lake.
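To make the orchestration component concrete, here is a minimal sketch of an Airflow DAG that runs an ingest step and then a Spark processing step; the DAG id, schedule, and shell commands are illustrative placeholders, not part of any particular platform.

```python
# Minimal sketch of an Airflow DAG orchestrating a daily ingest-then-process
# pipeline; the DAG id, schedule, and commands are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_lake_pipeline",        # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw_data",
        bash_command="python ingest_raw_data.py",        # placeholder ingest script
    )
    process = BashOperator(
        task_id="process_with_spark",
        bash_command="spark-submit process_raw_data.py",  # placeholder Spark job
    )

    ingest >> process  # run the Spark job only after ingestion succeeds
```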
Ingestion
In the multicloud data lake reference architecture, the ingestion layer plays a crucial role in connecting source systems to the cloud data platform and ingesting data into cloud storage.
Data ingestion tools should connect to a variety of data sources and support both batch and real-time data ingestion.
The current best practice is to leave the data in its raw format to allow for analysis in different and novel ways without having to ingest it from the source again.
Sophisticated cloud data platform architectures support both batch and stream data processing: streaming enables real-time analytics use cases, while batching handles data sources that only support batch data access.
An Open Data Lake ingests data from various sources, including applications, databases, real-time streams, and data warehouses, and stores the data in its raw form or an open data format that is platform-independent.
The ingest capability supports both real-time stream processing and batch data ingestion, ensuring zero data loss with exactly-once or at-least-once write semantics.
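As an illustration of stream-based ingestion that lands raw data in an open format, here is a minimal PySpark Structured Streaming sketch; the Kafka broker, topic, and bucket paths are hypothetical, and checkpointing is what provides the delivery guarantees described above.

```python
# Minimal sketch: stream raw events from Kafka into cloud object storage.
# Broker, topic, and bucket paths are hypothetical; the checkpoint location
# plus the file sink is what gives the write path its delivery guarantees.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-ingest").getOrCreate()

raw_events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .load()
)

query = (
    raw_events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
    .writeStream
    .format("parquet")                                  # keep data in an open format
    .option("path", "s3a://my-data-lake/raw/events/")   # hypothetical bucket
    .option("checkpointLocation", "s3a://my-data-lake/checkpoints/events/")
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```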
Data is stored in a central repository that is capable of scaling cost-effectively without fixed capacity limits and is highly durable.
Storage
The storage layer is the foundation of a multicloud data lake reference architecture, and it's where you'll store all your data reliably and serve it efficiently.
Raw data lands in the data storage layer after passing through the data ingest layer, and public cloud vendors offer the most cost-effective cloud object storage, allowing you to store large volumes of data for long periods of time.
Cloud data storage is both scalable and highly reliable, with built-in failover and recovery capabilities from vendors like AWS.
You can easily grow or shrink your cloud data storage capacity as needed, making it a flexible solution for your data storage needs.
A tiered approach to data storage allows data engineers to route data into short-term or long-term storage as needed, supporting analytics and compliance use cases.
Short-term storage is used for data that will be analyzed in the immediate future, while long-term storage provides a lower-cost alternative for storing data that may not be analyzed right away.
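On AWS, for example, this kind of tiering can be expressed as an S3 lifecycle rule; the sketch below is a minimal example, and the bucket name, prefix, and transition window are assumptions made for illustration.

```python
# Sketch of a tiered-storage policy: after 90 days, move raw objects from
# standard (short-term) storage to Glacier (long-term, lower-cost) storage.
# Bucket name, prefix, and day count are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",                    # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-data",
                "Filter": {"Prefix": "raw/"},  # only applies to raw landing data
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"}
                ],
            }
        ]
    },
)
```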
The data storage layer contains separate object storage services for the data lake and the data warehouse side of the modern data lake, which can be combined into one physical instance of an object store if needed.
However, if your consumption layer and data pipelines will be putting different workloads on these two storage services, consider keeping them separate and installing them on different hardware.
External table functionality allows data warehouses and processing engines to read objects in the data lake as if they were SQL tables, enabling you to transform raw data before inserting it into the data warehouse.
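The snippet below sketches what this looks like with Spark SQL, registering raw Parquet objects as an external table and transforming them before writing to the warehouse zone; the paths, table name, and columns are assumptions.

```python
# Sketch: expose raw Parquet objects in the data lake as a SQL table, then
# transform them before loading into the warehouse zone. Paths, table and
# column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("external-tables").getOrCreate()

# Register the existing raw files as an external (unmanaged) table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS raw_events
    USING PARQUET
    LOCATION 's3a://my-data-lake/raw/events/'
""")

# Transform the raw data and write the cleaned result to the warehouse zone.
cleaned = spark.sql("""
    SELECT CAST(value AS STRING) AS payload, timestamp
    FROM raw_events
    WHERE value IS NOT NULL
""")
cleaned.write.mode("append").parquet("s3a://my-data-lake/warehouse/events_clean/")
```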
Most MLOps tools use a combination of an object store and a relational database, with models and datasets stored in the data lake and metrics and hyperparameters stored in the relational database.
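One common way this split looks in practice is an MLflow setup where a relational backend store holds parameters and metrics while artifacts land in object storage; the tracking URI, experiment name, and values below are placeholders.

```python
# Sketch of the object-store / relational-database split many MLOps tools use:
# metrics and hyperparameters go to a tracking server backed by a relational
# database, while the model artifact itself lands in object storage.
# Tracking URI, experiment name, and values are illustrative placeholders.
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # server backed by e.g. PostgreSQL
mlflow.set_experiment("churn-model")                    # hypothetical experiment

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)  # hyperparameters -> relational DB
    mlflow.log_metric("val_accuracy", 0.91)  # metrics -> relational DB
    mlflow.log_artifact("model.pkl")         # model file (saved locally first) -> object store
```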
Data Processing
The data processing layer is a crucial component of a multicloud data lake reference architecture. It contains the compute needed for all workloads supported by the modern data lake.
Compute comes in two varieties: processing engines for the data warehouse and clusters for distributed machine learning. The processing engines support the distributed execution of SQL commands against the data in data warehouse storage, while the clusters handle model training workloads.
Transformations that are part of the ingestion process may also need the compute power in the processing layer. This can include substantial extract, transform, and load (ETL) against the raw data during ingestion.
A medallion architecture or a star schema with dimensional tables may require substantial ETL during ingestion. This design choice affects the compute needs of the data warehouse.
The data warehouse within a modern data lake disaggregates compute from storage. This means multiple processing engines can exist for a single data warehouse data store.
A possible design for the processing layer is to set up one processing engine for each entity in the consumption layer. This can include separate clusters for business intelligence, data analytics, and data science.
Each processing engine would query the same data warehouse storage service, but since each team has its own dedicated cluster, the teams do not compete with each other for compute.
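As a hedged illustration of this pattern, the sketch below uses the Trino Python client (Trino being a Presto-compatible engine) to give two teams their own coordinators while both query the same tables; hostnames, catalog, schema, and the query itself are assumptions.

```python
# Sketch: two teams query the same data warehouse storage through their own
# dedicated Presto/Trino clusters, so they never compete for compute.
# Hostnames, catalog/schema names, and the query are illustrative assumptions.
import trino

def team_connection(coordinator_host: str) -> trino.dbapi.Connection:
    """Connect to a team's dedicated cluster; all clusters read the same tables."""
    return trino.dbapi.connect(
        host=coordinator_host,
        port=8080,
        user="analyst",
        catalog="hive",     # same catalog/storage behind every cluster
        schema="warehouse",
    )

bi_conn = team_connection("trino-bi.internal")       # business intelligence cluster
ds_conn = team_connection("trino-datasci.internal")  # data science cluster

cursor = bi_conn.cursor()
cursor.execute("SELECT count(*) FROM events_clean")
print(cursor.fetchone())
```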
Machine learning models, especially large language models, can be trained faster if training is done in a distributed fashion. The machine learning cluster supports distributed training.
Distributed training should be integrated with an MLOps tool for experiment tracking and checkpointing. This ensures that machine learning models are trained efficiently and effectively.
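The sketch below shows the general shape of distributed data-parallel training with periodic checkpoints written to data-lake-backed storage, assuming a PyTorch environment launched with torchrun; the toy model, random data, and checkpoint path are placeholders, and an MLOps tool would record the metrics logged on rank 0.

```python
# Sketch of distributed data-parallel training with checkpointing to data-lake-
# backed storage. The toy model, random data, and checkpoint path are
# placeholders. Launch with: torchrun --nproc_per_node=<gpus> train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train() -> None:
    dist.init_process_group(backend="nccl")  # torchrun sets rank/world size env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(128, 2).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(1000):
        x = torch.randn(32, 128, device=local_rank)       # placeholder batch
        y = torch.randint(0, 2, (32,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Only rank 0 checkpoints; the path points at data-lake-backed storage.
        if step % 200 == 0 and dist.get_rank() == 0:
            torch.save(model.module.state_dict(),
                       f"/mnt/datalake/checkpoints/step_{step}.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    train()
```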
Data Governance
Data Governance is crucial in a multicloud data lake reference architecture, especially when multiple teams start accessing data. This is because there's a need to exercise oversight for cost control, security, and compliance purposes.
Data governance helps prevent data sprawl, which can lead to wasted resources and increased costs. This is particularly important in a multicloud environment, where data is scattered across different platforms.
By implementing data governance, organizations can ensure that data is properly managed, secured, and compliant with regulations, which in turn lets them make well-informed decisions.
Governance
Data governance is essential for cost control, security, and compliance. Once multiple teams are accessing data, oversight is needed to prevent unauthorized access and to ensure data is being used efficiently.
Expanded data privacy regulations, such as GDPR and CCPA, have created new requirements around data deletion and erasure.
An Open Data Lake supports the ability to delete specific subsets of data without disrupting data consumption, making it easier to comply with regulations.
Non-proprietary ways to delete data are essential for ensuring data is erased in a secure and transparent manner.
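As one example of a non-proprietary approach, an open table format such as Delta Lake allows a targeted delete, say for a single user's erasure request, without rewriting the whole dataset or interrupting readers; the table path and predicate below are assumptions.

```python
# Sketch: delete a specific subset of records (e.g., one user's data for a GDPR
# erasure request) from an open-format table without disrupting consumers.
# Assumes the table is stored as Delta Lake; path and predicate are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("gdpr-delete")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events = DeltaTable.forPath(spark, "s3a://my-data-lake/warehouse/events_clean/")
events.delete("user_id = '12345'")  # removes only the matching rows
```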
Security
Data governance is not just about storing and managing data; it's also about ensuring the security and integrity of that data. Data in the lake should be encrypted at rest and in transit, and cloud providers offer services to do this using keys managed by either the cloud provider or the customer.
Data encryption is a must-have for any data lake, and it's not just about protecting against external threats, but also about ensuring that sensitive data is not accessible to unauthorized personnel. An Open Data Lake integrates with non-proprietary security tools such as Apache Ranger to enforce fine-grained data access control.
To ensure that data is only accessible to authorized personnel, an Identity and Access Management (IAM) solution is required. Both the Data Lake and the Data Warehouse must support an IAM solution that facilitates authentication and authorization.
Perimeter security for the data lake includes network security and access control, which can be achieved by mapping corporate identity infrastructure onto the permissions infrastructure of the cloud provider's resources and services. This includes using LDAP and/or Active Directory for authentication.
A Key Management Server (KMS) is also essential: it generates, distributes, and manages the cryptographic keys used for encryption and decryption, keeping data secure both at rest and in transit.
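A small sketch of encryption at rest on AWS: uploading an object with server-side encryption under a customer-managed KMS key. The bucket, key alias, and object name are placeholders; encryption in transit is handled by the HTTPS endpoint.

```python
# Sketch: write an object to the lake encrypted at rest with a customer-managed
# KMS key. Bucket name, key alias, and object key are illustrative placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/2024-01-01.parquet",
    Body=open("2024-01-01.parquet", "rb"),
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/data-lake-key",  # customer-managed key in KMS
)
```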
Agnostic
An Open Data Lake is cloud-agnostic, meaning it's portable across any cloud-native environment, including public and private clouds.
This agnostic nature frees up organizations to focus on building data applications without worrying about the underlying infrastructure.
With cloud-agnostic architecture, you can leverage the benefits of both public and private clouds from an economics, security, governance, and agility perspective.
This means you can choose the best cloud provider for your specific needs, without being locked into a single vendor.
By being cloud-agnostic, an Open Data Lake can provision, configure, monitor, and manage resources as needed, regardless of the cloud provider.
The Optional Semantic Layer
A semantic layer can be a valuable part of data governance, but it's not a necessity for every organization. It acts like a translator that bridges the gap between the language of the business and the technical terms used to describe data.
It sits between the processing layer and the consumption layer, helping both data professionals and business users find relevant data. Organizations with few data sources and well-structured feeds may not need it.
A data catalog is a simple form of a semantic layer, which includes the original data source location, schema, short description, and long description. A more robust semantic layer can provide security, privacy, and governance by incorporating policies, controls, and data quality rules.
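To make the idea concrete, a minimal catalog entry might carry just the fields described above; the dataclass below is purely illustrative and does not reflect any particular catalog product's schema.

```python
# Purely illustrative shape of a simple data catalog entry, mirroring the
# fields described above; no particular catalog product is implied.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    source_location: str      # original data source location
    schema: dict[str, str]    # column name -> type
    short_description: str
    long_description: str
    tags: list[str] = field(default_factory=list)  # room for governance labels

orders = CatalogEntry(
    source_location="s3a://my-data-lake/raw/orders/",
    schema={"order_id": "string", "amount": "double", "placed_at": "timestamp"},
    short_description="Raw e-commerce orders",
    long_description="One record per order as exported nightly from the order system.",
)
```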
Large organizations with many data sources where metadata was an afterthought should consider implementing the semantic layer. This is especially true for complex industries like Financial Services, Healthcare, and Legal, where domain-specific terms can make data difficult to understand.
Platform Features
A multicloud data lake reference architecture is all about flexibility and scalability.
Modern cloud data platforms offer features like batch and stream data ingestion, tiered data storage capabilities, workload orchestration, a metadata layer, and ETL tools.
Having a simplified architecture can reduce the cost and complexity of cloud data analytics at scale.
ChaosSearch transforms your AWS or GCP cloud object storage into a hot data lake for analytics in the cloud.
Data is ingested from source systems directly into Amazon S3 object storage, where it's indexed with proprietary Chaos Index technology to create a full representation with 10-20x compression.
This allows users to query and transform data in Chaos Refinery before visualizing it and building dashboards in the integrated Kibana Open Distro.
Chaos Fabric, a stateless architecture, enables independent and elastic scaling of storage/compute resources with high data availability.
Platform Examples
In a multicloud data lake reference architecture, platforms like Amazon S3 and Azure Data Lake Storage play a crucial role in storing and managing large amounts of data.
These platforms offer scalable storage and object-level security, allowing you to store data in a secure and organized manner.
For example, Amazon S3's object-level security allows you to control access to individual objects within a bucket, giving you fine-grained control over who can access your data.
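As a sketch of that fine-grained control, the snippet below attaches a bucket policy granting one role read access to a single prefix; the account ID, role, bucket, and prefix are placeholders.

```python
# Sketch: a bucket policy granting a single role read access to one prefix only,
# illustrating object-level access control in S3. Account ID, role, bucket, and
# prefix are placeholders.
import json

import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AnalyticsReadOnlyFinanceData",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/analytics-team"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-data-lake/finance/*",
        }
    ],
}

boto3.client("s3").put_bucket_policy(Bucket="my-data-lake", Policy=json.dumps(policy))
```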
Azure Data Lake Storage, on the other hand, provides a hierarchical namespace, allowing you to store data in a structured and easily accessible way.
AWS
AWS offers a comprehensive cloud data platform architecture that integrates multiple services to manage data from various sources.
The data ingestion layer uses multiple applications and cloud services, such as Amazon AppFlow, Kinesis, and DataSync, to capture data.
Amazon S3 serves as the primary storage for data, which is then natively integrated with Amazon Redshift for data warehousing.
A data catalog layer on top of Redshift enables data discoverability.
The data processing layer supports SQL-based ELT, big data processing, and near real-time ETL with services like Redshift Spectrum, EMR, Glue, and Kinesis Data Analytics.
AWS services like the QuickSight BI tool and the SageMaker machine learning platform allow users to consume and analyze data.
Amazon Athena is another analytics service that can be used to consume data.
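For instance, a consumer could submit an ad hoc SQL query over data in S3 through the Athena API, as sketched below; the database name, query, and output location are placeholders.

```python
# Sketch: submit an ad hoc Athena query against data in S3 and poll until it
# finishes. Database name, query, and output bucket are placeholders.
import time

import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT count(*) FROM events_clean",
    QueryExecutionContext={"Database": "lake_warehouse"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)

query_id = execution["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(state)
```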
More Platform Examples
Amazon Web Services (AWS) is a cloud computing platform that offers a wide range of services, including computing power, database storage, and artificial intelligence tools.
Many businesses are using AWS to build scalable and secure applications.
Microsoft Azure is another cloud platform that provides a suite of services for computing, storage, and networking.
It's worth noting that Azure has a strong focus on artificial intelligence and machine learning capabilities.
Google Cloud Platform (GCP) is a cloud computing platform that provides a range of services, including computing power, storage, and big data processing.
GCP is particularly popular among businesses that need to process large amounts of data.
SaaS applications are also relevant here as source systems for the lake. Salesforce is a cloud-based customer relationship management (CRM) platform that helps businesses manage their sales, marketing, and customer service activities, and its data is a common candidate for ingestion into a data lake.
Shopify is an e-commerce platform that allows businesses to create online stores and sell their products to customers all over the world. It's particularly popular among small businesses and entrepreneurs, and its order and customer data can likewise feed a multicloud data lake.
Sources
- https://www.chaossearch.io/blog/cloud-data-platform-architecture-guide
- https://www.ibm.com/data-lake
- https://thenewstack.io/the-architects-guide-a-modern-data-lake-reference-architecture/
- https://www.qubole.com/data-lake-architecture
- https://blog.min.io/the-architects-guide-a-modern-datalake-reference-architecture/