Red Hat Ceph Storage is a scalable, open-source storage solution that's perfect for data lakes. It can handle massive amounts of unstructured data.
Ceph's distributed architecture allows it to scale horizontally, making it an ideal choice for large-scale data lakes. This means you can easily add more nodes as your data grows.
With Red Hat Ceph Storage, you can store and manage petabytes of data, making it a great fit for data lakes that require large storage capacities. It's also highly available, with no single point of failure.
Red Hat Ceph Storage supports a range of access protocols, including S3- and Swift-compatible object APIs (through the Ceph Object Gateway) and the native RADOS protocol, making it easy to integrate with existing applications and workflows.
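As a rough illustration, here is how an application might write to and read from the S3-compatible API exposed by the Ceph Object Gateway using the boto3 library. This is a minimal sketch: the endpoint URL, credentials, bucket name, and object key are placeholders, not values from any real deployment.

```python
# Minimal sketch: store and list objects through the S3-compatible API served
# by the Ceph Object Gateway (RGW). Endpoint, keys, and names are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",   # RGW endpoint (assumption)
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="datalake-raw")

# Store a raw data file exactly as an application would against Amazon S3.
s3.put_object(Bucket="datalake-raw",
              Key="events/2024/01/01/log.json",
              Body=b'{"event": "example"}')

# List what landed in the bucket.
for obj in s3.list_objects_v2(Bucket="datalake-raw").get("Contents", []):
    print(obj["Key"], obj["Size"])
```

Because the gateway speaks the S3 dialect, existing data lake tooling that already targets S3 can usually be pointed at Ceph by changing only the endpoint and credentials.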
Core Components
A Red Hat Ceph Storage cluster can have a large number of Ceph nodes for limitless scalability, high availability, and performance.
Each node leverages non-proprietary hardware and intelligent Ceph daemons that communicate with each other to carry out the cluster's work.
Here are some of the key operations performed by Ceph nodes:
- Write and read data
- Compress data
- Ensure durability by replicating or erasure coding data
- Monitor and report on cluster health
- Redistribute data dynamically
- Ensure data integrity
- Recover from failures
CRUSH Ruleset
Ceph clients and Ceph OSD daemons both use the CRUSH (Controlled Replication Under Scalable Hashing) algorithm to compute where data belongs, so reads, writes, rebalancing, and recovery all happen without consulting a central lookup table.
A CRUSH ruleset is assigned to each pool; when a Ceph client stores or retrieves an object in a pool, Ceph uses the ruleset to identify the primary OSD that contains the object's placement group.
The CRUSH map defines a hierarchical list of bucket types, which are located under "types" in the generated CRUSH map.
Here are some examples of bucket types that can be used in a CRUSH map:
- Drive type
- Hosts
- Chassis
- Racks
- Power distribution units
- Pods
- Rows
- Rooms
- Data centers
The purpose of creating a bucket hierarchy is to segregate leaf nodes by their failure domains and/or performance domains.
Administrators can define the hierarchy according to their own needs if the default types don’t suit their requirements.
Ceph supports a directed acyclic graph that models the Ceph OSD nodes, typically in a hierarchy.
This allows administrators to support multiple hierarchies with multiple root nodes in a single CRUSH map.
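To make the idea of failure domains concrete, the following toy Python sketch models a root → rack → host → OSD hierarchy and picks one OSD per rack for each replica. This is not Ceph code and does not implement the real CRUSH algorithm (which uses pseudo-random straw2 bucket selection driven by the cluster map); all bucket names are made up.

```python
# Toy model of a CRUSH-style bucket hierarchy: root -> rack -> host -> osd.
# Illustration of failure-domain separation only; real CRUSH uses straw2
# hashing against the cluster map rather than this simplified selection.
from dataclasses import dataclass, field
from typing import List
import hashlib

@dataclass
class Bucket:
    name: str
    btype: str                               # e.g. "root", "rack", "host", "osd"
    children: List["Bucket"] = field(default_factory=list)

def leaves(bucket: Bucket) -> List[Bucket]:
    """Return all leaf (osd) buckets under this bucket."""
    if not bucket.children:
        return [bucket]
    return [leaf for child in bucket.children for leaf in leaves(child)]

def place(root: Bucket, object_name: str, replicas: int) -> List[str]:
    """Choose one OSD from each of `replicas` distinct racks (failure domains)."""
    racks = [c for c in root.children if c.btype == "rack"]
    chosen = []
    for i, rack in enumerate(racks[:replicas]):
        osds = leaves(rack)
        # Deterministic, hash-based pick within the rack (stand-in for straw2).
        idx = int(hashlib.sha256(f"{object_name}:{i}".encode()).hexdigest(), 16) % len(osds)
        chosen.append(osds[idx].name)
    return chosen

# Two racks, each with one host holding two OSDs.
root = Bucket("default", "root", [
    Bucket("rack1", "rack", [Bucket("host1", "host",
        [Bucket("osd.0", "osd"), Bucket("osd.1", "osd")])]),
    Bucket("rack2", "rack", [Bucket("host2", "host",
        [Bucket("osd.2", "osd"), Bucket("osd.3", "osd")])]),
])

print(place(root, "my-object", replicas=2))   # one OSD per rack, e.g. ['osd.1', 'osd.2']
```

Because each replica lands in a different rack, losing an entire rack still leaves a surviving copy, which is exactly what segregating leaf nodes by failure domain buys you.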
Mandatory Exclusive Locks
Mandatory Exclusive Locks is a feature that locks an RBD image to a single client when multiple clients have the image mapped. This helps prevent write conflicts when several clients try to write to the same object.
This feature is built on object-watch-notify and is essential for operations like snapshot create/delete. It also provides added protection against a failed client corrupting an image.
Only one client can modify an RBD device at a time with Mandatory Exclusive Locks enabled. This is especially important when changing internal RBD structures.
You need to enable Mandatory Exclusive Locks explicitly with the --image-feature parameter when creating an image. For example, to create a 100 GB RBD image with layering and exclusive locking, you pass --image-feature 5, where 5 is the sum of 1 (layering support) and 4 (exclusive locking support).
Mandatory Exclusive Locks is also a prerequisite for object map. Without enabling exclusive locking support, object map support cannot be enabled.
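As a sketch of what this looks like programmatically, the python-rados and python-rbd bindings can create such an image by OR-ing the layering and exclusive-lock feature bits together. The pool name, image name, and ceph.conf path below are placeholders for your own environment.

```python
# Sketch using the python-rados / python-rbd bindings: create a 100 GB image
# with layering + exclusive locking enabled. Pool name, image name, and the
# ceph.conf path are placeholders.
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")                 # "rbd" pool assumed to exist
try:
    features = rbd.RBD_FEATURE_LAYERING | rbd.RBD_FEATURE_EXCLUSIVE_LOCK  # 1 + 4 = 5
    rbd.RBD().create(ioctx, "locked-image", 100 * 1024**3, features=features)
finally:
    ioctx.close()
    cluster.shutdown()
```

This mirrors what the `rbd create --image-feature 5` invocation described above does from the command line.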
Peering
Peering is a fundamental process in Ceph that ensures consistency across multiple OSDs. Ceph stores copies of placement groups on multiple OSDs.
These OSDs "peer" with one another to check that they agree on the status of each copy of the PG. Peering issues usually resolve themselves.
The Primary OSD in an Acting Set is responsible for coordinating the peering process for that placement group. It's the only OSD that will accept client-initiated writes to objects for a given placement group where it acts as the Primary.
The Acting Set is a series of OSDs that are responsible for storing a placement group. An Acting Set may refer to the Ceph OSD Daemons that are currently responsible for the placement group, or the Ceph OSD Daemons that were responsible for a particular placement group as of some epoch.
If the Primary OSD fails, the Secondary OSD becomes the Primary, and Ceph will remove the failed OSD from the Up Set. This ensures the placement group remains available for client access.
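To see the acting set and primary OSD for a given object, a client can ask the monitors the same question the `ceph osd map <pool> <object>` CLI command asks. The sketch below assumes the python-rados bindings and uses placeholder pool and object names; the exact field names in the JSON reply can vary between Ceph releases.

```python
# Sketch: ask the monitors which placement group and which OSDs (the acting
# set, with the primary reported separately) serve a given object.
# Equivalent to `ceph osd map <pool> <object>`. Names are placeholders.
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    cmd = json.dumps({"prefix": "osd map", "pool": "rbd",
                      "object": "my-object", "format": "json"})
    ret, out, errs = cluster.mon_command(cmd, b"")
    info = json.loads(out)
    print("pgid:", info.get("pgid"))
    print("acting set:", info.get("acting"))
    print("primary OSD:", info.get("acting_primary"))
finally:
    cluster.shutdown()
```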
The Solution
In our quest to build a high-performance storage system, we achieved a staggering 79.6 GiB/s aggregate throughput from a 10-node Ceph cluster.
This was made possible by using TLC SSDs for metadata and directly attaching them to Seagate Exos E storage enclosures containing 84 high-capacity enterprise disk drives for object data.
We combined these drives with Ceph's space-efficient erasure coding to maximize cost efficiency.
At 90% disk utilization, our cluster remained performant, thanks to careful optimization efforts.
Our cluster utilization figures showed an impressive 87.49% RAW USED percentage, with 11 PiB of storage utilized out of 12 PiB available.
We've been pushing Ceph's limits, storing 1 billion objects in a 7-node cluster in February 2020 and scaling up to 10 billion objects in a 6-node cluster by September of the same year.
Ceph's algorithmic placement allows it to store a very large number of objects relative to the number of nodes, making it capable of protecting and providing high-throughput access to trillions of objects.
Native Protocol
The Ceph Storage Cluster has a native protocol that provides a simple object storage interface with asynchronous communication capability. This interface is designed to meet the needs of modern applications.
With the Ceph client native protocol, you can access objects directly throughout the cluster in parallel. This means faster data retrieval and processing.
The protocol offers several key operations, including pool operations, snapshots, read/write objects, and more. Let's break down some of these operations:
- Pools: These are collections of objects, and the protocol allows you to create, manage, and access them.
- Snapshots: You can take snapshots of your data to create a point-in-time copy, which is useful for backup and recovery purposes.
- Read/Write Objects: The protocol enables you to read and write objects directly, allowing for efficient data access and modification.
- XATTRs and Key/Value Pairs: You can create, set, get, and remove extended attributes (XATTRs) and key/value pairs, which are useful for storing metadata.
- Compound operations and dual-ack semantics: The protocol supports compound operations, which let you perform multiple operations as a single unit, and dual-ack semantics, where a write is acknowledged once when the replicas have received it and again when it has been committed to stable storage.
By using the Ceph client native protocol, you can build efficient and scalable applications that take advantage of the Ceph Storage Cluster's capabilities.
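Here is a minimal sketch of that native interface using the python-rados bindings: it writes an object, reads it back, attaches an XATTR, and issues one asynchronous write. The pool name and object names are placeholders.

```python
# Sketch of the native protocol via the python-rados bindings. The pool name
# "mypool" and the object names are placeholders.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("mypool")
try:
    # Write and read an object directly in the cluster.
    ioctx.write_full("greeting", b"hello from librados")
    print(ioctx.read("greeting"))

    # Store and retrieve metadata on the object as an extended attribute.
    ioctx.set_xattr("greeting", "lang", b"en")
    print(ioctx.get_xattr("greeting", "lang"))

    # Asynchronous write: the completion fires when the write is acknowledged.
    completion = ioctx.aio_write_full("async-object", b"written asynchronously")
    completion.wait_for_complete()
finally:
    ioctx.close()
    cluster.shutdown()
```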
Object Map
In Ceph, the object map is an RBD feature that tracks which backing RADOS objects actually exist for a block device image. Because RBD images are thinly provisioned, many of an image's possible objects may never have been written.
Keeping an up-to-date object map means the client does not have to query the OSDs to find out whether each object exists, which speeds up operations such as cloning, resizing, exporting, flattening, and deleting images.
The object map is maintained on the librbd client and depends on exclusive locking: as noted above, object map support cannot be enabled unless exclusive locking is enabled first.
For sparse, thinly provisioned images in particular, a well-maintained object map noticeably improves the performance of maintenance operations that would otherwise have to touch every possible object.
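If you want to turn the feature on for an existing image, the python-rbd bindings expose update_features for this. The sketch below assumes exclusive locking is already enabled on the image and uses placeholder pool and image names.

```python
# Sketch: enable the object-map feature on an existing RBD image. Assumes the
# exclusive-lock feature is already enabled (it is a prerequisite); pool and
# image names are placeholders.
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")
try:
    with rbd.Image(ioctx, "locked-image") as image:
        image.update_features(rbd.RBD_FEATURE_OBJECT_MAP, True)
finally:
    ioctx.close()
    cluster.shutdown()
```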
Data Storage
Red Hat Ceph Storage is a scalable, open source, software-defined storage platform that combines the most stable version of the Ceph storage system with a Ceph management platform, deployment utilities, and support services.
It's designed for cloud infrastructure and web-scale object storage, making it a reliable choice for storing large amounts of data.
Red Hat Ceph Storage clusters consist of five types of nodes: Red Hat Storage Management node, Monitor nodes, OSD nodes, Object Gateway nodes, and MDS nodes. Each node plays a crucial role in managing and storing data within the cluster.
These nodes work together to provide a robust and scalable storage solution that can handle the demands of big data and cloud infrastructure.
Replication
Replication is a crucial aspect of Ceph storage, allowing for high data availability and safety. Like Ceph clients, Ceph OSDs contact Ceph monitors to retrieve the latest copy of the cluster map, so every daemon works from the current cluster layout.
In a typical write scenario, a Ceph client uses the CRUSH algorithm to compute the placement group ID and the primary OSD in the Acting Set for an object. The primary OSD then finds the number of replicas it should store, which is determined by the osd_pool_default_size setting.
Ceph OSDs use the CRUSH algorithm to compute where to store replicas of objects, ensuring that data is distributed across the cluster. This allows for data to be read and written even if one of the OSDs in an acting set fails.
The primary OSD and secondary OSDs are typically configured to be in separate failure domains, which CRUSH takes into account when computing the IDs of the secondary OSDs. This ensures that data is not lost in the event of a failure.
Ceph defaults to making three copies of an object with a minimum of two copies clean for write operations. This means that even if two OSDs fail, data will still be preserved, although write operations will be interrupted.
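You can confirm these settings for a pool programmatically. The sketch below uses the python-rados mon_command interface with the same arguments as `ceph osd pool get`; the pool name is a placeholder.

```python
# Sketch: read a pool's replica count ("size") and the minimum number of clean
# copies required for I/O ("min_size"), the equivalent of
# `ceph osd pool get <pool> size|min_size`. Pool name is a placeholder.
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    for var in ("size", "min_size"):
        cmd = json.dumps({"prefix": "osd pool get", "pool": "rbd",
                          "var": var, "format": "json"})
        ret, out, errs = cluster.mon_command(cmd, b"")
        print(json.loads(out))    # e.g. {"size": 3} then {"min_size": 2}
finally:
    cluster.shutdown()
```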
For erasure-coded pools, Ceph needs to store chunks of an object across multiple OSDs so that it can operate in a degraded state. The supported jerasure coding values for k and m are:
- k=8, m=3
- k=8, m=4
- k=4, m=2
This allows for data to be read and written even if some of the OSDs fail, ensuring high data availability and safety.
Data
Data is a complex and multifaceted concept, but at its core, it's simply a collection of information.
There are many types of data, including structured data, which is organized and easily searchable, and unstructured data, which is free-form and difficult to analyze.
Structured data is often stored in databases, which are designed to quickly and efficiently retrieve specific pieces of information.
Unstructured data, on the other hand, is often stored in files or documents, which can be more challenging to work with.
Data storage systems, such as hard drives and solid-state drives, are designed to hold and manage large amounts of data.
These systems come in a range of capacities, from a few gigabytes to several terabytes, making them suitable for everything from small personal projects to large-scale enterprise applications.
The way we store and manage data has a significant impact on our ability to access and use it effectively.
Data Integrity and Availability
Data integrity is crucial for a reliable data lake, and Red Hat Ceph Storage has it covered. Ceph provides mechanisms to guard against bad disk sectors and bit rot through scrubbing and CRC checks.
Scrubbing is a daily process where Ceph OSD Daemons compare object metadata with its replicas, catching bugs and storage errors. Deep scrubbing, performed weekly, finds bad sectors on a drive that weren’t apparent in a light scrub. Ceph can also ensure data integrity by conducting a cyclic redundancy check (CRC) on write operations and storing the CRC value in the block database.
On read operations, Ceph retrieves the stored CRC value and compares it with a CRC generated from the retrieved data, so any corruption is detected immediately.
Erasure Coding
Erasure coding is a way to protect data from loss by splitting it into multiple pieces and distributing them across a cluster of nodes. This approach allows for the reconstruction of data even if some nodes fail.
Data is split into smaller pieces, called fragments (or chunks), which are encoded and stored across different nodes to ensure durability. If some fragments are lost, the original data can be rebuilt from the surviving fragments.
Erasure coding can be used to protect against node failures, and it's particularly useful in distributed storage systems. By distributing data across multiple nodes, erasure coding provides a high level of data durability and availability.
The goal of erasure coding is to ensure that data can be reconstructed even if some nodes fail, and in Ceph it is typically used as a more space-efficient alternative to replication.
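The following deliberately simplified Python sketch shows the principle with the smallest possible code: two data fragments plus one XOR parity fragment, from which a lost fragment can be rebuilt. Ceph's jerasure plugin uses more general codes with configurable k and m, but the recovery idea is the same.

```python
# Simplified illustration of erasure coding: k=2 data fragments + m=1 XOR
# parity fragment. Not Ceph's implementation, just the underlying idea.

def encode(data: bytes):
    half = (len(data) + 1) // 2
    d1, d2 = data[:half], data[half:].ljust(half, b"\0")
    parity = bytes(a ^ b for a, b in zip(d1, d2))
    return d1, d2, parity            # each fragment would live on a different node

def recover_d1(d2: bytes, parity: bytes) -> bytes:
    # XOR is its own inverse, so a missing data fragment can be rebuilt
    # from the surviving data fragment and the parity fragment.
    return bytes(a ^ b for a, b in zip(d2, parity))

d1, d2, parity = encode(b"data lake!")
assert recover_d1(d2, parity) == d1
print("lost fragment rebuilt:", recover_d1(d2, parity))
```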
Data Integrity
Data integrity is a crucial aspect of maintaining reliable data storage. Ceph provides a robust mechanism to guard against bad disk sectors and bit rot through scrubbing.
Scrubbing is a process where Ceph OSD Daemons compare object metadata in one placement group with its replicas stored in placement groups on other OSDs. This is usually performed daily and catches bugs and storage errors.
Deep scrubbing is a more thorough process that compares data in objects bit-for-bit, usually performed weekly. It finds bad sectors on a drive that weren’t apparent in a light scrub.
Ceph also ensures data integrity by conducting a cyclic redundancy check (CRC) on write operations. This CRC value is then stored in the block database.
On read operations, Ceph retrieves the CRC value from the block database and compares it with the generated CRC of the retrieved data. This ensures data integrity instantly.
High Availability
High availability is crucial for any data storage system. Ceph must maintain high availability so that clients can read and write data even when the cluster is in a degraded state or when a monitor fails; Ceph runs a small cluster of monitors that use the Paxos algorithm to agree on the cluster map, so losing one monitor does not take the cluster down as long as a majority of monitors remains.
Ceph's CRUSH algorithm enables high scalability, but it's not enough on its own. The cluster must be able to handle failures and still provide access to data.
In a degraded state, Ceph clients can still read and write data, which is a testament to the system's high availability. This is made possible by Ceph's design, which allows it to continue functioning even when some components are down.
Ceph's high availability is essential for applications that require constant access to data, such as databases and file systems. These applications rely on the system being always available, even in the face of hardware failures or network issues.
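In practice, an application or operator can check whether the cluster is healthy or running degraded before leaning on it. The sketch below issues the equivalent of `ceph health` through the python-rados bindings; the exact fields in the JSON reply may differ between Ceph releases.

```python
# Sketch: programmatic health check, the equivalent of `ceph health`.
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    cmd = json.dumps({"prefix": "health", "format": "json"})
    ret, out, errs = cluster.mon_command(cmd, b"")
    health = json.loads(out)
    print("cluster health:", health.get("status"))   # e.g. HEALTH_OK or HEALTH_WARN
finally:
    cluster.shutdown()
```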
Frequently Asked Questions
What is Red Hat Ceph?
Red Hat Ceph Storage is a scalable, open-source storage platform for cloud infrastructure and web-scale object storage. It combines the Ceph storage system with management tools and support services for a robust and efficient storage solution.
What is the difference between big data and data lake?
Big Data refers to vast amounts of data, while a Data Lake is a storage repository that holds this data in its raw form, regardless of its structure or origin.
What kind of storage is Ceph?
Ceph is an open-source software-defined storage solution that offers block, file, and object storage capabilities. It's a scalable storage solution ideal for high-growth data storage needs.
Sources
- https://docs.redhat.com/en/documentation/red_hat_ceph_storage/6/html-single/architecture_guide/index
- https://ceph.io/en/news/blog/2021/diving-into-the-deep/
- https://red-hat-storage.github.io/ceph-test-drive-bootstrap/
- http://www.hyperscalers.com/red-hat-ceph-hyperscale-storage-block-object-nas-san-appliance-buy-reccord
- https://tfir.io/share-ceph-storage-between-kubernetes-clusters-with-openshift-container-storage/