Excited, because in addition to ref copies/clones, httm will use this feature, if available (I've already done some work to implement it), for its `--roll-forward` operation and for faster file recoveries from snapshots [0].
As I understand it, there will be no need to copy any data from the same dataset, and this includes all snapshots. Blocks written to the live dataset can just be references to the underlying blocks, and no additional space will need to be used.
Imagine being able to continuously switch a file or a dataset back to a previous state extremely quickly without a heavyweight clone, or a rollback, etc.
Right now, httm simply diff copies the blocks for file recovery and roll-forward. For further details, see the man page entry for `--roll-forward`, and the link to the httm GitHub below:
--roll-forward="snap_name"
Traditionally, 'zfs rollback' is a destructive operation, whereas httm roll-forward is non-destructive. httm will copy only the blocks and file metadata that have changed since a specified snapshot, from that snapshot, to its live dataset. httm will also take two precautionary snapshots, one before and one after the copy. Should the roll forward fail for any reason, httm will roll back to the pre-execution state. Note: This is a ZFS-only option which requires super user privileges.
This is huge.
One practical application is fast recovery of a file from a past snapshot without using any additional space. I use a ZFS dataset for my vCenter datastore (storing my vmdk files). If I need to launch a clone from a past state, I could use block cloning to bring back a past vmdk file without actually copying it, which saves both the space and the time needed to make such a clone.
They're likely using a NAS (e.g., FreeNAS/TrueNAS) that uses ZFS for the underlying storage and then shares that storage with their vSphere cluster over either NFS or iSCSI. In some rare cases, FC is used instead of NFS or iSCSI.
It seems kind of like hard linking but with copy-on-write for the underlying data, so you'll get near-instant file copies and writing into the middle of them will also be near-instant.
All of this already happens under the covers if you have dedup turned on, but this allows utilities and applications to tell ZFS "these blocks are going to be the same as those" without ZFS needing to hash all the new blocks and compare them. (GNU cp might be taught to opportunistically and transparently use the new ZFS clone syscalls, because there is no downside, only upside.)
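A minimal sketch of what that "opportunistic and transparent" behavior could look like from userspace. The `FICLONE` ioctl number comes from Linux's `<linux/fs.h>`; the helper name `clone_or_copy` is my own invention, and a real tool like cp does considerably more (metadata, sparse files, error reporting):

```python
import fcntl
import shutil

# FICLONE ioctl request number from <linux/fs.h> (Linux-specific).
FICLONE = 0x40049409

def clone_or_copy(src_path, dst_path):
    """Try a block clone (reflink); fall back to a byte-for-byte copy."""
    try:
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            # Ask the kernel to make dst's blocks references to src's blocks.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
        return "cloned"
    except OSError:
        # Filesystem (or platform) doesn't support cloning: copy the data.
        shutil.copyfile(src_path, dst_path)
        return "copied"
```

On a clone-capable filesystem the ioctl succeeds instantly and no data moves; everywhere else the fallback keeps the tool working, which is exactly why there's "no downside".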
Additionally, for finer control, ranges of blocks can be cloned, not just entire files.
I can't tell from the github issue, can this manual dedup / block cloning be turned on if you're not already using dedup on a dataset? Last time I set up zfs, I was warned that dedup took gobs of memory, so I didn't turn it on.
>When --reflink[=always] is specified, perform a lightweight copy, where the data blocks are copied only when modified. If this is not possible the copy fails, or if --reflink=auto is specified, fall back to a standard copy. Use --reflink=never to ensure a standard copy is performed.
It's orthogonal to dedup being on or off, and as someone else said, it's more or less the same underlying semantics you would expect from cp --reflink anywhere.
Also, as mentioned, on Linux, it's not wired up with any interface to be used at all right now.
FTFA: "Block Cloning allows to clone a file (or a subset of its blocks) into another (or the same) file by just creating additional references to the data blocks without copying the data itself. Block Cloning can be described as a fast, manual deduplication."
As others have said: block cloning (which builds on the filesystem's copy-on-write design) allows you to 'copy' a file without reading all of the data and re-writing it.
For example, if you have a 1 GB file and you want to make a copy of it, you need to read the whole file (all at once or in parts) and then write the whole new file (all at once or in parts). This results in 1 GB of reads and 1 GB of writes. Obviously the slower (or more overloaded) your storage media is, the longer this takes.
With block cloning, you simply tell the OS "I want this file A to be a copy of this file B" and it creates a new "file" that references all the blocks in the old "file". Given that a "file" on a filesystem is just a list of blocks that make up the data in that file, you can create a new "file" which has pointers to the same blocks as the old "file". This is a simple system call (or a few system calls), and as such isn't much more intensive than simply renaming a file instead of copying it.
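That "simple system call (or a few system calls)" can be sketched concretely. On Linux the range-clone interface mentioned earlier is the `FICLONERANGE` ioctl, which takes a `struct file_clone_range`; the constant and struct layout below come from `<linux/fs.h>`, while the helper name and its fallback path are illustrative:

```python
import fcntl
import os
import struct

# FICLONERANGE ioctl request number from <linux/fs.h> (Linux-specific).
FICLONERANGE = 0x4020940D

def clone_range_or_copy(src_fd, dst_fd, src_off, length, dst_off):
    """Ask the kernel to map a byte range of src into dst without copying.

    Offsets and length generally must be block-aligned for the clone to
    succeed; on failure, fall back to actually moving the bytes.
    """
    try:
        # struct file_clone_range { __s64 src_fd;
        #                           __u64 src_offset, src_length, dest_offset; }
        arg = struct.pack("qQQQ", src_fd, src_off, length, dst_off)
        fcntl.ioctl(dst_fd, FICLONERANGE, arg)
        return "cloned"
    except OSError:
        data = os.pread(src_fd, length, src_off)
        os.pwrite(dst_fd, data, dst_off)
        return "copied"
```

When the clone succeeds, the destination's block list simply gains pointers to the source's existing blocks, which is why the operation costs about as much as a rename rather than a full read-and-rewrite.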
At my previous job we did builds for our software. This required building the BIOS, kernel, userspace, generating the UI, and so on. These builds required pulling down 10+ GB of git repositories (the git data itself, the checkout, the LFS binary files, external vendor SDKs), and then a large amount of build artifacts on top of that. We also needed to do this build for 80-100 different product models, for both release and debug versions. This meant 200+ copies of the source code alone (not to mention build artifacts and intermediate products), and because of disk space limitations this meant we had to dramatically reduce the number of concurrent builds we could run. The solution we came up with was something like:
1. Check out the source code
2. Create an overlayfs filesystem to mount into each build space
3. Do the build
4. Tear down the overlayfs filesystem
This was problematic if we weren't able to mount the filesystem, if we weren't able to unmount the filesystem (because of hanging file descriptors or processes), and so on. Lots of moving parts, lots of `sudo` commands in the scripts, and so on.
Copy-on-write would have solved this for us by accomplishing the same thing; we could simply do the following:
1. Check out the source code
2. Have each build process simply `cp -R --reflink=always source/ build_root/`; this would be near-instantaneous and use essentially no new disk space.
3. Do the build
4. `rm -rf build_root`
Fewer moving parts, no root access required, generally simpler all around.
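For a build script that wants this behavior without shelling out to cp, the same workflow can be sketched in a few lines. Everything here is a hypothetical helper; the `FICLONE` constant is real (from Linux's `<linux/fs.h>`), and the fallback means the script still works on a non-COW filesystem, just without the space savings:

```python
import fcntl
import shutil

FICLONE = 0x40049409  # from <linux/fs.h>; Linux-specific

def reflink_or_copy(src, dst):
    """copy_function for shutil.copytree: share blocks when possible."""
    try:
        with open(src, "rb") as s, open(dst, "wb") as d:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
    except OSError:
        shutil.copyfile(src, dst)  # non-COW filesystem: real copy
    shutil.copystat(src, dst)
    return dst

def make_build_root(source, build_root):
    # Step 2 of the workflow above: on a COW filesystem this is an
    # instant, (nearly) space-free "copy" of the whole source tree.
    shutil.copytree(source, build_root, copy_function=reflink_or_copy)
```

Step 4 is then just `shutil.rmtree(build_root)`: no mounts to tear down, no root, no hanging filesystem state.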
It can be a really convenient way to snapshot something if you can arrange some point at which everything is synced to disk. Get to that point, make your new files that start sharing all their blocks, and then let your main db process (or whatever) continue on as normal.
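The sync-then-clone pattern described above can be sketched like so. `snapshot_file` is a hypothetical helper, `FICLONE` is the real Linux ioctl constant, and the fallback copy keeps it working where cloning isn't supported:

```python
import fcntl
import os
import shutil

FICLONE = 0x40049409  # from <linux/fs.h>; Linux-specific

def snapshot_file(live_path, snap_path):
    """Flush the live file to disk, then take a zero-copy 'snapshot' of it."""
    fd = os.open(live_path, os.O_RDONLY)
    try:
        os.fsync(fd)  # the agreed-upon sync point: everything is on disk now
        with open(snap_path, "wb") as snap:
            try:
                fcntl.ioctl(snap.fileno(), FICLONE, fd)  # share the blocks
            except OSError:
                shutil.copyfile(live_path, snap_path)  # non-COW fallback
    finally:
        os.close(fd)
```

After this returns, the main process can keep writing to the live file: copy-on-write means new writes get fresh blocks, and the snapshot keeps referencing the old ones.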