Traditional 'distcp' from one directory to another or from cluster to cluster is quite useful in moving massive amounts of data, once. But what happens when you need to "update" a target directory or cluster with only the changes made since the last 'distcp' had run. That becomes a very tricky scenario. 'distcp' offers an '-update' flag, which is suppose to move only the files that have changed. In this case 'distcp' will pull a list of files and directories from the source and targets, compare them and then build a migration plan.
The problem: It's an expensive and time-consuming task. Furthermore, the process is not atomic. First, the cost of gathering a list of files and directories, along with their metadata is expensive when you're considering sources with millions of file and directory objects. And this cost is incurred on both the source and target namenode's, resulting in quite a bit of pressure on those systems.
It's up to 'distcp' to reconcile the difference between the source and target, which is very expensive. When it's finally complete, only then does the process start to move data. And if data changes while the process is running, those changes can impact the transfer and lead to failure and partial migration.
The process needs to be atomic, and it needs to be efficient. With Hadoop 2.0, HDFS introduce "snapshots." HDFS "snapshots" are a point-in-time copy of the directories metadata. The copy is stored in a hidden location and maintains references to all of the immutable filesystem objects. Creating a snapshot is atomic, and the characteristics of HDFS (being immutable) means that an image of a directories metadata doesn't require an addition copy of the underlying data.
Another feature of snapshots is the ability to efficiently calculate changes between 'any' two snapshots on the same directory. Using 'hdfs snapshotDiff ', you can build a list of "changes" between these two point-in-time references.
[hdfs@m3 ~]$ hdfs snapshotDiff /user/dstreev/stats s1 s2
Difference between snapshot s1 and snapshot s2 under directory /user/dstreev/stats:
Let's take the 'distcp' update concept and supercharge it with the efficiency of snapshots. Now you have a solution that will scale far beyond the original 'distcp -update.' and in the process remove the burden and load from the namenode's previously encountered.
Pre-Requisites and Requirements
- Source must support 'snapshots'
hdfs dfsadmin -allowSnapshot <path>
- Target is "read-only"
- Target, after initial baseline 'distcp' sync needs to support snapshots.
- Identify the source and target 'parent' directory
- Do not initially create the destination directory, allow the first distcp to do that. For example: If I want to sync source `/data/a` with `/data/a_target`, do *NOT* pre-create the 'a_target' directory.
- Allow snapshots on the source directory
hdfs dfsadmin -allowSnapshot /data/a
- Create a Snapshot of /data/a
hdfs dfs -createSnapshot /data/a s1
- Distcp the baseline copy (from the atomic snapshot). Note: /data/a_target does NOT exists prior to the following command.
hadoop distcp /data/a/.snapshot/s1 /data/a_target
- Allow snapshots on the newly create target directory
hdfs dfsadmin -allowSnapshot /data/a_target
- At this point /data/a_target should be considered "read-only". Do NOT make any changes to the content here.
- Create a matching snapshot in /data/a_target that matches the name of the snapshot used to build the baseline
hdfs dfs -createSnapshot /data/a_target s1
- Add some content to the source directory /data/a. Make changes, add, deletes, etc. that need to be replicated to /data/a_target.
- Take a new snapshot of /data/a
hdfs dfs -createSnapshot /data/a s2
- Just for fun, check on whats changed between the two snapshots
hdfs snapshotDiff /data/a s1 s2
- Ok, now let's migrate the changes to /data/a_target
hadoop distcp -diff s1 s2 -update /data/a /data/a_target
- When that's completed, finish the cycle by creating a matching snapshot on /data/a_target
hdfs dfs -createSnapshot /data/a_target s2
That's it. You've completed the cycle. Rinse and repeat.
A Few Hints
Remember, snapshots need to be managed manually. They will stay around forever unless you clean them up with:
hdfs dfs -deleteSnapshot
As long as a snapshot exists, the data exists. Deleting, even with skipTrash, data from a directory that has a snapshot, doesn't free up space. Only when all "references" to that data are gone, can space be reclaimed.
Initial migrations of data between systems are very expensive in regards to network I/O. And you probably don't want to have to do that again, ever. I recommend keeping a snapshot of the original copy on each system OR some major checkpoint you can go back to, in the event the process is compromised.
If 'distcp' can't validate that the snapshot (by name) between the source and the target are the same and that the data at the target hasn't changed since the snapshot, the process will fail. If the failure is because the directory has been updated, you'll need to use the above baseline snapshots to restore it without having to migrate all that data again. And then start the process up again.