ZFS file recovery: slow and expensive, but not impossible

This is just a quickie. When I accidentally deleted a Proxmox volume while migrating servers, every resource I tried basically said “it’s gone, buddy”, so I want to give someone a bit of hope that it’s not gone, buddy.

Also, check your backups are running regularly.

As an indication, this is a story of recovering a file on a ZFS RAID-Z2 pool on 18x 1.8TB 10K SAS HDDs.

The Story

Basically, I was migrating Proxmox servers. I had a very large 1TB volume on SSD, with file backups, that I used for important files on a local NAS, and a 6TB file on shared HDDs that was used for media (not backed up, as it would be annoying but not impossible to replace). Unbeknownst to me, the off-site file-level (not volume-level) backup on the old server had stopped running, and I only discovered that when it was too late.

Since it’s quite difficult to move and recreate volumes between Proxmox servers, I migrated the 1TB volume to the HDD share, then shared it to the new server, and finally backed up the VM itself but not the huge storage devices.

I restored the backup, which created a matching VM (under a new volume number), and rescanned for devices to test. The volume numbers differed, so it didn’t find them, but another VM did. I then started copying the volumes locally to the new volume number, which took a very long time because they were 1TB and 6TB respectively.

A day or so later I noticed an extra unused volume on one of my VMs and, without thinking, decided it must be a remnant of an earlier migration… and deleted it. I immediately died a little inside when I realised it was the 1TB important-files volume from the old machine, which just happened to carry the same VM number as a VM on the new machine.

That’s when I discovered my file backups weren’t running.
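For what it’s worth, the check I wish I’d had is tiny. This is a minimal sketch, not what I actually run: it assumes your backup job writes files into a single directory, and the function names and the 36-hour threshold are made up for illustration. It simply treats a backup directory as stale if nothing in it has been modified recently.

```python
import time
from pathlib import Path


def newest_mtime(backup_dir):
    """Return the newest file modification time (epoch seconds) under
    backup_dir, or None if the directory is missing or has no files."""
    root = Path(backup_dir)
    if not root.is_dir():
        return None
    times = [p.stat().st_mtime for p in root.rglob("*") if p.is_file()]
    return max(times) if times else None


def backup_is_stale(backup_dir, max_age_hours=36, now=None):
    """True if no file under backup_dir was modified within max_age_hours.

    max_age_hours=36 is an arbitrary example for a nightly job with slack.
    """
    now = time.time() if now is None else now
    newest = newest_mtime(backup_dir)
    if newest is None:
        # A missing or empty backup directory counts as stale.
        return True
    return (now - newest) > max_age_hours * 3600
```

Run something like this from a scheduled job and have it email or otherwise nag you when `backup_is_stale(...)` returns `True`. It only proves that files are being written, not that they can be restored, so it is a floor rather than a substitute for a test restore.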

Story ends: I was stupid, the files were gone, and Google said “ZFS recovery from a RAID-Z is impossible”, citing lots of sources. I later found this wasn’t true, but it does require some blind faith.

Recovery

First thing I’ll immediately say is that the file has been restored and I’ve been running it for months now with backups and no troubles.

I tried a bunch of Linux FOSS tools, but I can’t remember their names now. The closest one had something to do with image recovery, and it ALMOST worked: it found a bunch of files on ZFS, but it didn’t support the file formats for virtual drives and I couldn’t work out how to add that support (custom file formats required a starting hash of an example file, but I couldn’t generate one that worked).

Now days into it, with no success, I started looking at paid solutions, and what came up was Klennet and UFS. Both were several hundred dollars. I haven’t tried Klennet and don’t know whether it works or not, but I tried UFS and it DID work. I chose it simply because it looked more professional and they had documentation and articles which clearly detailed the process, so I felt like I knew what I was getting into. It also supported Linux (although I didn’t end up using that version).

I ended up using the Windows version of UFS, NOT the Linux one. My reasoning was that Windows itself wouldn’t understand my ZFS drives, so it wouldn’t try to mount, read, write or generally fuck with them (which ended up being true; the Linux version probably wouldn’t have done those things either, but it COULD have). I was pretty panicked at this point.

Anyway, I built a Windows To Go USB using Rufus and left it unactivated, copied the UFS files to it and booted Windows on my server.

Installed, ran, activated.

Unfortunately I haven’t taken any screenshots or anything as I was just focussed on making it happen, and not the future article.

The UI wasn’t difficult, but it also wasn’t easy, since it found all the drives as well as all the pools and I had to work out which to use. The actual scan and file discovery took DAYS, like 5 days. It ended up being more like 10 days, though, because it would often stop counting down the “time to completion”, so I’d think it had frozen or that I’d gotten the settings wrong (I hadn’t), and I stopped it and retried a couple of times. On the final attempt I just trusted it was working, and I could see it going through each drive by monitoring disk access in Task Manager. And it did work.
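If you want a rough feel for whether “days” is sane for your own pool before you commit, the arithmetic is simple. This is only a back-of-the-envelope sketch and it leans on a big assumption of mine: that the scan reads every drive in full exactly once. I don’t know how UFS actually walks the disks, so treat the numbers as an order of magnitude and nothing more. The function names are invented for the example.

```python
TB = 10**12  # decimal terabyte, the way drive vendors count it


def implied_read_rate(drive_count, drive_tb, scan_days):
    """Average aggregate bytes/second, ASSUMING the scan reads every
    drive in full exactly once over scan_days."""
    total_bytes = drive_count * drive_tb * TB
    return total_bytes / (scan_days * 86400)


def estimated_scan_days(drive_count, drive_tb, bytes_per_second):
    """Days needed to read every drive once at a given aggregate rate."""
    total_bytes = drive_count * drive_tb * TB
    return total_bytes / bytes_per_second / 86400


# My pool: 18 drives of 1.8TB each, with a scan that took about 5 days.
rate = implied_read_rate(18, 1.8, 5)
print(f"{rate / 10**6:.0f} MB/s aggregate")  # prints "75 MB/s aggregate"
```

Plugging your own drive count and a throughput figure you’ve watched in Task Manager into `estimated_scan_days` gives you something to compare against when the progress counter stops moving.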

Finally it returned the file system, showed the file I wanted, and, after a couple more days, recovered the 1TB file onto a 2TB USB HDD I ran out and bought, because who has more than 1TB external USB devices anymore? Not me, apparently.

I took my USB, copied the file back onto both the new and old server…and another copy just to be sure, renamed it, mounted it and success. Everything worked as it was supposed to.
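With that many copies of a 1TB file flying around, it’s worth confirming they’re actually identical before trusting any of them. I didn’t do this at the time; the following is a sketch of how you could, using nothing but Python’s standard `hashlib`. The function names are just for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in 1MiB chunks so a 1TB image never has to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, *copies):
    """True if every copy has the same SHA-256 digest as the original."""
    expected = sha256_of(original)
    return all(sha256_of(copy) == expected for copy in copies)
```

Hashing a file this size takes a while, since every copy has to be read end to end, but after a week or more of waiting on the recovery that’s a cheap bit of reassurance.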

Conclusion

Hopefully someone reads this and recovers a lost ZFS file. It is not impossible, despite what Reddit and lots of forums say, but it is SLOW and expensive on a RAID-Z.

Honestly, I don’t use RAID-Z anymore. I exclusively use 10TB LFF drives in a ZFS mirror, which can be natively scanned by most file recovery software, and I keep a good eye on my backups. Both would be my first recommendations.

Since UFS worked, Klennet probably does too, but I only tried one. Be prepared to get your wallet out for either, though, so the files have to be worth something like US$600 or more.
