It’s been a while since we shared the story of an incident with you, and that’s probably a good thing – most operational incidents we had in the past year were “boring” enough in nature to fix them easily. This time, we’ve got a story of a data loss, caused by pure and simple human error – and the story of how we recovered the data.
Even though it is quite embarrassing how the data loss happened, we think it’s worth sharing the story of its recovery, as it might allow you to learn a few useful things in case you ever end up in a similar situation.
As you might have seen, over the last 7 months we’ve extended our offerings beyond ticketing to allow our customers to transform their events into the digital space as long as the global pandemic makes traditional event formats impossible. The result of our effort is a joint venture called Venueless that you should absolutely check out if you haven’t yet.
One component of the virtual events we run on Venueless is live video streaming. In this process, our customers use a tool like OBS or StreamYard to create a live video stream. The stream is then sent to an encoding server of ours via RTMP. On the encoding server, we re-encode the stream into different quality levels and then distribute it to our very own tiny streaming CDN.
Venueless currently does not yet include a video-on-demand component and usually, our customers record their content at the source, e.g. with OBS or StreamYard, and process or publish it on their own. However, just to be safe, we keep a recording of the incoming stream as well. This isn’t currently part of our promoted service offering; rather, we see it as a free backup service to our clients in case they lose their recording. Given that we already consider it to just be a backup, we currently don’t make any further backups of this data.
Usually, we delete these recordings after a while, but in some cases, our customers ask us to get them, e.g. because their own recording failed, or because StreamYard only records the first 8 hours of every stream. Since this doesn’t happen a lot, it’s not yet an automated process in our system. Whenever a customer requests a recording we SSH into the respective encoding server and move the recording file to a directory that’s accessible through HTTP, like this:
/var/recordings $ mv recording-12345.flv public/
That’s it, we share the link with the customer, and the process is done. One of the simplest steps possible in all this. Yesterday, a customer asked us for the recordings of the last two streams of their event. Just before finishing up for the week, I wanted to supply them with the required files, SSH’d into the server, looked for the correct files and typed…
/var/recordings $ mv recording-16678.flv recording-16679.flv
Oops. I hit return before typing out public/, and therefore replaced the last stream with the second-last, losing one of the videos.
Having a very naive understanding of how file systems work, I knew that the mv command had only changed the directory listing of the file system, but hadn’t actually wiped the file from the disk, so I knew there was likely still a chance to recover the file, as long as it wasn’t overwritten by something else in the meantime.
Since I didn’t manage to re-mount the root partition as read-only the gentle way to avoid further damage, I used the big hammer to remount everything read-only immediately:
# echo u > /proc/sysrq-trigger
Uhm, okay, this worked, but how do I install any data recovery tools now? After some experiments, I decided it would be easiest to reboot into the recovery system provided by our server provider Hetzner. So I configured the boot loader to boot their recovery system from the network and forcefully rebooted the server.
To be able to perform disk dumps and have some operational flexibility without downloading a 2 TB disk image to my local machine (which would take roughly a week), I also quickly purchased a Hetzner Storage Box with 5 TB of space.
Just before I executed my fatal mv command, I had run ls -lisah to get a directory listing of the files:
3146449 1.1G -rw-r--r-- 1 www-data www-data 1.1G Nov XX XX:XX recording-16678.flv
3146113 1.6G -rw-r--r-- 1 www-data www-data 1.6G Nov XX XX:XX recording-16679.flv
This meant I knew the inode number of the deleted file! As I mentioned before, my understanding of file systems was (and is) rather naive, and I was pretty optimistic to be able to recover the file using that information. Isn’t that sort of what a journaling file system is for?
Recovering the file this way, however, appeared to be impossible. ext4magic and extundelete are powerful tools that did find some deleted files on my disk – but not the one I was looking for, even after trying different options for over two hours.
I did not spend the time to really understand how ext4 works, but from what I gathered from various blogs, I was pretty much out of luck, since the inode no longer contained the relevant information and ext4magic wasn’t able to recover the necessary information from the journal either.
debugfs: inode_dump <3146113>
0000 a081 0000 8503 0000 e83a c15f e83a c15f .........:._.:._
0020 e83a c15f 0000 0000 7200 0100 0800 0000 .:._....r.......
0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
0060 0000 0000 0000 0000 0100 0000 e6eb c000 ................
0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
0140 0000 0000 92d0 2cf5 0000 0000 0000 0000 ......,.........
0160 0000 0000 0000 0000 0000 0000 6fb2 0000 ............o...
0200 2000 e3fb 208a 515b 7c65 5d5a 7c65 5d5a ... .Q[|e]Z|e]Z
0220 e83a c15f 7c65 5d5a 0000 0000 0000 0000 .:._|e]Z........
0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
However, if you’re in a similar situation – the ext4magic how-tos are really helpful and worth a try.
There is this one other approach to file recovery that is often recommended on the internet, usually for “small text files”: Just grep your whole disk for known parts of its contents! So why wouldn’t this work on larger non-text files as well?
The first problem is obviously what to grep for. The only thing I knew about the missing file, apart from its rough size, was that it’s an FLV video file. Luckily, all FLV files that contain video start with the byte sequence FLV\x01\x05. So let’s search our 2 TB disk for that byte sequence and print out the byte offset of all occurrences!
cat /dev/md2 \
  | pv -s 1888127576000 \
  | grep -P --byte-offset --text 'FLV\x01\x05' \
  | tee -a /mnt/storagebox/grep-log.txt
The pv command with the (rough) total size of the disk is optional, but gives you a nice progress bar. Overall, this took a little over 6 hours on our server.
grep works line-based, which in a binary file means “any byte sequence between two ASCII line breaks”. The log file therefore contained lots of lines like this:
184473878409:<some binary data>FLV<some binary data>
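Note that without -o, grep’s --byte-offset reports the offset of the start of the matching “line”, not of the match itself, so the exact header position still has to be computed from each log line. A minimal Python sketch of that step, under the assumption that the log has exactly the `offset:data` layout shown above; the function and constant names are made up for illustration:

```python
FLV_MAGIC = b"FLV\x01\x05"  # header bytes searched for with grep


def header_offsets(log_bytes):
    """Return the absolute disk offset of every FLV header found in a grep log.

    Each log line is expected to look like b"<line start offset>:<binary data>".
    """
    offsets = []
    for line in log_bytes.split(b"\n"):
        prefix, sep, data = line.partition(b":")
        if not sep or not prefix.isdigit():
            continue  # skip empty or unexpected lines
        line_start = int(prefix.decode("ascii"))
        # A single "line" may contain more than one header.
        pos = data.find(FLV_MAGIC)
        while pos != -1:
            offsets.append(line_start + pos)
            pos = data.find(FLV_MAGIC, pos + 1)
    return offsets
```

Reading the log file in binary mode (`open(path, "rb").read()`) and passing the result to `header_offsets` would give the list of candidate offsets.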
In total, the search found 126 FLV file headers on our disk. This was pretty reassuring, since we had 122 FLV files still known to the file system – so there are at least four FLV byte sequences without a filename!
# find /mnt/disk/var/recordings/ -name '*.flv' -not -empty -ls | wc -l
122
Now, I needed to find out which of the 126 byte sequences did not have a filename. Since I really didn’t want to spend all weekend with a deep-dive into the ext4 disk layout, I went for an easier solution: For every file still known in the file system, I computed a hash of the first 500 kilobytes of the file:
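A minimal sketch of this step in Python, not the script actually used: SHA-256, the interpretation of “500 kilobytes” as 500 × 1024 bytes, and the function names are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path

PREFIX_LEN = 500 * 1024  # assumed meaning of "500 kilobytes"


def prefix_hash(path, length=PREFIX_LEN):
    """Hash the first `length` bytes of a file (or the whole file if shorter)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(length)).hexdigest()


def known_prefix_hashes(directory, length=PREFIX_LEN):
    """Map each prefix hash to the names of the non-empty .flv files sharing it."""
    hashes = {}
    for path in sorted(Path(directory).glob("*.flv")):
        if path.stat().st_size == 0:
            continue  # mirror the `-not -empty` filter used with find
        hashes.setdefault(prefix_hash(path, length), []).append(path.name)
    return hashes
```

Keeping a list of file names per hash makes collisions between different files visible instead of silently overwriting one entry with another.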
Interestingly, two files from completely different customers shared the same hash of the first 500 kilobytes. I haven’t tested it yet, but my theory is that those were streams that just did not contain any audio or video in their first minutes, but only empty frames. However, since I knew this wasn’t the case for my missing file, I felt confident in proceeding with this approach.
Next, I computed the same hash for every byte offset found by grep and compared it to the hashes found in the previous step:
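A minimal Python sketch of this comparison, again not the original script. It assumes that the header offsets and the set of hashes of known files are already available as Python values, and that SHA-256 over 500 × 1024 bytes is the hash in question; `unmatched_offsets` is a hypothetical helper name.

```python
import hashlib

PREFIX_LEN = 500 * 1024  # assumed meaning of "500 kilobytes"


def unmatched_offsets(device_path, offsets, known_hashes, length=PREFIX_LEN):
    """Return the offsets whose following `length` bytes match no known file."""
    unmatched = []
    with open(device_path, "rb") as dev:
        for offset in offsets:
            dev.seek(offset)
            digest = hashlib.sha256(dev.read(length)).hexdigest()
            if digest not in known_hashes:
                unmatched.append(offset)
    return unmatched
```

A call such as `unmatched_offsets("/dev/md2", offsets, known_hashes)` would then list the candidate positions of files that no longer have a name.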
This produced 5 byte offsets with a checksum that matched none of the known files – exactly what I expected: four that really did not correspond to a file, and one corresponding to a file smaller than 500 kilobytes and therefore with a different hash.
Now, all that was left to do was writing out the byte sequences of (at least) 1.6 GB starting at the five possible byte offsets. Just to be safe, I exported 1.8 GB of each:
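A minimal sketch of such an export in Python, not the script actually used. The chunk size, the reading of 1.8 GB as 1.8 × 10^9 bytes, and the name `export_range` are assumptions; tqdm is imported only if it is installed.

```python
try:
    from tqdm import tqdm  # optional third-party progress bar
except ImportError:
    tqdm = None

EXPORT_LEN = 1800 * 1000 * 1000  # 1.8 GB, assumed to mean 1.8e9 bytes
CHUNK = 4 * 1024 * 1024  # copy in 4 MiB pieces (arbitrary choice)


def export_range(device_path, offset, out_path, length=EXPORT_LEN, chunk=CHUNK):
    """Copy up to `length` bytes starting at `offset` into `out_path`.

    Returns the number of bytes actually written, which is smaller than
    `length` if the end of the device is reached first.
    """
    bar = tqdm(total=length, unit="B", unit_scale=True) if tqdm else None
    written = 0
    with open(device_path, "rb") as dev, open(out_path, "wb") as out:
        dev.seek(offset)
        while written < length:
            data = dev.read(min(chunk, length - written))
            if not data:
                break  # end of device
            out.write(data)
            written += len(data)
            if bar is not None:
                bar.update(len(data))
    if bar is not None:
        bar.close()
    return written
```

Calling this once per candidate offset, with the output path pointing at storage outside the affected disk, avoids writing anything to the file system being recovered.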
(The usage of tqdm is optional, but gives you a nice progress bar.)
I then downloaded the five files, and indeed, the one with the highest position on disk contained the video file I accidentally deleted. Except for some very minor corruption of less than a second somewhere in the video, the video was fully recovered. Phew.
In the long term, we’ll of course work on preventing this from possibly happening again. Leaving very specific solutions like alias mv='mv -i' aside, we’ll obviously re-evaluate whether we need to create separate backups of this data if our customers begin to rely on it more than we intended, and looking at possible video-on-demand features coming to Venueless, we’ll at some point create a fully automated video processing pipeline that removes the manual and error-prone steps from this process.
The discussion on this on Twitter, Mastodon and Hacker News yielded a few points that might be good to know if you in fact ever are in a similar situation:
There are tools like foremost, PhotoRec and Scalpel that are specialized to find files in situations like this and are probably better, especially with more complex file formats. These approaches did not turn up in my Google search since I did not know “data carving” was what I was looking for.
People raised the issue of fragmentation of files on disk. This was my greatest fear in all this – if the file was (significantly) fragmented, I would have had no chance with this approach. So I kept my fingers crossed and got lucky, except for that small corruption in the middle of the video which might be the result of fragmentation.
Of course, I re-encoded the whole file with ffmpeg before I handed it out to the customer, if only to ensure I’m not sending them bytes from my disk at the end of the video that might contain data that’s nobody’s business.