Backup Manager
Open, Normal, Public

Description

Dan noted that often you want to back up an entire magazine/drive, no matter what file structure is on it, and just create a 100% identical clone of it.
This should be one backup mode. It could be called "clone mode" or "1:1 backup"?

Another mode where the process is more interactive is this concept for a "Tabbed View Mode". Often you back up partly filled recording media, and after the backup is complete you decide not to erase the media as there is still space left. The next time you back up, you have to figure out what you already backed up and what is new. This mode is designed to help here:


I've already considered having a one-way sync. OC would check what's already available at the backup location, verify checksums to be really sure about the data, and then skip the files/items or at least ask the user what to do. There is already an "Overwrite" check box which is meant to bypass this check (not implemented yet).

Clip recognition would come into play at the latest when you are in the "Manage" mode, which would represent some sort of library, like RawTherapee/Darktable, letting you import folders or clips for the current grading/post-processing session. This would make it possible to switch quickly between different clips. There we could use data filters to recognize things automatically and to simplify the process for the user.

Any ideas what the tabbed mode for backup should show at the moment, instead of "Clip"? I considered some sort of task list.

Example:

  • Gathering drive information
  • Creating folder structure
  • Copying files (show current file name and checksum state)
  • Checksum (this could also be done on the fly when copying files)

Each of these tasks would be accompanied by a progress bar, and one general progress bar would be at the bottom of the progress dialog.
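A minimal sketch, assuming Qt (which OC already uses), of how such a dialog could be laid out; the class name and task strings are only illustrative, not actual OC code:

```cpp
// Hypothetical sketch, not actual OC code: a progress dialog with one bar per
// backup task and one general progress bar at the bottom.
#include <QDialog>
#include <QLabel>
#include <QList>
#include <QProgressBar>
#include <QStringList>
#include <QVBoxLayout>

class BackupProgressDialog : public QDialog
{
public:
    explicit BackupProgressDialog(QWidget* parent = nullptr) : QDialog(parent)
    {
        auto* layout = new QVBoxLayout(this);

        const QStringList tasks = { "Gathering drive information",
                                    "Creating folder structure",
                                    "Copying files",
                                    "Checksum" };

        for (const QString& task : tasks)
        {
            layout->addWidget(new QLabel(task, this));
            auto* bar = new QProgressBar(this);   // one progress bar per task
            taskBars.append(bar);
            layout->addWidget(bar);
        }

        layout->addWidget(new QLabel("Overall progress", this));
        overallBar = new QProgressBar(this);      // general bar at the bottom
        layout->addWidget(overallBar);
    }

    void setTaskProgress(int task, int percent) { taskBars[task]->setValue(percent); }
    void setOverallProgress(int percent)        { overallBar->setValue(percent); }

private:
    QList<QProgressBar*> taskBars;
    QProgressBar* overallBar = nullptr;
};
```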

Did a little PDF export test for the later transfer report feature. It was done quickly, so I don't know about limitations so far, but what I've tried was sufficient for reports. One thing which is not clear to me yet is the resolution of the output and how to calculate how many items fit on one page in order to generate the next pages.
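For reference, a quick sketch of the kind of calculation meant here, assuming QPdfWriter is used (file name, margins and layout are only placeholders): the output is drawn in device pixels at the chosen resolution, so the number of items per page follows from the page height divided by the item height.

```cpp
// Rough sketch, not the actual test code: write a simple line-based report
// with QPdfWriter and start a new page whenever the current one is full.
#include <QPageSize>
#include <QPainter>
#include <QPdfWriter>
#include <QStringList>

void writeReport(const QStringList& lines)
{
    QPdfWriter writer("transfer_report.pdf");     // assumed output file name
    writer.setPageSize(QPageSize(QPageSize::A4));
    writer.setResolution(300);                    // 300 dpi output

    QPainter painter(&writer);
    const int lineHeight   = painter.fontMetrics().height() + 20;
    const int linesPerPage = painter.viewport().height() / lineHeight;

    int y = 0;
    for (int i = 0; i < lines.size(); ++i)
    {
        if (i > 0 && i % linesPerPage == 0)       // page is full, start the next one
        {
            writer.newPage();
            y = 0;
        }
        painter.drawText(0, y += lineHeight, lines.at(i));
    }
}
```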

dan45 added a comment. May 26 2015, 9:33 PM

PDF report example to compare to: note the thumbnail down the left, leaving the right side for all the metadata and other shot data, dates etc.

BAndiT1983 added a comment. Edited May 26 2015, 9:43 PM

My test was only related to Qt and PDF output; I've grabbed just the thumbnail view to get some output. The real one will be implemented soon, but other tasks like the extended progress dialog and checksums have higher priority.

Nevertheless, I grabbed your TIFF file to have some example. I still have to go through our e-mail discussions for the other examples you've sent.

As there haven't been many updates lately, here are the current tasks which I am (slowly) working on.

Tasks are ordered by their priority, and when one is finished the next one will be undertaken to maintain some common thread.

  • Give the progress dialog a new layout to represent ongoing backup tasks
  • Extend the backup class with checksum validation and other required functionality (a small sketch follows after this list)
  • Proceed to implement the DNG decoder, which would open CinemaDNG files in a fast way (skipping the check of every file header to retrieve image data)
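For the checksum item above, a minimal sketch of what file-level MD5 validation could look like with Qt; the function name is just an example, not existing OC code:

```cpp
// Minimal sketch, not actual OC code: compute the MD5 checksum of a file so
// that source and target copies can be compared after the backup.
#include <QByteArray>
#include <QCryptographicHash>
#include <QFile>

QByteArray fileChecksum(const QString& path)
{
    QFile file(path);
    if (!file.open(QIODevice::ReadOnly))
        return QByteArray();

    QCryptographicHash hash(QCryptographicHash::Md5);
    while (!file.atEnd())
        hash.addData(file.read(1 << 20));         // feed the hash in 1 MiB blocks

    return hash.result().toHex();
}
```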
troy_s added a subscriber: troy_s. Jul 6 2015, 3:33 AM

I'd encourage this to be a standalone unit as it is with almost every other camera manufacturer.

Basic entry point might be an rsync-ish wrapper with support for SHA and MD5 checks.

Even Arri suggests using rsync as a base as per their data management page.

A few points that I see as important:

  • 1:1 copy as mentioned earlier. (Maybe with an option to exclude .Spotlight, .Trash etc., which a Mac sadly creates directly when connecting a drive for the first time ...)
  • Just have a list of targets where you can add anything from one to multiple (not only two) targets
  • Parallel copy: Read from the source to RAM and write to all target drives in parallel to speed up the process
  • Parallel verification: Some simple tools write a checksum file (or hold it in RAM) for the source and then check the targets. Skip this and do checksums of the source and all targets at once, then compare them.
  • A project mode where you can set the target drives and directory scheme for this project, which can be loaded as a starting point
  • A variable-based directory scheme so that you can have something like "/media/irieger/MY_EXTERNAL_DRIVE/OperationApertus/Footage/%Y-%m-%d/%camNo" (see the sketch after this list)
  • An option to add another transfer to a queue that will be processed when the current one is finished
  • An option to start a parallel file transfer (what I'm thinking of here is when you need to copy a small card from the sound guy who wants to finish in the evening, and you don't want to make him wait until all the RAW footage is backed up, or something similar)
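A rough sketch of how the variable-based directory scheme from the list could be expanded; the variable names follow the example path and none of this is an existing OC feature:

```cpp
// Hypothetical sketch: expand date and camera variables in a directory scheme.
#include <QDate>
#include <QString>

QString expandScheme(QString scheme, const QString& camNo)
{
    const QDate today = QDate::currentDate();

    scheme.replace("%Y", today.toString("yyyy"));
    scheme.replace("%m", today.toString("MM"));
    scheme.replace("%d", today.toString("dd"));
    scheme.replace("%camNo", camNo);

    return scheme;
}

// expandScheme(".../Footage/%Y-%m-%d/%camNo", "A") -> ".../Footage/2015-07-06/A"
```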

I know this is a bunch of points, but I just wanted to offer what I have in mind that my media manager would need. Got creative after writing the first points...

I'd do a clear interface separation between the Backup Manager part and the later development part.

PS: Maybe I'll find some time in the future to support you. I have written a small Python file copy tool that does everything in parallel and have planned to give it a proper GUI, but haven't found the time yet.

Also very helpful and valid points. The decoupling of Backup is progressing and I'll move parts of the current layout over to the new OCBackup project (not committed yet).

New structure, even more decoupling, less monolithic:

  • OCCore - contains the main things for OC, like an rsync wrapper, checksum validation or wrappers for image decoder libs etc.; if you have a suggestion for what else it should contain, feel free
  • OCui - at the moment it contains a GUIApplication class (derived from QApplication) to give some base for graphical OC tools; it also has some sort of template window and a global OC stylesheet to let all the tools look similar to each other (a rough sketch follows below)
  • OCBackup - backup module, derives from GUIApplication, used as a playground for testing the new structure at the moment

I hope that porting to Windows will move faster this time, as many dependencies will be better hidden so they don't affect derived projects/modules.
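A minimal sketch of what the GUIApplication base described in the OCui item above might look like; the stylesheet resource path and details are assumptions, not the real code:

```cpp
// Rough sketch, not the actual OCui code: a QApplication subclass that loads a
// global stylesheet so all OC tools look similar.
#include <QApplication>
#include <QFile>

class GUIApplication : public QApplication
{
public:
    GUIApplication(int& argc, char** argv) : QApplication(argc, argv)
    {
        QFile styleFile(":/OpenCine.qss");        // assumed resource path
        if (styleFile.open(QIODevice::ReadOnly))
            setStyleSheet(QString::fromUtf8(styleFile.readAll()));
    }
};
```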

Cool, I will keep an eye on your project and maybe find some time to help. Would be cool to finally have a nice data management tool. I have tried a few of the commercial ones and none of them really satisfied me, so there is some space to fill.

BAndiT1983 added a comment. Edited Jul 16 2015, 11:27 PM

I know that everyone demands rsync or something similar, but e.g. librsync depends on Cygwin and I try to avoid too many dependencies. Still searching for alternative libs before rolling some sort of own implementation. OC doesn't need the network part of rsync, as Qt would be used to copy (or plain C++11), which should be portable. Also, OC is constrained to local drives (built-in or removable ones) for now. Later OC could send data over the local network (not the internet, at least not for now), but this should be similar to the "local" workflow inside the application.

How does your Python script work? How is the parallel copying accomplished? Besides, most data in C++ is usually processed through pointers and stored in RAM, so there is no big problem there. Also, copying can be done in 2 lines; I haven't tested it with big files yet. But being sure that input and output are equal is another part of the story.

My first idea for sync was to get the MD5 of the source file, copy, get the MD5 of the target file, compare, and maybe recopy if required. All the example files I saw, mainly folders full of CDNG, were a couple of megabytes in size (maybe up to 20, if that). Recopying some of the files shouldn't be a real problem. But more tests are required to verify this.

> I know that everyone demands rsync or something similar, but e.g. librsync depends on Cygwin and I try to avoid too many dependencies. Still searching for alternative libs before rolling some sort of own implementation. OC doesn't need the network part of rsync, as Qt would be used to copy (or plain C++11), which should be portable. Also, OC is constrained to local drives (built-in or removable ones) for now. Later OC could send data over the local network (not the internet, at least not for now), but this should be similar to the "local" workflow inside the application.

I don't see rsync as a main need. Mostly it is a fast local transfer you need on set, not the kind of network transfer; or you use a network file system and can use the tool as if you were working locally.

> How does your Python script work? How is the parallel copying accomplished? Besides, most data in C++ is usually processed through pointers and stored in RAM, so there is no big problem there. Also, copying can be done in 2 lines; I haven't tested it with big files yet. But being sure that input and output are equal is another part of the story.

Very quick and dirty solution. I've sent you a PM about it because I don't feel like letting this shitty script out into the wild ...

> My first idea for sync was to get the MD5 of the source file, copy, get the MD5 of the target file, compare, and maybe recopy if required. All the example files I saw, mainly folders full of CDNG, were a couple of megabytes in size (maybe up to 20, if that). Recopying some of the files shouldn't be a real problem. But more tests are required to verify this.

That is what a small Python script of a friend used to do: build MD5 hashes, copy and then verify. But my concern with it is that if you have fast RAIDs and an SSD source (assuming all are about the same speed in read/write, and copying and hashing take the same time), this approach takes 50% more time than just copying everything, building hashes in parallel for source and destinations, and comparing them afterwards or in parallel.

As I work with the Blackmagic Production Camera mostly in 4K raw, the cam fills a 480 GB SSD in about 20-30 minutes of recording time. Copying and verifying takes its time even with two RAIDs as destinations, so it's good to optimize such a tool to work well with the masses of data raw shooting brings.

RainerFritz added a subscriber: RainerFritz. Edited Jul 27 2015, 12:38 PM

From my perspective the copy and verification task should run as read once, write to multiple destinations in parallel. I played with Python a little bit and read the files in binary mode, so the source checksums could be generated in parallel to the copy task, which is essential.
The slowest destination will then set the copy speed... I thought about a dynamic buffer size per destination write speed, or an SSD as an additional buffer to compensate for too-big differences in storage speed. The buffer is then written in parallel to all destinations. When the files are read back for destination checksum generation, it would be fine if there were a choice of how many files to process in parallel. With picture sequences it would be nice to process at least 4 files in parallel per backup. If there is a container format in the future, processing them one after another could be faster. I have used md5deep/hashdeep very often, which is very fast.
After verification, if there are mismatches on checksums or missing files etc., it should ask to recopy/reverify those files.
A copy report in PDF format would also be nice, with or without thumbnails from the beginning, middle and end of the clip. I uploaded a sample here.
A text file with all checksums should be stored with every backup at the destinations, together with the report.
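A small sketch of the read-once, write-to-many idea described above; buffer size, threading model and error handling are simplified assumptions, not a finished implementation:

```cpp
// Hypothetical sketch: read the source in blocks and write every block to all
// destinations in parallel; the source hash could be fed from the same buffer.
#include <fstream>
#include <string>
#include <thread>
#include <vector>

bool copyToTargets(const std::string& source, const std::vector<std::string>& targets)
{
    std::ifstream in(source, std::ios::binary);
    if (!in)
        return false;

    std::vector<std::ofstream> outs;
    for (const auto& t : targets)
        outs.emplace_back(t, std::ios::binary);

    std::vector<char> buffer(1 << 20);                        // 1 MiB read block
    while (in.read(buffer.data(), static_cast<std::streamsize>(buffer.size())) || in.gcount() > 0)
    {
        const std::streamsize n = in.gcount();

        // One writer per destination; the slowest drive sets the pace.
        std::vector<std::thread> writers;
        for (auto& out : outs)
            writers.emplace_back([&out, &buffer, n] { out.write(buffer.data(), n); });

        // The source checksum could be updated here from the same buffer,
        // e.g. with QCryptographicHash::addData(buffer.data(), n).

        for (auto& w : writers)
            w.join();
    }

    for (auto& out : outs)
        if (!out)
            return false;                                     // at least one write failed
    return true;
}
```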

Hey Rainer,

While I like your concept, there is one thing I'd change: don't generate the checksum of the source on the copy read. Do another read for checksum generation, to reduce the risk of missing a random bit shift in the file reading chain. With my offload tool I once had a checksum error which was clearly coming from a bit shift in the read process. Yes, it was only once or twice in many terabytes of data, but it can ruin a maybe very expensive shot, so better safe than sorry. Most checksum errors I had were on the destination side, and they are rare, though.

BAndiT1983 added a comment. Edited Jul 27 2015, 12:59 PM

Just for my understanding, would you do it like this or in some other way:

  • Get all folders and files to be copied
  • (Filter out folders with . (dot) or similar ones)
  • Create folder structure at target location(s)
  • Copy files
  • Create checksums for source files
  • Create checksums for target files
  • If it failed for some files, give the user the possibility to recopy
  • If recopy was selected, re-verify these files

Have I forgotten something?
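One step the list implies but doesn't spell out is the actual comparison; a small sketch of how the recopy candidates could be collected, reusing a hypothetical fileChecksum() helper like the one sketched earlier:

```cpp
// Hypothetical sketch: compare source and target checksums and collect files
// that need a recopy; fileChecksum() is an assumed helper, not OC code.
#include <QByteArray>
#include <QString>
#include <QStringList>

QByteArray fileChecksum(const QString& path);                 // sketched earlier

QStringList findMismatchedFiles(const QStringList& sourceFiles,
                                const QStringList& targetFiles)
{
    QStringList recopyCandidates;

    for (int i = 0; i < sourceFiles.size(); ++i)
    {
        if (fileChecksum(sourceFiles.at(i)) != fileChecksum(targetFiles.at(i)))
            recopyCandidates.append(sourceFiles.at(i));       // ask the user about these
    }

    return recopyCandidates;
}
```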

Hi !

irieger:
For example, on a shooting day where you have to deal with, let's say, 1.5 TB of data, reading the source twice costs a lot of time. You also need to do an "optical" check of the material (hashes cannot see problems with the picture) and then process it, for example to offline material, as well. So you end up reading the source files on set at least three times; with no parallel processing in the copy process that would be a 4th time.
All copy programs I've used at work, including YoYotta, are doing this. Think of the fact that you can have a multi-camera show where you need to handle two or three times the data I mentioned above. It could be less, but I would highly suggest doing parallel processing in the copy task.

BAndiT:
Copying files and creating checksums on the source files in parallel would be my suggestion; everything else is like I meant.
At the end of the list could be "generate copy report" and "place checksum files/reports at destinations and an optionally selected folder".

> Hi!
>
> irieger:
> For example, on a shooting day where you have to deal with, let's say, 1.5 TB of data, reading the source twice costs a lot of time. You also need to do an "optical" check of the material (hashes cannot see problems with the picture) and then process it, for example to offline material, as well. So you end up reading the source files on set at least three times; with no parallel processing in the copy process that would be a 4th time.

Nope, my script reads twice. Once for copying (reading into a buffer and writing to all destinations in parallel), and once for checksumming.

> All copy programs I've used at work, including YoYotta, are doing this. Think of the fact that you can have a multi-camera show where you need to handle two or three times the data I mentioned above. It could be less, but I would highly suggest doing parallel processing in the copy task.

No problem; as I said when I first described my tool: read for copying, and then read for checksumming. With the camera SSDs being very fast today, you wouldn't really see a difference when reading again for checksumming while generating the checksums for the destinations in parallel, and then comparing them against the checksum list in RAM, which is very fast. As I said, I wouldn't rely on only one read from the source.

My problem on set was never the speed of the source, always the speed of the copy destinations. Always slower, at least when shooting raw. OK, an 8-bay RAID may change that, but when working with raw the camera SSDs are pretty fast anyway.

And on a multi-cam show I would imagine having two readers if the RAIDs are faster, so you could maybe have two jobs running in parallel?

Or leave the decision to the user in a config setting. For me, if you are doing checksums at all, it doesn't make sense to skip this additional step, which can run in parallel and doesn't hurt anyone.

YoYotta has the checksum read on the source as an option to verify data integrity, which sounds like a good solution to me.
Copying more than one job/file in parallel is not a good solution, because then your frames are not stored physically in sequence and you can get problems with read speed when playing them back, because they could be fragmented. Especially with 4K frames. I use a 12-bay SAS storage on set as main storage, which provides about 1200 MB/s read/write, and 6-bay Thunderbolt/USB3 Areca RAIDs which provide about 700 MB/s.
With picture sequences, as with ARRIRAW or DNG etc., you have a high load of I/O operations as that's a huge bunch of single files.
An option for in-camera MXF wrapping around the raw files would be very nice!

A time calculation:
You get a full 512 GB SSD with a real read speed of 350 MB/s, which is the actual speed of the SSDs used with the Gemini or Odyssey recorders from Convergent Design. md5deep hashing 4 files in parallel reads at about that full speed, and copy speed with cp goes up to about 400 MB/s. Checksum calculation on the fly reduces the speed, in my tests and in a lot of other copy programs, to about 230 MB/s.
Copy at 400 MB/s plus a separate checksum read at 350 MB/s: 1280 s + 1462 s = 2742 s ≈ 45 min.
Checksum on the fly at 230 MB/s: 2226 s ≈ 37 min (worst case).
There are now a few programs which manage checksumming at the speed of a plain copy, i.e. everything in one read at the source speed (about 1462 s ≈ 24 min), so the time saved can be nearly half, and trust me, that's a lot. Waiting 45 min or 24 min for the task to finish is a real difference. When the checksums are read back from the destinations, it should go as fast as possible. With md5deep I reach about 800 MB/s hashing speed on my 12-bay, which is not bad and should also be the goal for OpenCine. The backup task should be as fast and as secure as possible.

So it could be a good solution to read two files in parallel from the source and then write them one after another to the destination. Holding two raw frames in RAM should be no problem, to increase read speed if the destination storage is fast enough. As said, the slowest destination will set the write speed.

I was on a set with three Alexa XTs capturing RAW OpenGate, and trust me, backup speed is a real issue; it will be even more so when the AXIOM camera captures 4K uncompressed raw.
An option for lossless 2:1 or 3:1 in-camera compression would be nice, especially when DNG is used.

OK, I feel for you. Haven't had such a fast RAID yet. Copying one evening's worth of ARRIRAW skater footage (only OpenGate 75 fps and 120 fps 2.8K 16:9, no normal speed, all at maximum), about 1.5 TB, in a few hours to a 4-bay RAID5 Thunderbolt array was kind of hard (it was a fun project and no second target, except for slow USB3 single drives, was available).

And I was doing data wrangling on a short film lately, shot on my Blackmagic Production Cam (compressed DNG 4K for some days, which is about 24-30 minutes per 480 GB disk). I only had my RAID and a few single-drive 2.5" HDDs, but copying only to the RAID was no option. No fun either. But in these cases it was always the destination disks that were slow for me. And most of the time, while being a pain, slow copying was not a problem: have enough cards and you are good to go. So I'm voting for the option to either generate MD5 on the first read or do a second pass. Having this option hurts no one.

PS: 45 min for one SSD is fast, I don't get the problem ;-) In our student/low-budget area it would be cool to have such speeds ...

After some on-and-off OC development (the regular job causes the usual lack of time), here is the current state of the removable drive list:

I've enabled all drives instead of only removable ones to have something to show, because my card reader accepts only SD or Micro-SD at a time and USB drives are somehow recognized as local ones.

BAndiT1983 added a comment. Edited Sep 20 2015, 8:09 PM

References to find optimal block size for transfer:

Windows: http://stackoverflow.com/questions/3478610/how-do-you-determine-the-optimal-disk-io-block-size-on-win32
Linux: http://man7.org/linux/man-pages/man2/statfs.2.html

Edit 1: Had to add a usleep() of 500 ms, otherwise the wrong size is returned by statvfs(). Possibly the device is not fully mounted at the moment statvfs() is called. Temporary solution for now.
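For reference, a minimal Linux sketch of the statvfs() call in question; the mount point and the 500 ms delay are placeholders mirroring the workaround mentioned above, not the actual OC code:

```cpp
// Rough sketch: query the preferred I/O block size of a mounted filesystem.
#include <sys/statvfs.h>
#include <unistd.h>
#include <cstdio>

long preferredBlockSize(const char* mountPoint)
{
    usleep(500 * 1000);                // temporary workaround: wait until the device is fully mounted

    struct statvfs info;
    if (statvfs(mountPoint, &info) != 0)
        return -1;

    return static_cast<long>(info.f_bsize);   // filesystem's preferred transfer block size
}

int main()
{
    std::printf("Block size: %ld bytes\n", preferredBlockSize("/media/usbdrive"));   // assumed path
}
```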

BAndiT1983 added a comment. Edited Sep 23 2015, 4:48 PM

Current drive list state in Linux (adjusted visuals):

TODO:

  • Add pagination to speed up the thumbnail view (QML); possible performance impact when image loading is added
  • Add a pre-filter to recognize CDNG folders and present them as clips, otherwise show only supported image files
  • Add a LibRaw adapter as fallback; currently only TiffLoader is present for CDNG
d0 added a subscriber: d0. Feb 27 2017, 1:49 AM

Warning: In this comment I will state some experiences I made with data wrangling (as it is lovingly called in the industry) and shamelessly post some vague GUI inspirations. When I shot my last film on a Blackmagic Production 4K Camera, I had a lot of RAW DNG streams to wrangle; we recorded about 8 TB in 10 days, and those had to be backed up at least once. If you don't have a dedicated person on set to deal with the data, flexibility is key.

You want: to copy from one to one, one to many, many to one, many to many.
You don't want: to find out while being stressed on a film set that something is not possible, or that someone has to guard the computer for the whole next hour to click a box that pops up.

  • Has to work until everything is on the drive, without supervision (unless explicitly demanded: no popups, no stopping)
  • Every copy process should be stoppable and interruptible, without waiting minutes without feedback or shutting down the software because it won't let you eject your drive safely. This ejecting-and-resuming-later part should ideally be built into the software.
  • Sometimes you don't want to hash but copy quick and dirty, because you just want to check the grain or light or whatever. So there needs to be an option to copy single files as fast as possible
  • GUI has to be idiot-proof. Use colors, symbols, prompts (red: bad, green: good, orange: in progress). I once made a custom GUI in Python where you had to press a button for 4 seconds before it would do something. This is cool because it gives stressed-out people a few seconds to release the mouse button before they do something stupid. And it is IMHO better than a prompt which you just click away without thinking when you're stressed. There is an animated GIF of this (a small code sketch of the idea follows after this list):
  • Don't assume the things connected will work. Imagine the SSD reader has a broken cable and it connects and disconnects fast enough to not be visible. The key here is to help users identify such problems fast and definitively.
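A small Qt sketch of the press-and-hold idea from the list above; the class, signal name and the 4-second threshold are assumptions, not code from the thread:

```cpp
// Hypothetical sketch: a button that only triggers its action after being held
// down for at least 4 seconds, so a stressed user can still let go in time.
#include <QElapsedTimer>
#include <QMouseEvent>
#include <QPushButton>

class HoldButton : public QPushButton
{
    Q_OBJECT

public:
    explicit HoldButton(const QString& text, QWidget* parent = nullptr)
        : QPushButton(text, parent) {}

signals:
    void heldLongEnough();                        // emitted only after a 4 s press

protected:
    void mousePressEvent(QMouseEvent* event) override
    {
        pressTimer.start();                       // remember when the press began
        QPushButton::mousePressEvent(event);
    }

    void mouseReleaseEvent(QMouseEvent* event) override
    {
        if (pressTimer.isValid() && pressTimer.elapsed() >= 4000)
            emit heldLongEnough();                // only fire after a long press
        QPushButton::mouseReleaseEvent(event);
    }

private:
    QElapsedTimer pressTimer;
};
```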

A first mockup of an app I was working on that was precisely meant for this task is visible here:

While I did a basic draft of this using pyglet and Python (you had to drag the spline arrow into proximity of the drive), I would scrap the interface and keep the useful parts:

  • A display of remaining and recorded time on all connected drives. In my mockup it just assumes 4K RAW; this is easy for uncompressed data streams, but can get a lot harder with compressed ones. (A rough calculation sketch follows after this list.)
  • A display of the date and time when the whole thing will approximately finish. This is extremely useful to plan your day and your team's day : )
  • A simple color change to indicate at first glance which drives are full and which are empty (if done well, it would give you red if you can only record another 15 to 20 minutes).
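To illustrate the remaining-time display from the first bullet, a rough calculation sketch; resolution, bit depth and frame rate are assumed numbers, not values from the mockup:

```cpp
// Hypothetical sketch: estimate the remaining recording time on a drive for
// uncompressed 4K raw from the free space and an assumed data rate.
#include <cstdint>
#include <cstdio>

double remainingMinutes(std::uint64_t freeBytes)
{
    const double width     = 4096.0;
    const double height    = 2160.0;
    const double bitsPerPx = 12.0;                            // assumed raw bit depth
    const double fps       = 25.0;

    const double bytesPerSecond = width * height * bitsPerPx / 8.0 * fps;   // ~332 MB/s here
    return freeBytes / bytesPerSecond / 60.0;
}

int main()
{
    // A drive with 480 GB free would hold roughly 24 minutes at these settings.
    std::printf("%.1f minutes remaining\n", remainingMinutes(480ULL * 1000 * 1000 * 1000));
}
```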

For loading bars I always liked the graph-style thingy (see TeraCopy).