decide architecture for integration with NLE (Lumiera)
Open, Needs Triage · Public

Description

consider the options we have to integrate OpenCine with an external editing system (here: Lumiera)

  1. not integrating at all
  2. command with pipeline in/out ("the unix way")
  3. IPC -- OpenCine as a service
  4. plug-in
  5. shared OpenCine library
  6. source code include

decide on best fit, or maybe on several, with prioritisation.

Ichthyo created this task. Nov 5 2015, 10:42 PM
Ichthyo updated the task description. (Show Details)
Ichthyo raised the priority of this task from to Needs Triage.
Ichthyo assigned this task to BAndiT1983.
Ichthyo added a project: Open Cine.
Ichthyo added subscribers: Ichthyo, BAndiT1983.

My preference is a shared lib, as OC is already using OCcore for processing. Lumiera can be equipped with an OC plugin which uses the lib. One thing bothers me the most. I'm using Qt5 core features for disk IO (folder/file enumeration and similar) in OCcore module. Is it a problem for you to have QtCore as dependency for an OC adapter (at least for now)?

Ichthyo added a comment. Edited Nov 6 2015, 12:07 AM

fully agreed.

A shared lib is the best recommendation I can give for this task, since it generates the least amount of complexity at the interface.

One thing bothers me the most. I'm using Qt5 core features for disk IO (folder/file enumeration and similar) in OCcore module. Is it a problem for you to have QtCore as dependency for an OC adapter (at least for now)?

no, this is never a problem on the technical level. Because

  • you (OpenCine) decide what functionality to offer through this lib
  • you define the lifecycle protocol, i.e. what init / cleanup functions the user of the lib has to call and what services the host application has to offer to support processing (a sketch follows after this list)
  • you define the further (transitive) dependencies of your library. Anyone linking (dynamically) against that library needs the library installed and thus has to accept those dependencies
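To make the lifecycle point concrete, here is a minimal sketch of what such an entry/exit contract could look like; the names (oc_init, oc_shutdown, oc_abi_version) are purely illustrative assumptions, not an existing OpenCine API:

```cpp
// Hypothetical sketch of a library lifecycle protocol -- all names are
// illustrative, nothing here exists in OpenCine yet.
extern "C" {

struct OcContext;          // opaque, owned by the OpenCine library

// host calls this once before any processing; returns nullptr on failure
OcContext* oc_init();

// host calls this once when done, releasing all library resources
void oc_shutdown(OcContext* ctx);

// lets the host verify at runtime that the installed lib is compatible
int oc_abi_version();

}   // extern "C"
```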

So this just boils down to a question of design, which IMHO is not so urgent. As we've detailed in our private mail discussion, we (Lumiera) want to build a very performant, memory-mapping based file-I/O backend, and we want to schedule and optimise the I/O bandwidth during the rendering phase. We do not care for any regular IO operations happening outside of that scope, nor do we care for other threads or processes.

  • in a first draft version, we would be fine just calling one processing function. Bonus points if we can somehow guess the additional IO bandwidth such an OpenCine processing function would create on the system (because we would then just lower the IO bandwidth we are creating accordingly)
  • as a second step, we could improve the interface definition and modularise the resource usage even more. OpenCine would then e.g. define that it needs a function to load part X..Y of file ABC into a buffer, and Lumiera would offer this service (see the sketch below).
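Just to illustrate what that second step might look like, here is a rough sketch of a host-provided IO service; all names and signatures are assumptions for the sake of discussion, not an agreed interface:

```cpp
// Illustrative only: an IO service declared by OpenCine and implemented by the
// host (Lumiera). OpenCine never touches files itself, it just consumes buffers.
#include <cstddef>
#include <cstdint>
#include <string>

struct RawBuffer
{
    const uint8_t* data = nullptr;   // mapped or loaded bytes, read-only
    std::size_t    size = 0;
};

class HostIoService
{
public:
    virtual ~HostIoService() = default;

    // "load part X..Y of file ABC into a buffer" -- how (plain read, mmap,
    // cache) is entirely up to the host
    virtual RawBuffer loadRange(const std::string& path,
                                std::size_t offset,
                                std::size_t length) = 0;

    // host reclaims the buffer once OpenCine is done with it
    virtual void release(RawBuffer buffer) = 0;
};
```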

incidentally, while we're just at that topic: I have figured out quite well how DEB packaging works meanwhile. So I offer to help with that task, i.e. I can help with or take care of getting either just that lib or OpenCine as a whole packaged properly for Debian and derivatives (Ubuntu, Mint, SteamOS :-D ), and I'll show you how to keep that stuff manageable.

you can get into "dependency hell" with libraries. But the key trick is just to allow for a certain (limited) compatibility range, i.e. try not to rely on the absolute "bleeding edge" all the time. If you do so, then you're able to offload the daunting task of dependency management to the Linux distributors. These folks perform this task quite well, and this is a key strength of the Linux world...

DEB packaging will be an important point, so I appreciate your offer and will refer to it some time in the future. To automate the process I've planned to use Jenkins as a build system (local, for now). It would also accomplish a lot of other things, like performing automatic test runs and ensuring clean builds, because a developer machine is usually messy as hell with all the libs and other dependencies.

I will answer on the other topic as soon as I have reflected on it. But here are two things which I would use for memory-mapped IO:

  • (portable) Boost.Interprocess (just this header-only lib without other Boost clutter; not much of a fan of the lib as it's very heavy and hard to dissect)

or

  • (portable) memory mapping of Qt5.5 (QFileDevice if I remember correctly)

DEB packaging will be an important point, so I appreciate your offer and will refer to it some time in the future.

count on me for that. See T610

To automate the process I've planned to use Jenkins as a build system (local, for now).

Indeed, it works well for that task. It certainly helps if you're familiar with the Java/EE world, because then Jenkins mostly just feels natural. For this specific task, we should also have a look into Docker, which is well integrated with Jenkins. We should note one point though: Linux packaging is not really "continuous delivery", it is just a different (more conventional) philosophy. Creating a new package from each commit doesn't make sense, but doing frequent micro releases, e.g. once a week or even nightly, works well.

Ichthyo added a comment. Edited Nov 7 2015, 11:45 PM

I will answer on the other topic as soon as I have reflected on it. But here are two things which I would use for memory-mapped IO...

here I tend to disagree, from an architecture viewpoint.
Whatever OpenCine does internally is purely a local decision within the OC developer community. Personally, I'd recommend against picking a technology (here: memory mapping) before having investigated the pros and cons. Beware, the semantics of memory mapping are non-trivial and differ between platforms. On Linux, it is known to deliver superior performance, but I cannot judge whether its performance gain is worth the effort on other platforms. For a portable application I'd KISS (stick to plain flat standard I/O) and optimise later (but that's my opinion).

For the sake of integration in the rendering process, I see only two options:

  • either, OpenCine does the IO. Then we (Lumiera) don't really care about what technology is used and how.
  • or, the rendering host (here: Lumiera) does the IO and delivers data in a buffer to OpenCine. Then and only then, we (Lumiera) are responsible for the IO technology, and since we are primarily Linux, we will use memory mapping directly, i.e. the mmap() (POSIX) system call (a minimal sketch follows below).
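For illustration, a minimal sketch (Linux/POSIX only, error handling reduced to the bare minimum) of mapping a byte range of a file read-only; the helper name is made up:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// map [offset, offset+length) of the file read-only and return a pointer to it;
// a real implementation would also keep the mapping base and span for munmap()
const void* mapRange(const char* path, off_t offset, size_t length)
{
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    long   page    = ::sysconf(_SC_PAGE_SIZE);
    off_t  aligned = offset - (offset % page);            // mmap needs a page-aligned offset
    size_t span    = length + size_t(offset - aligned);   // extra bytes due to alignment

    void* mem = ::mmap(nullptr, span, PROT_READ, MAP_PRIVATE, fd, aligned);
    ::close(fd);                                           // mapping stays valid after close
    if (mem == MAP_FAILED) return nullptr;

    return static_cast<const char*>(mem) + (offset - aligned);
}
```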

I've played with Docker some time ago, but don't know how it helps to stay "portable" between Linux versions. How would it benefit Lumiera or OC?

Started to experiment with plain raw data and to reflect on the structure of IFrameProcessor. What granularity is on your mind for the interaction between OC and Lumiera? Say we have some folder which is full of CDNGs:

  • Does Lumiera ask OC to load files/frames and receive them through some defined structure together with metadata?
  • Or does it load the files itself and ask OC to de-Bayer and process them?
  • Or maybe one of the above, depending on the situation and format? Lumiera then decides what to do and which parts of OC it uses to accomplish the task.

What granularity is on your mind for the interaction between OC and Lumiera?

intuitively, I'd say the granularity is one frame.
Let's reconsider that.

  • to process more than one frame does not yield any benefits. You'd re-use the buffer for each frame, wouldn't you?
  • less than one frame would mean to "tile" the image. Indeed, we could be forced into tiling, once the size of a single frame becomes prohibitively large, so it would not fit into contiguous memory anymore.
    • What are typical size requirements for one frame of CDNG raw data?
    • DCI 4k is 4096 x 2160. If we assume three channels with floats (4 bytes), we get 12 bytes per pixel = 106168320 bytes ~ 101 MiB
  • Does Lumiera ask OC to load files/frames and receive them through some defined structure together with metadata?
  • Or does it load the files itself and ask OC to de-Bayer and process them?
  • Or maybe one of the above, depending on the situation and format? Lumiera then decides what to do and which parts of OC it uses to accomplish the task.

our primary concern is to avoid blocking. If possible and feasible, we'd try to separate IO-bound activities from computational activities. Maybe we'd also extend that to separate between CPU-dominated and GPU-dominated tasks. We will have separate priority queues for each of these, I'd guess even a queue per worker thread, plus an additional queue for background and administrative tasks. We will build guesses for each task's duration and schedule tasks just in time (working out the details of that will be tricky, but hey, that's where the fun begins...)

  • So it will make sense to have a distinct, separate call only for the processing part, (memory) buffer to buffer.
  • For the pure IO part, we will have a mechanism to get a part of a larger file (or even a full raw, if possible) mapped into a memory buffer.

Expanding on that, we'll probably need some routine for OC to make sense out of the raw data within that buffer. And we need some way to pass on metadata, correct?

...if there is some overarching state you need for processing a frame (or even multiple frames), we could go the old and proven "Handle / PImpl" route. As you probably know, you can build very nice opaque handle types on top of shared_ptr. On the API, we'd only expose the processing interface for the user. Thus, in order to do anything, the client needs to get such a handle instance from the OC library, which then, opaquely, points to the state data you need internally to keep track of things. And since the handle is ref-counting, clean-up works magically and is airtight.
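A minimal sketch of what such a handle could look like; everything here (names, the factory, the processing call) is a hypothetical illustration, not an existing OC interface:

```cpp
#include <memory>

// Opaque handle in the public header: the client sees only this type and the
// processing call; the actual state lives behind the ref-counted pimpl.
class ProcessingSession
{
public:
    // factory implemented inside the OC library; sets up whatever internal
    // state (LUTs, metadata, scratch buffers) frame processing needs
    static ProcessingSession open(/* e.g. clip metadata */);

    // pure number crunching: input buffer in, output buffer out
    void processFrame(const void* rawIn, void* rgbOut) const;

private:
    struct Impl;                    // defined only in the implementation file
    std::shared_ptr<Impl> pimpl_;   // ref-counting makes clean-up automatic

    explicit ProcessingSession(std::shared_ptr<Impl> p) : pimpl_(std::move(p)) {}
};
```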

Hi, and sorry for not answering; I was absent from home, but not from the project.

CDNG is a very flexible format, as it's more or less an extension of TIFF, so the size depends on resolution and color depth. Also, we don't have to work with such a large resolution all the time; for final preview and rendering maybe, but for a draft, half of it or less would be sufficient, like After Effects does it. OpenEXR is maybe another format that is appropriate for such things, but my experience with it narrows down to 3D renderings in Blender.

As you have more experience with color spaces and similar things, there are a couple of questions that have been bothering me for some days. How would Lumiera display 16bit per channel images (48bit ones)? What is the preferred storage format for image processing, integer or float? At the moment I'm trying to get the output through OpenGL, but it's a challenge in itself with all the texture formats, normalization, right shader data types etc. I'm trying to stay with integer as it lends itself better to bit manipulation than float.

My "Aha" story from a couple of days ago. I was driving home and thought about multithreaded processing and almost slapped myself on the forehead as i remembered my former processing attempt for Bayer images. I used multiple arrays, one of them was for RGGB (or similar patterns) and have done complex modulo calculations, awkward iteration and so on. But then i remembered that every thread could process a row of RG or GB (or every other combination), which would be controlled by a couple of parameters and write it to an output array. I see no problem as the threads would never write to the same position, at least if there is no mistake in the code. Performance-wise some benchmarks have to be done, also for possible OpenCL implementation. Sometimes a nice solution occurs out of a thin air. By the way, what's Lumiera's point of view on OpenCL or similar things? Maybe also on OpenMP?

One thing which requires processing over multiple frames that comes to my mind is https://en.wikipedia.org/wiki/Superresolution. Also maybe some sort of automatic brightness adjustment to match multiple clips.

Ichthyo added a comment. Edited Nov 17 2015, 10:08 PM

...and thought about multithreaded processing and almost slapped myself on the forehead as I remembered my former processing attempt for Bayer images. I used multiple arrays, one of them was for RGGB (or similar patterns) and have done complex modulo calculations, awkward iteration and so on. But then I remembered that every thread could process a row of RG or GB (or every other combination), which would be controlled by a couple of parameters and write it to an output array. I see no problem as the threads would never write to the same position, at least if there is no mistake in the code.

Is the image data you're referring to planar or interleaved?

Have you considered the effects due to cache management?

It does not matter if it's the same memory cell or not; what matters is that you are interleaving reads and/or writes from different cores to memory locations in close proximity. Many modern systems are cache-coherent NUMA, where the cache coherency was added to keep matters manageable for the average programmer. But ensuring this cache coherency comes with a significant performance price tag. That is to say, the cache management hardware has to do heavy work and inter-core communication to make the writes done by one core into its local cache line visible in the cache line of the other core.
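A generic illustration of the effect (not OpenCine code): if per-thread results happen to share a 64-byte cache line, the cores keep invalidating each other's caches; padding each slot to its own cache line avoids that.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;

// without the alignment, four adjacent uint64_t counters would share one cache
// line, and four cores updating them would fight over it ("false sharing")
struct alignas(kCacheLine) PerThreadSlot
{
    uint64_t value = 0;             // each slot now owns a full cache line
};

PerThreadSlot slots[4];             // one per worker thread
```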

Next thing to consider is lock contention.

A prerequisite of your approach is that prior to starting those two threads to work on interleaved rows, you have to get the raw data into memory. And afterwards, you have to get the combined processed data out. Which effectively means you have to coordinate the work of the two threads. Not only does any kind of synchronisation create overhead; worse still, you get some probability that one core finishes earlier and either idles, or the kernel decides to migrate another working thread to that core, which is a bummer.

Figuratively speaking, you prevent your cores from "getting into flow".

And last but not least, are you sure the whole operation is CPU bound, not IO bound? If your whole throughput is limited by the long-term sustainable IO bandwidth, then there is no point in any CPU (or GPU/shader based) parallelisation...

Now, in the end, these conditions largely depend on external circumstances: as said, whether your image storage format is planar or interleaved, whether you're doing a preview render or a final full-resolution render, or the actual hardware available. So it's hard to optimise right, if at all. And even harder to prove that your optimisations actually result in a performance gain in the average case.

That is why I personally would always try to stick to the generic wisdom that in any flow-based computing it is important that parallel flows are able to move past each other. That is, practically speaking: let a single thread work on a single frame, then just walk away and pick the next job available, working on another memory location.

That is the rationale why, as I've explained, from a Lumiera POV we'd like to get a simple library function that does just the number crunching (single-threaded, no IO, no locks), and take care of getting the buffer contents in and out as a separate task.

One thing which requires processing over multiple frames that comes to my mind is https://en.wikipedia.org/wiki/Superresolution.

yes, good point.
Another obvious case is image stabilisation and motion tracking.
This kind of stuff might get nasty if a single frame is 100MiB in memory -- yet still I'd be reluctant to care too much for such special cases

Ichthyo added a comment. Edited Nov 17 2015, 10:44 PM

... there are a couple of questions that have been bothering me for some days.

  • How would Lumiera display 16bit per channel images (48bit ones)?

For display, the bottom line is you need to deliver a format your graphics hardware can cope with. So it depends (unfortunately):

  • the vast majority of displays is just 8bit per channel
  • some more professional hardware has 10bit, expect more to come in future

So in all these cases, the application has to convert to the target colour space and then dither down.
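A minimal sketch of that "dither down" step, reducing 16-bit values to 8 bit with a little noise before truncation so gradients don't band; the simple uniform noise is just an illustrative choice, real pipelines would use something better (e.g. error diffusion):

```cpp
#include <cstdint>
#include <random>

uint8_t ditherTo8bit(uint16_t v, std::mt19937& rng)
{
    std::uniform_int_distribution<int> noise(0, 255);    // less than one LSB of the 8-bit result
    int dithered = int(v) + noise(rng);
    if (dithered > 65535) dithered = 65535;               // clamp to the 16-bit range
    return uint8_t(dithered >> 8);                        // keep the top 8 bits
}
```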

But, what is increasingly common is hardware that

  • offers support for colour management (profiles)
  • offers to accept 16bit per channel

With such hardware, you're at the mercy of the internal (and typically not disclosed) implementation, so the resulting quality might be poor, compared with what can be achieved when doing it properly in software (more so, since the resolution for display is typically reduced, so most of the pressure is out anyway)

  • What is the preferred storage format for image processing, integer or float?

Again, it depends (sorry for that).
The only rule is that you always need headroom, if you want precision

  • all the popular computer video formats are significantly less than 8 bit; they are heavily quantised in the colour domain. So here the challenge is to avoid re-quantisation artefacts, i.e. banding and moiré.
  • if your target is nominally 16bit, but you know that all available displays will just show 10 or 12bit, you can get away with calculations in 16bit integer
  • the moment a user really wants to use the full resolution from shadows to highlights, from pure gray to excessively saturated colours with fine and smooth gradations, plus the raw material does provide this full bandwidth, you need more headroom, i.e. you need floats or 32bit integers

The latter case is relevant for any use of HDR imaging, or HDR lighting combined with 3D rendering, or when the goal is to prevent the "digital look" in highlights and really produce clean real world photography without excess lighting on set, or when some kind of colour masking is required (blue screen, green screen). For these cases, again chances are that you need float or 32bit.

In the end it's a difficult decision. See, we're in the OpenSource world. The grand majority of folks there are just happy with an 8bit GIMP and data-reduced MP3 and video. On the other hand, people accustomed to a "professional" working style tend to book an excess of bandwidth just to be sure they don't have to care, or for fear of running into a dead end later down the project. Thus, if a system does not offer at least solid 16 bit (which means an excess of capability in processing, i.e. use float or 32bit integers), chances are that such a system will be dismissed as an irrelevant toy immediately.

So, for us, it boils down to the question: do we want to be relevant for those premium use cases, and/or do we want to be technologically competitive or excellent?

By the way, what's Lumiera's point of view on OpenCL or similar things? Maybe also on OpenMP?

"intriguing", to use the words of Mr. Data...

However, our goal is to build a system that can combine the widely available technologies for video and sound processing. We are prepared to fill the gaps, if necessary. Our task is to handle buffers with various kinds of data, and to know which external processing function can deal with which kind of data, be it compressed, raw, texture, sound, MIDI or whatever, and how to instruct it to do so.

BAndiT1983 added a comment. Edited Nov 22 2015, 9:02 PM

Ah, a Trekkie. I thought you were more of an original-series guy than a Next Generation one. ;)

Current state by the way:

It takes 60-70ms to decode the raw array at the moment (source: https://www.apertus.org/axiom-beta-hello-world-article-may-2015). I took a simple route, single-threaded for now as you suggested, so optimizations are only rudimentary; more to come soon. The final image is displayed by OpenGL, which merges 3 arrays (R, G and B, 16bit, already preprocessed to fill missing pixels) into a texture using a simple shader. The slight lines result from simply doubling the pixels in the row; a good demosaicing method is still missing.

I could also grab the final image and use it somewhere else, e.g. return it to Lumiera for further processing. Headless OpenGL (off-screen processing) is something I'm considering for the pipeline, maybe as a switchable option for people with a graphics card that supports OpenGL 3.3.

Next step is to wrap the processing into an API.

Edit: Now it takes only 30-35ms, because I've used 2 threads, one for RG extraction and another for GB extraction (even and odd rows per thread). Plus it doubles the pixels for empty cells, so extraction alone takes even less time. Will do some sort of benchmark for multiple cases, so we can evaluate it further. Possibly the processing time can be reduced even more if 4 threads are used, 2 for the upper half and 2 for the lower half. Will also test with 8 threads, but just for fun, as it is relatively simple to split such a linear task. On the other hand, I'm not a multithreading expert and always wanted to get into that topic.

Ichthyo added a comment. Edited Nov 29 2015, 2:16 AM

It takes 60-70ms to decode the raw array at the moment (source: https://www.apertus.org/axiom-beta-hello-world-article-may-2015).

This frame is half size, correct?
hello_world_3F_10ms_sun95.png: PNG image data, 2048 x 1536, 16-bit/color RGB

When I load that into Gimp (2.9 beta) and save it as uncompressed 16bit TIF, I get 18 MiB; the raw example frame is also 18 MiB, and this is also what you'd expect theoretically (3145728 pixels * 2 bytes * 3 channels).

So in the worst case, when the frame is white noise and thus the result does not compress at all, you have to get 36 MiB in and out for each frame. Please relate that to a typical USB 2.0 data rate of 35 MiB/s. Also I've seen lots of not-too-bad office-style PCs which make about 50 MiB/s of IO. Of course, if you care to set up a RAID with internal SATA disks, you can get much better, but then typically the size is limited.

What I am getting at is: if you get just 35 MiB in/out per second, you've got one full second of full-power multi-core CPU + GPU to burn, which is a lot. Even if your system makes 100 MiB/s IO stable long term, you just get 3 frames in/out per second. And this is half size; quadruple that for full 4k size.
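Spelling that arithmetic out (same assumptions as above: uncompressed half-size frame, input and output both counted):

```cpp
// back-of-the-envelope numbers, all rounded
constexpr double frameMiB   = 2048.0 * 1536 * 3 * 2 / (1024 * 1024);  // ~18 MiB per frame
constexpr double ioPerFrame = 2 * frameMiB;                           // in + out, ~36 MiB
constexpr double fpsAt35    = 35.0  / ioPerFrame;                     // ~1 frame/s at USB 2.0 speed
constexpr double fpsAt100   = 100.0 / ioPerFrame;                     // ~3 frames/s at 100 MiB/s
```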

See now why we're concentrating so much on the IO side of things? In such a situation, I wouldn't spend much effort on parallelising. And when I do, I'd just parallelise single frames :-D
That will buy us again several years of computing-power evolution, since the number of cores will probably rise faster than the IO bandwidth or the memory/cache access speed.

Anyway, I always enjoy playing with various algorithm parameters. It's the best way to get a feeling for the relations :)

BAndiT1983 added a comment. Edited Nov 29 2015, 10:51 AM

I've prepared a longer answer, but stashed it away in a text file for now, as I've noticed that more benchmarks for several parts are necessary.

My numbers are based purely on the processing routine and are done in Debug mode; they should just represent the change between 1 and 2 threads. The file itself is hello_world_3F_10ms_sun95.raw12.xz, which contains purely the 12bit Bayer data stream from the sensor at a resolution of 4096x3072. The first step was to convert to 16bit and then to extract the colors into separate arrays (channels), with just simple pixel doubling to fill the gaps of the Bayer data (poor man's de-Bayering); afterwards I use them as textures and combine them with a shader for the final image. The Release mode measurements are faster of course, if I remember correctly 2x or even 3x faster.

Currently I'm testing with an SSD, and this results in 3-4ms to get the file content to RAM. Notice that it's already a file with just image data, so other formats like OpenEXR or CinemaDNG will take longer to process. But as I've already mentioned earlier, one can cheat here and, in the case of CDNG, just load the image data from the same position in the following files of the sequence, as it is usually the same. I don't take file-dependent data, like some curve, into account at the moment.

All in all, I get something around 30-40 seconds for file loading and color extraction (2 threads) without interpolation on an SSD in Release mode. HDD tests will follow soon, and I will try to give you a comparison table for Debug/Release mode and other things which have to be taken into account.

Edit: By the way, my goal is to give you the fastest way to get the frames processed, which can be further accelerated by using OpenMP (tried this already, good results), OpenCL or SIMD (SSE2/3/4, AVX etc.). The throughput from disk (HDD, SSD, USB) to RAM is another topic. Lumiera should only make a request, by supplying the raw image data or asking to load the file (depends on the case), and OC delivers some buffers (separate channels or merged into one interleaved array) as fast as possible. We should experiment at the moment, as I'm still trying to figure out how the API should be structured. Will commit ProcessingTest in a couple of days for review.

troy_s added a subscriber: troy_s. Dec 29 2015, 9:26 PM

I sincerely hope both of you are considering an offline pipeline while theorizing. Offline to picture lock. Focus on that.

All of the raw loading and such is more the work for a conforming tool, not an NLE. See Hiero for example.

It's one of the goals on the list. I haven't forgotten our previous discussions.

...while I'd like to add that this whole "offline" / "online" distinction seems to be becoming more and more an obsolete style of thinking, in a similar way as the distinction between "editing" and "compositing", or "editing" and "vfx", became more and more blurry and artificial and beside the point.

If you consider the abilities of today's average hardware and software, it is of much greater concern that the traditional "EDL" interface becomes way too limiting and inadequate. In fact, this already creates a dangerous drag towards all-inclusive systems, which effectively lock you into a single system the moment you start thinking about montage beyond just edits, frame numbers and transitions.

If we want to retain some level of collaboration between several, specialised systems (and I sincerely believe this is something we should strive at), then we need a way more elaborate exchange and mapping of metadata between the involved systems (here I use the term "metadata" in a wide sense, covering anything that is not just the raw media data itself)

troy_s added a comment. Edited Dec 30 2015, 1:50 AM

It isn't obsolete at all. If you participated in a project you would understand exactly why it is divided up like that. Costs, time, and quality all need to orbit around such a breakdown. There is no such blurry distinction to anyone with experience in the field. If anything, the division is much clearer than it was twenty years ago.

You can't commence compositing before you have a concrete idea of what frames you are operating on. That means you aren't engaging in any work on any single pixels until you have a picture lock.

Further, there is no amount of horsepower that is going to be able to run an NLE (nor should it) and deal with even the most rudimentary composite, let alone deal with colour transforms and accuracy, or specific tools written to interact with a given shot of any complexity. It simply cannot work. See Blinn's Law.

So let's consider the abilities of today's hardware and software, shall we?

Let's take a simple key pull. Let's say you have six to ten regions including garbage mattes. Let's toss in some tracking. Let's toss in a CGI rendered object. Also consider that this is at a bare minimum 32 bit per channel scene referred float. Now tack on a view transform on top of that to do some work on a 709 display.

I can state beyond a shadow of a doubt that there is no NLE system on the face of this planet that can ever deal with this in realtime, no matter how many GPUs you throw at it. Let's remember this is a trivial example, with absolutely no degree of complexity. Even the above would drag any system to a halt, let alone when we factor in deep compositing.

Now let's also factor in that we have the exact same pipeline for grading, except with entirely different toolsets.

I'm relatively sure that every single Libre NLE project would be prudent to look at the above and realize that the reason they have all failed so miserably is precisely that they fail to see a clear outline of a typical post-production pipeline. I challenge anyone to prove that statement wrong. This is an extremely saddening concept given the abilities of many of the coders I have seen attempt to tackle this monumental concept and fail miserably.

So I'm sorry, but having been around imaging for about thirty years, and around Libre software for about fifteen, I'm going to go to the mat and say that anyone that believes you can skirt around traditional pipelines is a delusional architecture astronaut.

EDLs (all of them including XML formats etc.) do not drag you into an all-in-one kitchen sink application. In fact when you consider that an EDL is a blueprint text file, they enable cross-tool interactions and enable collaboration. There have been many variants of EDLs, including very robust versions such as the earlier FCP XML format.

The larger issue here is interchange, and in particular, interchange at a base level which is almost always EDLs simply because it is so well broken down.

aleb added a subscriber: aleb. Jul 17 2018, 5:20 PM