Exobiont Systems


Requirements for Efficient and Safe Data Marshaling

To motivate our design of a generalized, type-driven system for data transformation, we examine the problem of data marshaling across a wide range of application contexts and unify these contexts under two principal perspectives. The first is efficiency: our system should minimize unnecessary overhead and automatically balance trade-offs to enhance overall performance. The second, equally important, is the impact on software engineering: the system must enhance type safety while also increasing flexibility and adaptability with respect to the targeted data format. This is particularly valuable for supporting legacy systems, ensuring cross-language compatibility, and enabling compiler-assisted access to cross-domain optimizations (e.g., marshaling data to GPU kernels, employing data-oriented design, or leveraging compressed data structures).

[more clearly separate future work]

Our objective is to generalize diverse forms of marshaling — including IPC argument marshaling, network protocol marshaling, memory layout transformations, storage layout conversion, and even UI-related data formatting — into a unified model of data transformation. Not all use cases require serialization. While message-passing and storage often demand that data be serialized, in-process transformations or shared-memory interactions may rely on direct memory layouts. Therefore, our system must support both serialized and in-memory formats, enabling seamless and safe transformations between them. Let us now break down our goals of efficiency and software quality in detail:

(I) Minimized Marshaling Overhead

Marshaling overhead — the computational cost of converting between data representations — must be minimized. In cases involving legacy systems, file formats, or hardware interfaces, transformations can often be performed directly, without intermediary steps, as the target format is fixed. In such scenarios, the primary objective is to generate the most efficient transformation possible. In contrast, message-passing inter-process communication (IPC) requires selecting a wire format — which is inherently serialized. However, aside from being serialized, the encoding of this intermediary data is typically unconstrained, offering the flexibility to select an optimized format. If both endpoints can dynamically adapt their marshaling strategies, the system can choose the wire format at runtime to maximize performance. This dynamic adaptability stands in contrast to fixed wire formats, which may introduce unnecessary or redundant transformations (see section ??).
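The runtime selection described above can be sketched as a simple negotiation step. The format names and the `negotiate` helper below are illustrative assumptions, not part of any existing API:

```python
# Hypothetical sketch of runtime wire-format negotiation: each endpoint
# advertises the formats it supports, in order of preference (cheapest
# first), and the first common format is selected at connection time.

def negotiate(local, remote):
    """Return the first format in `local` preference order that the
    remote endpoint also supports, or None if there is no overlap."""
    remote_set = set(remote)
    for fmt in local:
        if fmt in remote_set:
            return fmt
    return None

# A fast endpoint sharing memory semantics could use the native layout,
# but a constrained peer that only understands serialized encodings
# forces a fallback to the packed wire format.
fast = ["native-layout", "packed", "json"]
constrained = ["packed", "json"]
print(negotiate(fast, constrained))  # packed
```

A fixed wire format corresponds to the degenerate case where both lists contain a single mandatory entry; the flexibility discussed above comes precisely from letting this choice vary per connection.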

(II) Minimized Boilerplate

Another key requirement is the reduction of boilerplate in application code. Data-centric software often involves repetitive code for tasks like serialization, validation, and transformation. Here our system cuts directly to the core by targeting the optimal internal representation at each communication endpoint, thus avoiding intermediate transformations and interface layers. Traditional interface definition languages (IDLs) generate language-specific interfaces, but these often require additional developer effort to map messages into internal application data structures. In contrast, our system’s specification language is designed to be precise enough to describe and directly target the intended data structures, eliminating the need for custom glue code. Beyond reducing boilerplate, the system promotes modularity by enabling reusable libraries of data transformation functions, guided by the type system. This helps developers structure and maintain their code while preserving strong type safety.
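As a minimal illustration of this idea (the field-spec format and helper names are our own invention, not the system's actual specification language), a single declarative description of the target structure can drive both directions of marshaling, replacing per-message glue code:

```python
import struct

# Hypothetical field spec: (field name, struct format) pairs describing
# the target in-memory structure directly. pack/unpack are derived from
# the spec instead of being hand-written for every message type.
POINT_SPEC = [("x", "<f"), ("y", "<f"), ("tag", "<H")]

def pack(spec, record):
    return b"".join(struct.pack(fmt, record[name]) for name, fmt in spec)

def unpack(spec, buf):
    out, offset = {}, 0
    for name, fmt in spec:
        (out[name],) = struct.unpack_from(fmt, buf, offset)
        offset += struct.calcsize(fmt)
    return out

wire = pack(POINT_SPEC, {"x": 1.0, "y": 2.0, "tag": 7})
assert unpack(POINT_SPEC, wire) == {"x": 1.0, "y": 2.0, "tag": 7}
```

The point of the sketch is that the spec names the application's own fields, so no separate mapping layer between "message" and "data structure" is needed.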

(III) Interoperability

Portability and cross-language interoperability are essential in modern, heterogeneous environments. Our system aims to support a wide range of deployment targets, including resource-constrained microcontrollers as well as high-performance compute clusters, while enabling seamless transformation across language and platform boundaries. It should be capable of mapping between disparate native constructs of different languages, such as C structs and Python PyObjects, and operate effectively over varied memory representations. In addition, integration with established protocols — such as Protocol Buffers — is essential to ensure compatibility with existing systems. These capabilities enable the development of modular, interoperable software across diverse ecosystems.
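To make the C-struct side of this concrete, here is a minimal sketch (the `Reading` layout is a made-up example) using Python's standard `struct` module to marshal directly to and from the fixed native layout of a C struct, including its alignment padding:

```python
import struct

# Marshal to/from the native layout of a hypothetical C struct:
#   struct Reading { uint16_t id; float value; };
# With 4-byte alignment of float, two padding bytes follow `id`.
# The "<" prefix disables automatic alignment, so padding is explicit.
READING_FMT = "<Hxxf"          # u16 id, 2 pad bytes, f32 value
assert struct.calcsize(READING_FMT) == 8

def to_c_layout(ident, value):
    return struct.pack(READING_FMT, ident, value)

def from_c_layout(buf):
    return struct.unpack(READING_FMT, buf)

buf = to_c_layout(42, 1.5)
assert from_c_layout(buf) == (42, 1.5)
```

In the envisioned system, such a format string would be derived from the layout description rather than written by hand, so the same description can target a C struct, a PyObject, or an established protocol like Protocol Buffers.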

(IV) Adaptivity & Self-Optimization

The system should support self-optimization for distributed deployments by dynamically adapting communication configurations to hardware and network environments. Leveraging customizable wire formats, the system can automatically tune performance metrics such as throughput, latency, or energy consumption, while maintaining type-safe operation. For example, in memory-constrained MCU deployments, it may be advantageous to offload transformation logic to external systems to reduce code size. Likewise, depending on network speed or other constraints, it may be more efficient to trade computational cost for reduced transmission time via optimized wire formats. Our system enables such trade-offs by supporting dynamic optimization strategies. It can suggest wire formats tailored to global performance goals, using a multi-dimensional cost model that reflects application-specific priorities. This opens the door to system-level optimizations previously infeasible under rigid, fixed-format communication protocols.
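A sketch of such a multi-dimensional cost model (the candidate formats, weights, and per-format figures below are invented for illustration): each candidate wire format is scored against application-specific priorities, and the cheapest is suggested.

```python
# Invented example figures: (CPU cost to encode+decode, bytes on the wire)
CANDIDATES = {
    "native-layout": (0.0, 64),   # no transformation, but largest payload
    "packed":        (2.0, 24),
    "compressed":    (8.0, 12),
}

def suggest(candidates, cpu_weight, byte_weight):
    """Score = weighted sum of CPU cost and transmission size; the
    weights encode the application's priorities (a slow link raises
    byte_weight, a weak MCU raises cpu_weight)."""
    def score(fmt):
        cpu, size = candidates[fmt]
        return cpu_weight * cpu + byte_weight * size
    return min(candidates, key=score)

# Fast LAN, weak CPU: avoid transformation entirely.
print(suggest(CANDIDATES, cpu_weight=10.0, byte_weight=0.1))  # native-layout
# Slow radio link: smallest payload wins despite the CPU cost.
print(suggest(CANDIDATES, cpu_weight=0.1, byte_weight=10.0))  # compressed
```

Additional dimensions (energy, code size on the MCU, memory footprint) extend the scoring function in the same way; the type system's job is to guarantee that every candidate the model may pick is a valid, lossless encoding of the data.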

(V) Expressive Data Description

To meet the above requirements, the data description language must be expressive and precise, capable of accurately modeling both low-level and high-level aspects of data formats. At the low level, it should capture bit-level encodings, byte order, structure layouts, padding, alignment, packed structures, and array strides. Simultaneously, the type system should preserve high-level abstractions — such as units of measurement or other higher semantic interpretations — without sacrificing control over the physical layout, if needed. The layered approach of ladder-types enables developers to reason about data both abstractly and concretely (cf. 2.1.5). Another aspect is the support for dependent types, i.e., types whose structure or behavior depends on values elsewhere in the data. This is particularly important when parsing complex serialized formats, where elements such as array lengths, field offsets, or conditional branches are governed by metadata or header values. Dependent types make it possible to express these value-based relationships directly within the data specification, enabling precise, type-safe modeling of real-world formats where such dynamic structure is the norm.
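As a concrete (if simplified) instance of such a value-dependent structure, consider a message whose header byte determines the length, and hence the shape, of the array that follows. The message format is invented for illustration; a dependent type would capture statically the relationship that this parser checks dynamically:

```python
import struct

def parse_message(buf):
    # Header: a u8 count. The *value* of this field governs the shape of
    # the rest of the message: `count` little-endian u16 elements.
    (count,) = struct.unpack_from("<B", buf, 0)
    expected = 1 + 2 * count          # header + count u16 elements
    if len(buf) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(buf)}")
    return list(struct.unpack_from(f"<{count}H", buf, 1))

msg = struct.pack("<B3H", 3, 10, 20, 30)
assert parse_message(msg) == [10, 20, 30]
```

Field offsets and conditional branches governed by header values follow the same pattern: the type of a later region of the buffer is a function of an earlier value.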

Ladder Types

In order to implement complex data structures and algorithms, many layers of abstraction are usually built on top of each other. Consequently, higher-level data types are encoded into lower-level data types, forming a chain of embeddings from concept to 'rock bottom' of...
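A minimal sketch of such a chain of embeddings (a toy example of our own, not the ladder-type notation itself): a natural number is encoded as decimal digits, the digits as ASCII characters, and the characters as raw bytes, each layer refining the one above it toward the machine representation.

```python
# A toy embedding chain:  ℕ  ~  [decimal digit]  ~  [ASCII char]  ~  bytes
def nat_to_digits(n):
    return [int(d) for d in str(n)]        # abstract number -> digit list

def digits_to_ascii(digits):
    return [ord("0") + d for d in digits]  # digits -> ASCII code points

def ascii_to_bytes(codes):
    return bytes(codes)                    # code points -> raw bytes

# Composing the embeddings walks the ladder from concept to rock bottom.
assert ascii_to_bytes(digits_to_ascii(nat_to_digits(2025))) == b"2025"
```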