Strong consistency guarantees that every read reflects the most recent write. It ensures that all data views are updated immediately and accurately after a change. Strong consistency is typically associated with orchestration, since it often relies on a central coordinator to manage atomic updates across multiple data views — either updating all at once, or none at all. Such “over-engineering” may be required for systems where minor discrepancies can be disastrous, e.g. financial transactions, but not in our case.
Eventual consistency allows for temporary discrepancies between data views, but given enough time, all views will converge to the same state. This approach typically pairs with choreography, where each worker reacts to events independently and asynchronously, without needing a central coordinator.
The asynchronous and loosely-coupled design of the dataflow architecture is characterised by eventual consistency of data views, achieved through a choreography of materialisation logic.
And there are perks to that.
Perks: at the system level
Resilience to partial failures: The asynchrony of choreography is more robust against component failures or performance bottlenecks, as disruptions are contained locally. In contrast, orchestration can propagate failures across the system, amplifying the issue through tight coupling.
Simplified write path: Choreography also reduces the responsibility of the write path, which shrinks the code surface area where bugs could corrupt the source of truth. Conversely, orchestration makes the write path more complex and increasingly hard to maintain as the number of different data representations grows.
Perks: at the human level
The decentralised control logic of choreography allows different materialisation stages to be developed, specialised, and maintained independently and concurrently.
The spreadsheet ideal
A reliable dataflow system is akin to a spreadsheet: when one cell changes, all related cells update instantly — no manual effort required.
In an ideal dataflow system, we want the same effect: when an upstream data view changes, all dependent views update seamlessly. Like in a spreadsheet, we shouldn’t have to worry about how it works; it just should.
But ensuring this level of reliability in distributed systems is far from simple. Network partitions, service outages, and machine failures are the norm rather than the exception, and the concurrency in the ingestion pipeline only adds complexity.
Since message queues in the ingestion pipeline provide reliability guarantees, deterministic retries can make transient faults seem like they never happened. To achieve that, our ingestion workers need to adopt the event-driven work ethic:
Pure functions have no free will
In computer science, pure functions exhibit determinism, meaning their behaviour is entirely predictable and repeatable.
They are ephemeral — here for a moment and gone the next, retaining no state beyond their lifespan. Naked they come, and naked they shall go. And from the immutable message inscribed into their birth, their legacy is determined. They always return the same output for the same input — everything unfolds exactly as predestined.
And that is exactly what we want our ingestion workers to be.
Immutable inputs (statelessness)
The immutable message consumed by a worker encapsulates all the information it needs, removing any dependency on external, changeable data. Essentially, we pass data to the workers by value rather than by reference, so that processing a message tomorrow yields the same result as processing it today.
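To make this concrete, here is a minimal sketch (the message shape and names such as WorkoutIngested and profile_user_fitness are invented for illustration): the worker is a pure function of a frozen message that carries everything it needs by value.

```python
# Hypothetical sketch of an immutable, self-contained message and a pure worker.
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkoutIngested:
    """Everything the worker needs, passed by value inside the message."""
    user_id: str
    workout_type: str
    duration_minutes: int
    heart_rate_samples: tuple[int, ...]  # immutable sequence, not a mutable list


def profile_user_fitness(message: WorkoutIngested) -> dict:
    """Pure function of the message: no lookups of external, changeable state."""
    avg_hr = sum(message.heart_rate_samples) / len(message.heart_rate_samples)
    return {
        "user_id": message.user_id,
        "training_load": message.duration_minutes * avg_hr / 100,
    }


# Processing the same message tomorrow yields exactly the same result as today.
msg = WorkoutIngested("user-42", "run", 30, (120, 135, 150))
assert profile_user_fitness(msg) == profile_user_fitness(msg)
```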
Task isolation
To avoid concurrency issues, workers should not share mutable state.
Transitional states within the workers should be isolated, like local variables in pure functions — without reliance on shared caches for intermediate computation.
It’s also crucial to scope tasks independently, ensuring that each worker handles tasks without sharing input or output spaces, allowing parallel execution without race conditions. E.g. scoping the user fitness profiling task to a particular user_id, since both inputs (workouts) and outputs (user fitness metrics) are tied to a unique user.
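A minimal sketch of such scoping, with hypothetical function and field names: the task’s input space and output space are both keyed by the same user_id, so workers handling different users can run in parallel without touching each other’s data.

```python
# Hypothetical sketch: a task scoped to a single user_id. Both the input
# (this user's workouts, already embedded in the message) and the output
# (this user's row in the derived view) are keyed by user_id, so tasks for
# different users never share state and can safely run in parallel.

def profile_user_fitness(message: dict) -> None:
    user_id = message["user_id"]
    workouts = message["workouts"]              # input space: this user only

    # Intermediate state lives in local variables, not a shared cache.
    total_minutes = sum(w["duration_minutes"] for w in workouts)
    fitness_score = min(100.0, total_minutes / 10)

    write_user_fitness(user_id, fitness_score)  # output space: this user only


def write_user_fitness(user_id: str, score: float) -> None:
    # Placeholder for an idempotent, user-scoped write to the derived view
    # (see "Idempotent outputs" below).
    print(f"user_fitness[{user_id}] = {score}")


profile_user_fitness({
    "user_id": "user-42",
    "workouts": [{"duration_minutes": 30}, {"duration_minutes": 45}],
})
```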
Deterministic execution
Non-determinism can sneak in easily: reading the system clock, depending on external data sources, and using probabilistic/statistical algorithms that rely on random numbers can all lead to unpredictable results. To prevent this, we embed all “moving parts” (e.g. random seeds or timestamps) directly in the immutable message.
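A minimal sketch of this idea, with invented field names: the timestamp and random seed are fixed once at publish time and travel inside the message, so a retried worker reproduces exactly the same output.

```python
import random
import time


def publish(user_id: str, raw_score: float) -> dict:
    # All "moving parts" are captured once, when the event is published,
    # and embedded in the immutable message.
    return {
        "user_id": user_id,
        "raw_score": raw_score,
        "event_time": time.time(),               # not read again inside the worker
        "random_seed": random.randrange(2**32),  # fixed seed, reused on every retry
    }


def process(message: dict) -> dict:
    rng = random.Random(message["random_seed"])  # seeded RNG: same draws every run
    sampled_weight = rng.uniform(0.9, 1.1)
    return {
        "user_id": message["user_id"],
        "score": message["raw_score"] * sampled_weight,
        "computed_at": message["event_time"],    # embedded timestamp, not time.time()
    }


msg = publish("user-42", 7.5)
assert process(msg) == process(msg)  # a retry yields an identical result
```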
Deterministic ordering
Load balancing with message queues (multiple workers per queue) can result in out-of-order message processing when a message is retried after the next one has already been processed. E.g. out-of-order evaluation of user fitness challenge results could make completion appear to jump from 50% to 70% and then back to 60%, when it should increase monotonically. Out-of-order processing can also break causal dependencies between operations that require sequential execution, like inserting a record and then notifying a third-party service.
At the application level, these sequential operations should either run synchronously on a single worker or be split into separate sequential stages of materialisation.
At the ingestion pipeline level, we could assign only one worker per queue to ensure serialised processing that “blocks” until a retry succeeds. To maintain load balancing, we can use multiple queues with a consistent hash exchange that routes messages based on the hash of the routing key, achieving a similar effect to Kafka’s hashed partition key approach.
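A simplified sketch of the routing idea (queue names are invented, and a real consistent hash exchange, such as RabbitMQ’s plugin, also minimises re-routing when queues are added or removed): hashing the routing key pins all of a user’s messages to one queue, whose single consumer processes them in order.

```python
import hashlib

QUEUES = ["ingest-0", "ingest-1", "ingest-2", "ingest-3"]  # one consumer each


def queue_for(routing_key: str) -> str:
    # Route by the hash of the key so every message for the same key
    # lands on the same queue and is processed serially by its consumer.
    digest = hashlib.sha256(routing_key.encode()).digest()
    return QUEUES[int.from_bytes(digest[:8], "big") % len(QUEUES)]


# All of user-42's events are serialised onto a single queue,
# while other users' events spread across the remaining queues.
assert queue_for("user-42") == queue_for("user-42")
```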
Idempotent outputs
Idempotence is the property that executing a piece of code multiple times yields the same result as executing it once.
For example, a trivial database “insert” operation is not idempotent while an “insert if does not exist” operation is.
This ensures that you get the same outcome as if the worker only executed once, regardless of how many retries it actually took.
Caveat: Note that unlike pure functions, a worker does not “return” an object in the programming sense. Instead, it overwrites a portion of the database. While this may look like a side effect, you can think of this overwrite as similar to the immutable output of a pure function: once the worker commits the result, it reflects a final, unchangeable state.
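A minimal sketch of such an idempotent write using the standard-library sqlite3 module (table and column names are invented): replaying the same message leaves the database exactly as a single execution would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_fitness (user_id TEXT PRIMARY KEY, score REAL)")


def write_fitness_score(user_id: str, score: float) -> None:
    # Upsert keyed by user_id: a retried message overwrites the row with the
    # same values instead of inserting a duplicate.
    conn.execute(
        "INSERT INTO user_fitness (user_id, score) VALUES (?, ?) "
        "ON CONFLICT(user_id) DO UPDATE SET score = excluded.score",
        (user_id, score),
    )
    conn.commit()


write_fitness_score("user-42", 71.5)
write_fitness_score("user-42", 71.5)  # simulated retry
assert conn.execute("SELECT COUNT(*) FROM user_fitness").fetchone()[0] == 1
```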
Dataflow in client-side applications
Traditionally, we think of web/mobile apps as stateless clients talking to a central database. However, modern “single-page” frameworks have changed the game, offering “stateful” client-side interaction and persistent local storage.
This extends our dataflow architecture beyond the confines of a backend system into a multitude of client devices. Think of the on-device state (the “model” in model-view-controller) as a derived view of server state — the screen displays a materialised view of the local on-device state, which in turn mirrors the central backend’s state.
Push-based protocols like server-sent events and WebSockets take this analogy further, enabling servers to actively push updates to the client without relying on polling — delivering eventual consistency from end to end.
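As a rough illustration (standard library only; the endpoint path and payload are invented), a server-sent events stream lets the backend push view updates to clients instead of waiting to be polled:

```python
import http.server
import json
import time


class PushHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/events":
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.send_header("Cache-Control", "no-cache")
        self.end_headers()
        # Simulate pushing an update whenever the server-side view changes;
        # the client applies each event to its local, derived copy of the state.
        for version in range(3):
            event = json.dumps({"view": "user_fitness", "version": version})
            self.wfile.write(f"data: {event}\n\n".encode())
            self.wfile.flush()
            time.sleep(1)


if __name__ == "__main__":
    http.server.HTTPServer(("localhost", 8000), PushHandler).serve_forever()
```

A browser client could subscribe to such a stream with the EventSource API and refresh its local model on each event, keeping the on-device view eventually consistent with the backend.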