Skip to main content

The Log - What every software engineer should know about real-time data's unifying abstraction.

·2 mins

The Log: What every software engineer should know about real-time data’s unifying abstraction.

  • The log/table duality - the idea that the latter is just the final state of the former - has an interesting implication: how do we join a stream and a database table?
    • Convert the latter to a stream by using its changelog. (We can apply new transformations or use a partitioning scheme that’s different from the table’s original index.)
    • Now, both original stream and this new table-specific stream can use similar partitioning to become co-located on stream processors.
    • We can also take regular snapshots of the tables. If a processor restarts (or replaces an existing one), it could start with the last available snapshot and apply all changes (in the table stream) since the last checkpoint.
  • Interesting idea: you can look at an organization, with a variety of data-stores inside it (such as Redis, HDFS, key-value stores, relational etc.) as one giant distributed database. All these individual stores, which contain copies, albeit transformed, are just indexes.
  • Sequence numbers in a log (i.e. log indexes) are a useful concept. In a replicated system, these numbers help you know how up-to-date a replica is.
    • If a client wants read-after-write consistency, the client can supply the last sequence number it wrote. That way, the replica that got the query can easily check if it is at least as up-to-date as the client expects and, if not, wait for sometime or throw an error instead of returning an incorrect response.
  • Log vs serving layer is a useful way of looking at systems. All writes happen to a log which becomes the source of truth. They are then applied to a database (possibly partitioned 1:1 with the log) and the serving layer uses this to respond to client read requests.
  • Log shouldn’t be too costly. Reasons:
    • It does sequential writes and reads, so can utilize cheap HDDs. No need for huge RAM or SSDs.
    • One log will have multiple reader systems. So, cost gets amortized.