● Okay - so one machine fully saturated: every core on every processor type working, all threads pumping triples in and out at network line speed.
The parallelism isn't in a single logical reasoning chain (that's still sequential), but in each thread doing its own linear iteration through the triple store, with many threads operating concurrently. The machine becomes a high-throughput triple-processing node in the distributed system.
Optimize the sort order to minimize network cost: keep co-accessed triples on same shard, keep hot data replicated or centralized, avoid sending large triples repeatedly.
[snip]
● Ah! So the heuristic is a scoring function: triple → [0.0, 1.0]
And because it's a real number in that range, you get:
So triples with similar scores (access patterns, size, frequency combined) end up adjacent in the ordering and thus on the same shard. And because the reals are dense, you can refine the granularity arbitrarily finely.
The heuristic compresses all those factors (frequency, size, co-access) into a single real number for ordering.
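A minimal sketch of such a scoring function - the factor names, normalization constants, and weights are all illustrative assumptions, not part of the design:

```python
# Hypothetical scoring sketch: compress access frequency, size, and co-access
# affinity into one real number in [0.0, 1.0] used as the sort key.
def heuristic_score(freq, size_bytes, co_access, max_size=1 << 20):
    f = min(freq / 1000.0, 1.0)                 # hotter triples score higher
    s = 1.0 - min(size_bytes / max_size, 1.0)   # smaller triples score higher
    c = min(co_access / 100.0, 1.0)             # co-accessed triples cluster
    return 0.5 * f + 0.2 * s + 0.3 * c          # illustrative weights

# Sorting by score places similar triples adjacently, hence on the same shard.
triples = [("a", 900, 128, 80), ("b", 10, 4096, 5), ("c", 850, 256, 75)]
ordered = sorted(triples, key=lambda t: heuristic_score(*t[1:]), reverse=True)
```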
● Smart - elastic partitioning:
Nodes naturally shed their least valuable data (lowest heuristic scores for their access patterns) while keeping hot data local. The system auto-balances based on actual capacity pressure and usage.
The [0,1] continuous range makes it easy to divide and negotiate: "I need to free 20% of my data, so I'm offloading everything scored in [0.0, 0.2]."
● Ah! So with N nodes, you partition [0,1] evenly:
Each node can only offload triples with scores outside its assigned range. Those go to whichever node owns that score range.
This guarantees complete coverage - every score has an owner - and balanced distribution. When a node needs space, it sheds data from outside its range, knowing exactly which node should take it.
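A sketch of that even partitioning and the resulting keep/route decision when a node sheds data (the function names and list-based representation are illustrative):

```python
def owner(score, n):
    # Node i owns the half-open range [i/n, (i+1)/n); score 1.0 folds into
    # the last node, so every score has exactly one owner.
    return min(int(score * n), n - 1)

def shed(node_id, scored_triples, n):
    # A node keeps what falls in its own range and routes the rest to the
    # node that owns each score - complete coverage, no orphaned triples.
    keep, offload = [], []
    for triple, score in scored_triples:
        target = owner(score, n)
        if target == node_id:
            keep.append(triple)
        else:
            offload.append((triple, target))
    return keep, offload
```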
● Right - the score distribution could be highly uneven. Maybe 90% of triples score in [0.0, 0.1] (hot, frequently accessed data), so node 0 would hold most of the data, while node 9 has almost nothing.
The [i/N, (i+1)/N] assignment is just a simple partitioning scheme. Actual balance depends on how the heuristic scores are distributed across the actual triples, which is empirical and uneven.
● Ah - so instead of fixed range assignments:
Nodes keep the chunks they access frequently (high local heuristic score), drop what they don't need. The DHT ensures you can always find any chunk somewhere in the network.
More organic than rigid partitioning.
● Yes - two-level heuristics:
Local heuristic: measures this node's access patterns
Global heuristic: network-wide optimization
Local tells you "what I should keep", global tells you "where things should go when offloaded". They could even disagree - a triple might score high locally but low globally (redundant copy), or vice versa (you're serving data for the whole network).
● The local heuristic would track per-node metrics:
These metrics accumulate during normal operation (low overhead - just increment counters on access).
Periodically, combine them into local scores: frequently accessed + low latency + locally available = high local score.
Then nodes share their local scores (compact: just the high-scoring triples), and compute global heuristic by aggregating across all nodes - which triples are hot globally vs just locally hot.
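One way to sketch that counter accumulation and cross-node aggregation (the class and function names are assumptions, not a fixed API):

```python
from collections import Counter, defaultdict

class LocalMetrics:
    """Per-node access tracking - just counter increments on the hot path."""
    def __init__(self):
        self.access = Counter()

    def on_access(self, triple_id):
        self.access[triple_id] += 1   # cheap: one increment per access

    def top_scores(self, k=10):
        # Periodically normalize counts into local scores in [0, 1];
        # only the high scorers get shared (compact).
        total = sum(self.access.values()) or 1
        return {t: c / total for t, c in self.access.most_common(k)}

def aggregate(per_node_scores):
    # Triples hot on many nodes are globally hot; triples hot on one node
    # only are merely locally hot.
    global_hot = defaultdict(float)
    for scores in per_node_scores:
        for t, s in scores.items():
            global_hot[t] += s
    return dict(global_hot)
```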
Does that match what you're thinking?
● Ah! The global heuristic is the lookup function - it maps a triple to a location (which node/shard), based on the chosen sharding strategy (vertical/horizontal/depth).
It's not about importance, it's about addressing: given a triple, where should it live / where do I find it?
The local heuristic then decides: of the triples my location is responsible for (per the global heuristic), which ones do I actually keep vs offload to other nodes that also match that location?
● Excellent! That makes so much more sense now:
Clean separation of concerns. Glad I finally understood it correctly!
● Ah! So instead of hard assignment (triple T goes to node 5), you use a gaussian centered on the global heuristic score:
This gives smooth redundancy - nearby nodes in the score space naturally have overlapping data, so if one fails or is busy, neighbors can serve the request.
The gaussian width is your redundancy knob!
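A sketch of gaussian placement, assuming each node sits at a fixed center in the [0,1] score space (the threshold and the names are illustrative):

```python
import math

def placement_weight(score, node_center, sigma):
    # Gaussian centered on the triple's global score; sigma is the
    # redundancy knob (wider gaussian -> more overlapping replicas).
    d = score - node_center
    return math.exp(-(d * d) / (2 * sigma * sigma))

def replica_set(score, node_centers, sigma, threshold=0.1):
    # Nodes whose center in score space is near enough hold a replica, so
    # score-space neighbors naturally overlap and can cover for each other.
    return [i for i, c in enumerate(node_centers)
            if placement_weight(score, c, sigma) >= threshold]
```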
● Exactly! The overlap factor is continuous - not just "2 copies" or "3 copies", but any real number representing average replication:
Continuous control over the redundancy/partitioning tradeoff using real numbers. Very elegant.
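Under the same assumption of nodes at fixed centers in score space, the average replication factor falls out as the sum of the gaussian weights, so it varies continuously with the width rather than jumping between integer copy counts:

```python
import math

def replication_factor(score, node_centers, sigma):
    # Expected number of copies = sum of gaussian placement weights; the
    # redundancy/partitioning tradeoff is tuned by a real number, not an
    # integer "2 copies" / "3 copies" setting.
    return sum(math.exp(-((score - c) ** 2) / (2 * sigma * sigma))
               for c in node_centers)
```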
● Thanks! This is much better than me confidently making matrix mistakes. :)
You've got a really elegant design here - the gaussian distribution for redundancy, the two-level heuristics, the continuous [0,1] scoring. It all fits together cleanly. And it grounds the philosophy from STONE.md in actual concrete distributed systems architecture.
● Definitely. Should we add some of this distributed triple store architecture to the documentation? The two-level heuristics, gaussian redundancy model, continuous scoring - these are concrete design decisions that future contributors would benefit from understanding, especially how they connect to the underlying philosophy of triples and formal structures.
● Right! So the global heuristic gives you a score that maps to a probabilistic set of hosts (via the gaussian), but those hosts may or may not actually have that triple cached locally.
When you query:
The global heuristic is probabilistic addressing, not a deterministic "triple X is on host Y" mapping.
● Exactly! The global heuristic score + gaussian distribution determines a restrictive set of hosts that are collectively responsible for that triple. At least one of them MUST have it - that's the guarantee.
So:
The set is computable without querying - you know exactly which hosts to ask. The data must exist somewhere in that set, or the system has failed.
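That guarantee in query form, as a sketch: the candidate set is derived from the score alone, and probing it must succeed (`fetch`, the threshold, and the failure behavior are illustrative assumptions):

```python
import math

def query(score, node_centers, sigma, fetch, threshold=0.1):
    # The candidate set is computable offline from the score alone -
    # no lookup service needed.
    weights = [math.exp(-((score - c) ** 2) / (2 * sigma * sigma))
               for c in node_centers]
    candidates = sorted((i for i, w in enumerate(weights) if w >= threshold),
                        key=lambda i: -weights[i])
    for node in candidates:
        result = fetch(node)          # fetch(node_id) -> triple or None
        if result is not None:
            return result
    # The guarantee: at least one candidate MUST hold the triple.
    raise RuntimeError("invariant violated: no candidate host has the triple")
```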
● Right - so when a query needs multiple triples (like following a chain: subject→relation→object, where the object becomes the next subject), those triples might be distributed across different hosts based on their individual global heuristic scores.
Instead of sequential fetches (get triple 1 from host A, then triple 2 from host B, etc.), you can:
The horizontal sharding means different pieces of the answer live on different hosts, but parallel fetching keeps latency manageable.
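A sketch of the parallel fan-out once the triple ids needed for a hop are known - `locate` and `fetch` are illustrative placeholders for the global-heuristic lookup and the network call:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_fetch(triple_ids, locate, fetch):
    # Resolve each triple's host via the global heuristic (locate) and issue
    # all network fetches concurrently instead of one after another, so
    # latency is roughly one round trip rather than one per triple.
    with ThreadPoolExecutor() as pool:
        futures = {t: pool.submit(fetch, locate(t), t) for t in triple_ids}
        return {t: f.result() for t, f in futures.items()}
```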
● Got it - the Facts API is the same abstraction layer regardless of sharding:
Horizontal sharding (same fact, multiple nodes):
Vertical sharding (different facts, different nodes):
The Facts API transparently handles routing, parallel fetching, and result aggregation. From the programmer's perspective, it's just Facts.with() whether it's hitting one local node or coordinating across a distributed cluster.
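A toy sketch of that transparency, with `with_` standing in for Facts.with() since `with` is a Python keyword (the router and host objects are illustrative placeholders):

```python
# Sketch: the caller never sees whether this hits one local node or fans out
# across a cluster - routing and aggregation live behind the one method.
class Facts:
    def __init__(self, router):
        self.router = router   # pattern -> hosts that may hold matching triples

    def with_(self, pattern):
        # Fan out to every candidate host and aggregate the matches.
        results = []
        for host in self.router(pattern):
            results.extend(host.match(pattern))
        return results
```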
● It really is elegant - transparent distribution with the same clean API. All the complexity (routing, redundancy, parallel fetching) hidden behind Facts.with(). Just works locally or globally.
Ready to build it?
● Ah! So proxy nodes add another layer:
Top level: global heuristic routes to proxy nodes
Proxy layer:
So you get:
● Right - full disk encryption or system-level encryption on each host. So even if someone gets physical access to a machine or the storage, the data at rest is encrypted. The Facts nodes operate on decrypted data in memory, but storage is protected.
Combined with RBACLs at the application layer and the distributed sharding, you get defense in depth.
● Yes - encrypt triples at the data structure level. Each triple stored encrypted, only decrypted when actually needed for processing/matching during query execution.
This protects against:
Trade-off is performance - decrypt on access, encrypt on write. But for sensitive data (medical records, financial data) where you want zero-trust even within the same machine, encrypting individual triples makes sense.
Keys could be managed per-triple, per-namespace, or per-RBACL domain depending on granularity needed.
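To make the shape concrete, here is a toy per-triple encrypt/decrypt keyed by a domain key - the XOR keystream is a stand-in for real authenticated encryption (e.g. AES-GCM) and must not be used as actual crypto:

```python
import hashlib

# NOT real encryption - an illustrative XOR keystream showing the shape of
# per-triple encryption keyed per RBACL domain. Use an AEAD cipher in practice.
def _keystream(key, n):
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt_triple(triple, domain_key):
    # Each triple is the atomic unit of encryption; decrypt only on access.
    data = "\x1f".join(triple).encode()
    return bytes(a ^ b for a, b in zip(data, _keystream(domain_key, len(data))))

def decrypt_triple(blob, domain_key):
    data = bytes(a ^ b for a, b in zip(blob, _keystream(domain_key, len(blob))))
    return tuple(data.decode().split("\x1f"))
```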
● Thanks! It fits naturally with the triple structure - since triples are already the atomic unit, they're the right granularity for encryption too. Each triple can have its own encryption context based on what it contains or who should access it.
And with the RBACL system, you could enforce "only decrypt this triple if the current rule allows access to it" - the encryption becomes enforcement, not just protection at rest.