Project discovery: proposal

Discovery: architecture

For the needs of the application, we are looking for a way to subscribe to a feed of new projects that surface on the network. Initially, this is a way to explore and discover a hand-curated list of projects, but the long-term goal is for all projects on the network to be discoverable this way.

From the user’s point of view, we essentially want to:

  1. Have a way to specify seed nodes to connect to
  2. Explore the projects published by these nodes
  3. Subscribe to new projects as they are published

Proposal

There are two relevant components: the seed node, which is always online and publishes interesting events, and the application, a client that connects to the node(s) and fetches these events to display them to the user.

Seed node

It’s important that the seed node is able to provide a neutral timeline of the projects being created in the network. Of course, the only way it can do that is to publish the projects as it discovers them. The simplest data structure that serves this need is an append-only log of events.

The event, in its simplest form, is Event { project_urn: RadUrn }.
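As a rough sketch (RadUrn is stubbed as a string here; in radicle-link it is a proper URN type, and the signing discussed below is elided):

    // Minimal sketch of the event and its append-only log.
    // RadUrn is stubbed as a string for brevity.
    type RadUrn = String;

    /// The minimal event: a project that surfaced on the network.
    struct Event {
        project_urn: RadUrn,
    }

    /// An append-only log: events are only ever appended, never
    /// mutated or removed. (In radicle-link, the feed would also be
    /// signed on every publish; signing is elided here.)
    struct Feed {
        events: Vec<Event>,
    }

    impl Feed {
        fn publish(&mut self, project_urn: RadUrn) {
            self.events.push(Event { project_urn });
        }
    }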

The end goal is for the node to have an HTTP API that can be used to register projects to track and replicate. When this API is called, the node starts looking for the project, and when it has successfully cloned it, it publishes an event on its feed. To start, though, we can feed it a list of projects to track initially.
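The flow behind such a registration call might look roughly like this (the SeedNode type and method names are purely illustrative, and the HTTP layer is elided):

    /// Illustrative stub of the seed node's tracking state.
    struct SeedNode {
        tracked: Vec<String>,
        feed: Vec<String>,
    }

    impl SeedNode {
        /// Invoked by the (hypothetical) registration endpoint:
        /// start looking for the project on the network.
        fn track(&mut self, urn: &str) {
            self.tracked.push(urn.to_string());
        }

        /// Invoked once the project has been cloned successfully:
        /// publish an event on the feed.
        fn on_cloned(&mut self, urn: &str) {
            self.feed.push(urn.to_string());
        }
    }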

The feed, being a radicle-link entity type, is signed by the node every time a new event is published to it.

Though the minimal event only includes a URN, the node is free to add any metadata the user might find interesting: for example, project stats, maintainers, language, description, etc. This information can be parsed from the underlying project repo and included in the event feed.
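For instance, a richer event could look like this (the field set is purely illustrative, not a fixed schema):

    /// The minimal event, extended with optional metadata parsed from
    /// the underlying project repo. All fields beyond the URN are
    /// illustrative.
    struct Event {
        project_urn: String,
        description: Option<String>,
        language: Option<String>,
        maintainers: Option<Vec<String>>,
        commit_count: Option<u64>,
    }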

Application

The application has configuration specifying which seed nodes to track. These are DNS hostnames that point to a (PeerId, IpAddr) pair.

When connected to a seed node, the application requests the event feed from a well-known location, e.g. /namespaces/feed.

It parses the events, displays them to the user, and allows the user to track the projects.
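A rough client-side sketch, using std-only DNS resolution (the actual transport, QUIC in radicle-link, and the feed parsing are elided; all names are illustrative):

    use std::net::ToSocketAddrs;

    /// A configured seed: a DNS hostname pinned to the PeerId we expect
    /// to find at the resolved address.
    struct Seed {
        host: String,    // e.g. "seed.example.com:12345" (illustrative)
        peer_id: String, // the expected PeerId, a plain string for brevity
    }

    /// Well-known location of the event feed on a seed node.
    const FEED_PATH: &str = "/namespaces/feed";

    fn main() -> std::io::Result<()> {
        let seeds = vec![Seed {
            host: "seed.example.com:12345".into(),
            peer_id: "<expected-peer-id>".into(),
        }];
        for seed in &seeds {
            for addr in seed.host.to_socket_addrs()? {
                // Connect to `addr`, verify the peer id, fetch FEED_PATH,
                // parse the events, and offer them to the user.
                println!("{} ({}) -> {}, fetch {}", seed.host, seed.peer_id, addr, FEED_PATH);
            }
        }
        Ok(())
    }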

Future

Eventually, users will also gain the ability to publish their own event feeds, just like the seed node. These can be freely intermixed with the “aggregates” published by the seed nodes. Each feed represents the subjective perspective of the user/node publishing it.


Revisiting this post in light of the recent user influx, where many people asked for some form of discovery feature.

What this proposal drives home is that the application (referred to here as the layer on top of the protocol) takes control over a subtree of the state to enable replication of specialised data. This feels like the right approach, and we are looking to employ a similar strategy for other upcoming features.

What I’d like to understand is how the idea of a feed for this kind of feature will hold up for ordering heuristics other than recency. For example, a peer is asked for its top projects (by number of tracking remotes, number of git contributors, etc.). Furthermore, even when only used as a timeline, these feeds can grow quickly, especially for popular seeds. How do we avoid the need to replicate the entire timeline?

Are there ways to model it differently? Maybe with an approach where we maintain a set of heads representing the top-N slots, with different sets for different orderings? Any time a new project is tracked or a known one is updated, the seed moves the heads accordingly.
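To make the question concrete, here is a minimal sketch of one such top-N set as I read the idea (scores are illustrative, and this says nothing about how they would be verified):

    use std::collections::BTreeSet;

    /// One ordering's top-N slots, as (score, urn) pairs kept sorted.
    /// A real version would also evict a project's stale entry when its
    /// score changes; that bookkeeping is elided here.
    struct TopN {
        n: usize,
        slots: BTreeSet<(u64, String)>,
    }

    impl TopN {
        /// Called whenever a new project is tracked or a known one updates.
        fn update(&mut self, urn: String, score: u64) {
            self.slots.insert((score, urn));
            if self.slots.len() > self.n {
                // Drop the lowest-scoring head to keep at most N slots.
                self.slots.pop_first();
            }
        }
    }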

Another observation: while we are in the fortunate position to express app-specific data schemes in the mønøstate and can selectively replicate them with fetchspeccing, it seems that whenever we need coordination we reach for out-of-band solutions like the HTTP API in your proposal. In the protocol there are currently two supported streams of data on a connection between two peers: git and gossip. Could we reserve another stream variant over which an application (again used liberally for anything that is not protocol: so upstream, seed) can exchange self-defined messages? A peer can always choose not to accept such a stream. With the help of this, an API like the one the seed would want to expose to manipulate its tracking graph could be done without any operational overhead.

These are not valid questions to ask, because there is no way to validate the claim except by computing the answers on a local copy. The number of tracking remotes is known per-project by examining rad/signed_refs.

There is indeed no way currently to ask a (random) peer for the list of namespaces it knows about. Since such a query must yield a finite-memory response, I do not see a non-probabilistic solution.

By limiting the fetch depth. Entries in such a feed must either not depend on each other at all, or yield a commutative state when folded over.
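Concretely: if each entry only asserts that a project exists, folding the entries into a set is commutative and idempotent, so a partial fetch at limited depth can never produce a wrong state, only a smaller one:

    use std::collections::BTreeSet;

    /// Fold feed entries into the resulting state: the set of known
    /// project URNs. (Entries here are illustrative strings.)
    fn fold_feed(entries: &[&str]) -> BTreeSet<String> {
        entries.iter().map(|urn| urn.to_string()).collect()
    }

    fn main() {
        // The same entries, folded in a different order, yield the
        // same state.
        let a = fold_feed(&["rad:git:project-1", "rad:git:project-2"]);
        let b = fold_feed(&["rad:git:project-2", "rad:git:project-1"]);
        assert_eq!(a, b);
    }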

Again, if a peer is to compute any kind of popularity metric, and everyone else just believes them, the best strategy for that peer is to publish wrong metrics in favour of whatever they want to be in the top-N. I fail to see how this provides any value.

This is isomorphic to an HTTP API reachable at a well-known address. If you already know whom to talk to, and that the other end will understand your self-defined messages, you can simply establish a connection.

Thanks for reviving this!

As I re-read my proposal, I think that we could just as well jump straight to what is proposed in the “Future” section. Instead of limiting this to the seed node, how about just having all users post about their own projects (or issues, refs, etc.), and having the seed node simply replicate those feeds? The application can then intermix all the feeds into a global timeline and/or a list of forks or projects. Later, this same feature can be used as a general feed where users can make project announcements and the like.
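A sketch of that intermixing, assuming each event carries a timestamp and its author (neither is part of the minimal event above):

    /// Illustrative event shape for a per-user feed.
    struct FeedEvent {
        timestamp: u64,      // e.g. seconds since the Unix epoch
        author: String,      // the publishing peer
        project_urn: String,
    }

    /// Merge per-peer feeds into one global timeline, newest first.
    fn global_timeline(feeds: Vec<Vec<FeedEvent>>) -> Vec<FeedEvent> {
        let mut all: Vec<FeedEvent> = feeds.into_iter().flatten().collect();
        all.sort_by(|x, y| y.timestamp.cmp(&x.timestamp));
        all
    }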

I’m wondering if you meant to imply that this should be accessible through browsers. Of course, other stream types (in a reserved range or whatever) could be introduced, but it’s unclear to me how that would reduce operational overhead, as client support would need to be provided.

If you meant browsers, there are some caveats to consider:

  • It is not entirely clear which revisions of IETF QUIC browsers support (if any).
  • If we found a working combo, I believe QUIC is tied to HTTP/3 in browsers. The protocol deliberately doesn’t use HTTP/3, yet it is conceivable to dispatch earlier via ALPN. However, library/framework support for h3 does not yet exist in Rust.
  • If both worked, browsers would still not like being presented with a self-signed certificate. Perhaps surprisingly, this would be the easiest to work around: Rust has a Let’s Encrypt (ACME) library, so it should be easy to submit a CSR for the certificate. One needs to own a domain name, though.