Runa-1 — A geometric map of who relates to whom

§01 — What it is, plainly

Decades of ecological records, folded into a single space.

Runa-1 takes open European ecological data and turns it into one geometric map of who relates to whom. Every species, and every environmental state, becomes a point in a shared space. Two points sit close together when the things they describe are ecologically related.

A predator lands near its prey’s neighbourhood. A pollinator settles near the plants it visits. A nitrogen-loving plant drifts toward the marker for high-nitrogen soil. Nothing is told where to go — the positions fall out of the relationships themselves.

It learns from typed triples: plain statements of the form (head, relation, tail). Given roughly 490,000 of them, the model arranges every point so that each relation becomes a consistent geometric move. Once that’s done, you can ask questions the source records never stated outright:

one training statement

Bombus terrestris pollinates Salix caprea

~490,000 statements like this become rotations in one shared geometry.

nearest-neighbour

“What is ecologically closest to this species?” — read straight off proximity.

relational

“Who might this species pollinate?” — follow the relation as a geometric operation, head ∘ relation → tail.

gap-filling

Plausible links nobody recorded, suggested by analogy to better-documented systems.

It is not a language model and not a forecaster. It is a knowledge-graph embedding (RotatE, trained with PyKEEN) — closer to a map than to a chatbot.
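A minimal sketch of the geometric move behind "head ∘ relation → tail", assuming the standard RotatE formulation: a relation is a vector of phase angles, applied as an element-wise rotation of the head's complex coordinates, and a triple scores higher the closer the rotated head lands to the tail. The toy two-dimensional vectors and the function names are illustrative and are not taken from the Runa-1 code.

```python
import cmath
import math

def rotate(head, phases):
    """Rotate each complex coordinate of `head` by the matching phase angle."""
    return [h * cmath.exp(1j * theta) for h, theta in zip(head, phases)]

def rotate_score(head, phases, tail):
    """RotatE-style score: negative distance between rotated head and tail."""
    moved = rotate(head, phases)
    return -sum(abs(m - t) for m, t in zip(moved, tail))

# Toy 2-dimensional example (the trained model uses 128 complex dimensions).
head = [1 + 0j, 0 + 1j]
relation = [math.pi / 2, math.pi]   # a quarter turn and a half turn
good_tail = [0 + 1j, 0 - 1j]        # exactly where the rotation lands
bad_tail = [1 + 0j, 1 + 0j]

print(rotate_score(head, relation, good_tail))  # very close to 0
print(rotate_score(head, relation, bad_tail))   # clearly negative
```

Because every rotation has unit modulus, a relation can move a point around but never stretch it, which is one way to read "each relation becomes a consistent geometric move".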

“You shall know a species by the company it keeps.” (after J. R. Firth, 1957)

§02 — Why “biosemiotic”

Species relate not only by contact, but by the signs they share.

On top of the ordinary ecological links — predation, pollination, mycorrhizae — Runa adds sign-relations: species treated as producers and readers of ecological signs within an Umwelt, the perceived world of an organism.

The novel move is small but consequential. These sign-relations are trained as ordinary typed edges, so the model places them in the same geometry as everything else. Two species that read the same environmental sign are pulled together — even if they never directly interact. That “relatedness by shared sign” is what the project is really about.

The intellectual lineage is explicit: Peirce’s index, Uexküll’s function-circle, Hoffmeyer’s semiotic scaffolding. Runa doesn’t try to decode any organism’s inner meaning; it models the relational structure of ecological sign-making, and leaves the inner world untouched.

§03 — At a glance

The artifact, in numbers.

489,871	typed triples (stage-3 master)
74,147	entities in one vector space
27 / 52	relation types carrying data
0.2827	held-out MRR (filtered)
128d	RotatE embedding dimension
10	countries of geographic scope

Three snapshots exist — GLOBI-only, then +Mangal, then +biosemiotic layer. The figures above describe the third. Everything below details how it was built and, just as importantly, what is and isn’t yet proven.

§04 — Entities & relations

Two kinds of node, two layers of relation.

The nodes

Biological entities: taxa, keyed by scientific name (canonical binomials from GLOBI and Mangal; not yet reconciled to the GBIF backbone).

State & community nodes: the discrete non-biological targets the sign-relations point at, namely environmental states like soilNitrogen_b4 or thermalIndicator_b3, and detected communities community_0 … community_89.

Layer 1 — conventional ecology

Around 40 relation types, mapped to the OBO Relations Ontology where possible (PURLs taken from GLOBI’s authoritative ro.tsv): trophic (eats, preysOn), antagonistic (parasiteOf, pathogenOf), mutualistic (pollinates, ectomycorrhizalHostOf), structural (createsHabitatFor, epiphyteOf) and generic (interactsWith, coOccursWith).

Layer 2 — biosemiotic (the novel contribution)

ENVAI-namespaced sign-relations
Relation	Meaning	Tail node
indicatorOf	species is a sign of an environmental state (Peircean index)	env-state
keystoneSignProducerIn	keystone sign-producer within a community (Hoffmeyer scaffolding)	community
perceivesSignal	perceptual boundary of the Umwelt (Uexküll) — not built	signal
phenologicalIndicatorOf	phenological sign — defined, not populated	phenophase

One canonical direction is stored per relation; PyKEEN’s create_inverse_triples synthesises inverses at training time, so both directions score without double-counting. Inverse GLOBI labels (eatenBy, pollinatedBy) are flipped to canonical during ingestion.
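The flip to canonical direction can be sketched as a lookup plus a swap of head and tail. The mapping below covers only the two inverse labels named above, and pairing them with eats and pollinates is an assumption; the real ingestion table is presumably larger. The function name is hypothetical.

```python
# Hypothetical inverse-to-canonical map; only the two pairs named in the text.
INVERSE_TO_CANONICAL = {
    "eatenBy": "eats",
    "pollinatedBy": "pollinates",
}

def canonicalise(head, relation, tail):
    """Return the triple in canonical direction, swapping ends for inverse labels."""
    if relation in INVERSE_TO_CANONICAL:
        return tail, INVERSE_TO_CANONICAL[relation], head
    return head, relation, tail

raw = [
    ("Salix caprea", "pollinatedBy", "Bombus terrestris"),
    ("Bombus terrestris", "pollinates", "Salix caprea"),
]
# Both raw rows collapse to a single canonical triple.
canonical = sorted({canonicalise(*triple) for triple in raw})
print(canonical)
```

Storing one direction keeps the same observation from being counted twice; the reverse direction is then left to the inverse triples synthesised at training time.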

§05 — Deriving the sign-edges

The signs aren’t in any database. They’re derived — so the derivation is stated, reproducible, and hash-frozen at deposit.

indicatorOf ← EIVE / Ellenberg

Each European vascular plant’s published indicator value on six gradients — Light, Temperature, Moisture, Reaction/pH, Nutrients/N, Salinity — binned into five fixed classes over 0–10. These are published indicator values (expert-harmonised rankings), not raw site measurements: legitimate “measured signs” in the bioindication tradition. 3,055 of 8,908 EIVE plants matched existing graph species, wiring the edges into the interaction web.
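A sketch of the binning step, under stated assumptions: the five classes are taken to be equal-width (2.0 units each), bin labels are taken to be zero-based (b0 to b4), and the gradient prefix and the example value are illustrative. The actual class edges and label scheme are not given here, so treat all three as placeholders.

```python
def bin_index(value, n_bins=5, lo=0.0, hi=10.0):
    """Map a 0-10 indicator value to an equal-width bin index in 0..n_bins-1."""
    if not lo <= value <= hi:
        raise ValueError(f"indicator value {value} outside [{lo}, {hi}]")
    width = (hi - lo) / n_bins
    # The top edge (10.0) belongs to the last bin rather than to a sixth one.
    return min(int((value - lo) / width), n_bins - 1)

def indicator_triple(plant, gradient, value):
    """Emit (plant, indicatorOf, gradient_bin) for one plant on one gradient."""
    return (plant, "indicatorOf", f"{gradient}_b{bin_index(value)}")

# 8.7 is a made-up value for illustration, not the published EIVE figure.
print(indicator_triple("Urtica dioica", "soilNitrogen", 8.7))
# ('Urtica dioica', 'indicatorOf', 'soilNitrogen_b4')
```

Fixed edges matter more than their exact placement: the same value must always land in the same env-state node, or the derived edges stop being reproducible.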

keystoneSignProducerIn ← graph centrality

No usable external keystone dataset exists. Keystone-ness is operationalised as within-community degree centrality above the 98th percentile on the real interaction graph (Louvain communities). Honest framing: “topologically central within a detected community” — a defensible proxy, not an experimentally demonstrated keystone.
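The keystone proxy can be sketched in plain Python once community labels exist. This version takes the Louvain assignment as an input rather than computing it, uses raw within-community neighbour counts instead of normalised centrality, and pools the 98th-percentile cut across all nodes. Each of those is an assumption; the text does not say whether the percentile is taken per community or overall.

```python
import math
from collections import defaultdict

def within_community_degree(edges, community_of):
    """Count, per node, the neighbours that share the node's community."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b and community_of[a] == community_of[b]:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {node: len(neighbours[node]) for node in community_of}

def keystone_edges(edges, community_of, percentile=98.0):
    """Emit keystoneSignProducerIn triples for nodes above the percentile cut."""
    degree = within_community_degree(edges, community_of)
    ranked = sorted(degree.values())
    # Nearest-rank percentile over all nodes (an assumed pooling choice).
    cut = ranked[max(0, math.ceil(percentile * len(ranked) / 100) - 1)]
    return sorted(
        (node, "keystoneSignProducerIn", f"community_{community_of[node]}")
        for node, d in degree.items() if d > cut
    )

# Toy graph: one hub linked to 99 leaves, all in community 0.
edges = [("hub", f"leaf_{i}") for i in range(99)]
community_of = {"hub": 0, **{f"leaf_{i}": 0 for i in range(99)}}
print(keystone_edges(edges, community_of))
# [('hub', 'keystoneSignProducerIn', 'community_0')]
```

The toy case shows the limit of the proxy as well as its use: the hub is flagged purely for being well connected, which is what "topologically central" means and all it means.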

perceivesSignal ← deliberately deferred

Perception, not production. No species→perceived-signal dataset exists at scale — only slivers (bird UV-vision from opsin genetics; ~34 species in the Animal Audiograms Database). Xeno-Canto documents signal production, which is not a proxy for perception. Rather than fabricate edges, this relation is left for a future, honestly-scoped perception-only build.

Open methodological choice: the Notion spec describes an alternative indicatorOf route, Living Planet Index population trends × CHELSA climate shifts. EIVE is the niche / literature-attested route; the two are complementary, and the choice is held open (see §10).

§06 — Data & provenance

Open sources, filtered to a northern-European window.

Sources feeding the stage-3 master
Source	Role	License	In-scope
GLOBI 2026-05-29	typed interaction triples (the spine)	CC0 / CC-BY per dataset	368,983
Mangal v2 API	whole-network structure	open · per-network DOI	79,021
EIVE 1.0	Ellenberg values → indicatorOf	CC-BY 4.0	44,590
graph centrality (internal)	→ keystoneSignProducerIn	derived	302
GBIF	occurrence + taxonomy backbone	per-record CC0 / CC-BY	streamed

EIVE via Zenodo 10.5281/zenodo.7427088. GBIF: a 69 GB zip (300M+ occurrence rows; 10 countries, with coordinates, CC0 / CC-BY) is now streamed for the independent thermal-niche validation (§09). Full co-occurrence and environmental-layer use is still deferred.

SE · NO · FI · DK · IS · NL · DE · FR · BE · CH

All nine ENVAI agents are covered, France/Ondine included. GLOBI is filtered by a coarse W-Europe/Nordic bounding box (lat 41–71.5, lon −25–32), to be refined by reverse-geocoding.
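The coarse filter amounts to a four-sided comparison. A minimal sketch using the bounds quoted above; the function name, the WGS84 assumption and the policy of dropping records without coordinates are illustrative choices, not confirmed behaviour.

```python
# Bounds quoted in the text: lat 41 to 71.5, lon -25 to 32 (degrees, WGS84 assumed).
LAT_MIN, LAT_MAX = 41.0, 71.5
LON_MIN, LON_MAX = -25.0, 32.0

def in_scope(lat, lon):
    """True when a record's coordinates fall inside the coarse bounding box."""
    if lat is None or lon is None:
        return False  # assumed policy: records without coordinates are dropped
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

print(in_scope(59.33, 18.07))   # Stockholm -> True
print(in_scope(40.42, -3.70))   # Madrid, south of lat 41 -> False
print(in_scope(51.51, -0.13))   # London -> True, although GB is not in scope
```

The London case is why the box is called coarse: it admits records from countries outside the ten listed, which is the gap reverse-geocoding is meant to close.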

§07 — Pipeline (ETL)

From raw records to one master triple file.

Each stage is a module under src/envai/; every triple file carries [head, relation, tail, source, confidence].

stage1_globi	Stream GLOBI, filter to scope, map raw interaction names to the canonical vocabulary (collapsing inverses), dedupe.
stage3_mangal	Walk the Mangal API (network / node / interaction), map types to the vocabulary.
stage_indicator	Parse EIVE, bin each 0–10 gradient into 5 classes, match plants to graph entities, emit (plant, indicatorOf, gradient_bin).
stage_keystone	Build the interaction graph, run Louvain community detection, take the top 2% within-community degree → (species, keystoneSignProducerIn, community).
stage5_taxonomy	Concatenate and dedupe sources into a master. GBIF name reconciliation is implemented but deferred (--no-reconcile so far).
train · eval_per_relation	RotatE / ComplEx via PyKEEN, then a per-relation metric breakdown.
serve_neo4j	Push embeddings to a Neo4j vector index for queries (built, not yet run).
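The concatenate-and-dedupe step at the end of the list can be sketched with the five-field record described above. The dedupe key (head, relation, tail) and the tie-break on confidence are assumptions, as are the type and function names and the example confidence values.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    head: str
    relation: str
    tail: str
    source: str
    confidence: float

def merge_and_dedupe(*sources):
    """Concatenate triple lists, keeping one row per (head, relation, tail)."""
    best = {}
    for triples in sources:
        for t in triples:
            key = (t.head, t.relation, t.tail)
            # Assumed tie-break: keep the highest-confidence copy.
            if key not in best or t.confidence > best[key].confidence:
                best[key] = t
    return list(best.values())

globi = [Triple("Bombus terrestris", "pollinates", "Salix caprea", "globi", 0.9)]
mangal = [
    Triple("Bombus terrestris", "pollinates", "Salix caprea", "mangal", 0.8),
    Triple("Vulpes vulpes", "eats", "Microtus arvalis", "mangal", 0.8),
]
master = merge_and_dedupe(globi, mangal)
print(len(master))  # 2: the shared pollination triple is kept once
```

Until names are reconciled to the GBIF backbone, this key only catches duplicates whose name strings match exactly, so the same interaction under two spellings survives as two rows.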

Deferred but coded: stage4_env (CHELSA / LUCAS → respondsTo* nodes), and serving. stage2_gbif extraction had a read-all-into-RAM bug that OOM-killed on the 100 GB CSV — fixed to stream with shutil.copyfileobj.
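The streaming fix can be illustrated with the standard library alone. This is a sketch of the general pattern rather than the stage2_gbif code: the function name, the member name in the usage comment and the chunk size are hypothetical.

```python
import shutil
import zipfile

def extract_member_streaming(zip_path, member, dest_path, chunk_size=16 * 1024 * 1024):
    """Copy one archive member to disk in fixed-size chunks.

    Peak memory stays near `chunk_size` however large the member is, unlike
    reading the whole member into a bytes object first.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as src, open(dest_path, "wb") as dst:
            shutil.copyfileobj(src, dst, length=chunk_size)

# Usage (paths are placeholders):
#   extract_member_streaming("gbif_download.zip", "occurrence.csv", "occurrence.csv")
```

The buggy shape is typically a single read of the whole member, which asks for the full uncompressed size in RAM at once; copying through a bounded buffer removes that requirement.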

Independent validation path (new): scripts/extract_gbif_niche.py streams the GBIF zip into per-species realized thermal niches via CHELSA bio1, then envai.validate_thermal computes Spearman of the model thermal-indicator placement against real occurrence temperature. envai.bio_smoke_eval adds a textbook-directional smoke test.
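For readers unfamiliar with the statistic, here is a dependency-free sketch of Spearman's rank correlation: rank both variables (ties share their mean rank), then take the Pearson correlation of the ranks. The input lists are made-up illustrations of "model thermal bin per species" against "mean bio1 at that species' occurrences"; envai.validate_thermal itself is not reproduced here and may well rely on a library routine such as scipy.stats.spearmanr.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical inputs, one entry per species.
model_bin = [0, 1, 1, 3, 4]
observed_temp_c = [2.1, 4.8, 5.5, 9.0, 11.3]
print(round(spearman(model_bin, observed_temp_c), 3))
```

A rank statistic suits this check because the bins are ordinal: it asks only whether species placed in warmer bins occur at warmer sites, not whether the relationship is linear.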

§08 — Embedding model & training

RotatE, tuned to run on one Blackwell card.

Training configuration
Model	RotatE, embedding_dim 128 (complex → 256 real dims/entity)
Loss	NSSALoss (self-adversarial), margin 9.0, adv. temperature 1.0
Optimizer	Adam, lr 1e-3
Loop	sLCWA, basic negative sampler, 64 negatives/positive, batch 2048
Epochs	300, early stopping (freq 20, patience 5, Δ 0.002, metric = MRR)
Split	80 / 10 / 10, random_state 42, inverse triples on
Eval	RankBasedEvaluator, bounded (batch 128, slice 2048) to dodge the WDDM watchdog
Comparator	ComplEx (trained) — underperformed RotatE overall (MRR 0.177 vs 0.283) and on every relation including the 1-to-N biosemiotic ones (see §09). Box / hyperbolic remains the open comparator.

Training the ~490K-triple / 74K-entity graph at dim 128 takes roughly 15–35 minutes on a single modern GPU (~2–5 s/epoch).
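The configuration table maps onto PyKEEN's pipeline() roughly as follows. The values come from the table; the keyword names are written from memory of PyKEEN's documented interface and should be treated as assumptions to verify against the installed version. The call itself is left as a comment, since it needs PyKEEN and the master triple file.

```python
# Values are from the training table; keyword names are assumed to match
# pykeen.pipeline.pipeline() and should be checked before use.
PIPELINE_KWARGS = dict(
    model="RotatE",
    model_kwargs=dict(embedding_dim=128),  # complex, so 256 real numbers per entity
    loss="NSSALoss",
    loss_kwargs=dict(margin=9.0, adversarial_temperature=1.0),
    optimizer="Adam",
    optimizer_kwargs=dict(lr=1e-3),
    training_loop="sLCWA",
    negative_sampler="basic",
    negative_sampler_kwargs=dict(num_negs_per_pos=64),
    training_kwargs=dict(num_epochs=300, batch_size=2048),
    stopper="early",
    stopper_kwargs=dict(
        frequency=20,
        patience=5,
        relative_delta=0.002,
        metric="inverse_harmonic_mean_rank",  # PyKEEN's name for MRR
    ),
    evaluator="RankBasedEvaluator",
    evaluation_kwargs=dict(batch_size=128, slice_size=2048),
)
SPLIT_RATIOS = [0.8, 0.1, 0.1]
SPLIT_RANDOM_STATE = 42

# Intended use (not executed here):
#   from pykeen.pipeline import pipeline
#   from pykeen.triples import TriplesFactory
#   tf = TriplesFactory.from_path("master.tsv", create_inverse_triples=True)  # path is a placeholder
#   train, valid, test = tf.split(SPLIT_RATIOS, random_state=SPLIT_RANDOM_STATE)
#   result = pipeline(training=train, validation=valid, testing=test, **PIPELINE_KWARGS)
print(PIPELINE_KWARGS["model"], PIPELINE_KWARGS["model_kwargs"]["embedding_dim"])
```

Bounding the evaluation batch and slice sizes trades speed for shorter individual GPU kernels, which is the lever the table describes for staying under the watchdog timeout.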

§09 — Results

What the geometry learned — and where it’s honest about plateauing.

Model snapshots · held-out test, filtered ranking
Snapshot	Triples	Entities	Rel.	MRR	H@1	H@3	H@10
stage1 — GLOBI only	368,983	58,167	21	0.2645	0.167	0.291	0.465
stage2 — +Mangal	444,979	68,183	25	0.2824	0.182	0.314	0.487
stage3 — +biosemiotic	489,871	74,147	27	0.2827	0.184	0.315	0.484

Mangal enrichment improved every metric (MRR +6.8%, H@10 +4.7% over GLOBI-only). Adding the biosemiotic layer left the headline metrics flat — expected, since the new relations are heavily 1-to-N, where RotatE is weak. Crucially, they did not degrade the conventional structure.
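The percentages quoted above are relative changes between rows of the snapshot table, and can be reproduced directly from it. The dictionary layout and function name are illustrative.

```python
# MRR and H@10 copied from the snapshot table.
SNAPSHOTS = {
    "stage1": {"mrr": 0.2645, "hits_at_10": 0.465},
    "stage2": {"mrr": 0.2824, "hits_at_10": 0.487},
    "stage3": {"mrr": 0.2827, "hits_at_10": 0.484},
}

def relative_change_pct(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

for metric in ("mrr", "hits_at_10"):
    mangal = relative_change_pct(SNAPSHOTS["stage1"][metric], SNAPSHOTS["stage2"][metric])
    biosem = relative_change_pct(SNAPSHOTS["stage2"][metric], SNAPSHOTS["stage3"][metric])
    print(f"{metric}: +Mangal {mangal:+.1f}%, +biosemiotic {biosem:+.1f}%")
# mrr: +Mangal +6.8%, +biosemiotic +0.1%
# hits_at_10: +Mangal +4.7%, +biosemiotic -0.6%
```

The second column is what "flat" means here: changes of well under one percent in either direction.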

[Figure: bar chart of per-relation MRR (stage-3, RotatE); legend: biosemiotic relation, conventional relation]

indicatorOf is placed in the geometry more consistently than pollination, symbiosis, mutualism and mycorrhizae; it is genuinely learned. keystoneSignProducerIn scores highest, but on only 17 test edges.

Read this before trusting the bars: both biosemiotic relations have small tail vocabularies (30 env-states, 90 communities), so tail-prediction is easy and inflates their scores relative to the 74K-entity species direction. The numbers show the relations are learnable, not that they capture independent ecological truth.
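One way to put a number on that inflation is the MRR a scorer would get by guessing uniformly among the valid candidates. If the true answer's rank is uniform on 1..n, the expected reciprocal rank is the harmonic number H_n divided by n. This is an illustrative floor under the assumption that a model has learned nothing beyond which nodes are valid tails; it is not a measurement of Runa-1.

```python
def random_guess_mrr(n_candidates):
    """Expected reciprocal rank when the true answer's rank is uniform on 1..n."""
    return sum(1.0 / rank for rank in range(1, n_candidates + 1)) / n_candidates

for label, n in [("env-states", 30), ("communities", 90), ("all entities", 74_147)]:
    print(f"{label:>12}: n={n:>6}  random-guess MRR = {random_guess_mrr(n):.4f}")
```

Under that assumption the floor is roughly 0.13 with 30 candidates and roughly 0.06 with 90, against about 0.0002 across all 74,147 entities, so the same MRR is far less impressive on a small tail set than on the species direction.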

§10 — Validation status & honest limitations

This section is deliberately strict. It’s what separates an honest artifact from an overclaim.

①

The biosemiotic ablation, as run, is circular — so it is not yet validation. The held-out sign-edges were generated by the same rule (EIVE values / centrality) the model is then credited with recovering. The §09 result proves the relations are learnable and internally consistent, not that they capture independent truth. A real ablation needs an independent route: a held-out region, or literature-attested indicators the model never trained on. This is the current top priority.

②

RotatE is weak on 1-to-N and hierarchy. The sign-relations are 1-to-N; the ComplEx comparator has been trained and underperformed RotatE on them (§08), so a box / hyperbolic comparator is still needed before trusting their geometry. Taxonomy should stay a structural constraint, never scored triples.

③

Small tail vocabularies inflate the biosemiotic metrics — see the §09 caveat.

④

No GBIF taxonomy reconciliation yet. Sources are joined by raw / normalised name strings; cross-source overlap is conservative (3,055 EIVE matches; ~1,900 shared Mangal/GLOBI names). This is the biggest pending quality lever.

⑤

Coverage gaps. GBIF co-occurrence and the CHELSA / LUCAS environmental layer are coded but not built; the bounding-box filter is coarse; counts are point-in-time.

⑥

Sign-edges are interpretive by construction — hypotheses the embedding helps test, documented with provenance, not ground truth.

§11 — The priority claim

The first trained relational embedding that operationalises biosemiotic relations for ecology.

Narrow and defensible — not “first biosemiotic AI” in general. Here is exactly what is true today:

✓

The embedding exists, is trained, and is on disk — this resolves the “model vs. plan” question.

✓

The conventional-ecology embedding is validated by held-out link prediction.

✓

Two biosemiotic relations are in the trained geometry and learnable.

△

Their independent validation (a non-circular ablation) is not done.

△

perceivesSignal, one of the three named relations, is not built.

For an honest deposit: the novelty lives in the relations, not the weights, so a Zenodo / EcoEvoRxiv deposit should include a hash-frozen snapshot of the exact triple set and the ETL scripts alongside the model and schema. Label it precisely: a validated ecological KGE with a learnable, not-yet-independently-validated biosemiotic layer.
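"Hash-frozen" can be as simple as publishing a SHA-256 digest next to each deposited file. A minimal sketch; the function name and the file name in the usage comment are placeholders, and the choice of SHA-256 is an assumption, since no hash function is named here.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large triple files never sit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage (file name is a placeholder):
#   print(sha256_of_file("master_stage3.tsv"))
```

Anyone re-running the ETL can then compare digests: a match means a byte-identical triple set, and a mismatch means the sources or the scripts have drifted since deposit.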

§12 — Roadmap

In priority order.

Independent (non-circular) biosemiotic eval (in progress)	Thermal axis under independent validation via GBIF occurrence × CHELSA bio1 (running). Remaining work: the other EIVE gradients, and an LPI × CHELSA population-trend axis. The result is the headline number that will close §09.
GBIF taxonomy reconciliation	Merge sources onto the GBIF backbone; the biggest connectivity and quality gain.
Box / hyperbolic comparator	For the 1-to-N biosemiotic relations. ComplEx is already trained and underperformed RotatE; box / hyperbolic remains open.
GBIF co-occurrence + CHELSA / LUCAS	The respondsTo* environmental layer (occurrence already downloaded).
Serving	Neo4j vector index with nearest-neighbour, analogical and hybrid queries for the nine agents.
HPO & a perceivesSignal build	Perception-only, from audiogram / opsin data.
Deposit	Hash-frozen triples + scripts + schema + model on Zenodo.