A geometric map of who relates to whom — the shared substrate beneath nine ecological agents.
Runa-1 takes open European ecological data and turns it into one geometric map of who relates to whom. Every species, and every environmental state, becomes a point in a shared space. Two points sit close together when the things they describe are ecologically related.
A predator lands near its prey’s neighbourhood. A pollinator settles near the plants it visits. A nitrogen-loving plant drifts toward the marker for high-nitrogen soil. Nothing is told where to go — the positions fall out of the relationships themselves.
It learns from typed triples: plain statements of the form (head, relation, tail). Given roughly 490,000 of them, the model arranges every point so that each relation becomes a consistent geometric move. Once that’s done, you can ask questions the source records never stated outright.
It is not a language model and not a forecaster. It is a knowledge-graph embedding (RotatE, trained with PyKEEN) — closer to a map than to a chatbot.
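To make "each relation becomes a consistent geometric move" concrete, here is a minimal sketch of the RotatE idea: every relation is a set of rotation angles, and a triple is plausible when rotating the head lands on the tail. The embeddings below are invented two-dimensional toy values, not output from the trained model, and the per-dimension sum of moduli is one common formulation rather than necessarily the exact norm used in training.

```python
import cmath
import math

def rotate_distance(head, phases, tail):
    """Distance between the rotated head and the tail, summed over dimensions.

    Each head coordinate is multiplied by e^{i*phase}, a pure rotation in the
    complex plane. A smaller distance means a more plausible triple.
    """
    return sum(
        abs(h * cmath.exp(1j * theta) - t)
        for h, theta, t in zip(head, phases, tail)
    )

# Toy complex embeddings (illustrative values only).
fox = [1 + 0j, 0 + 1j]
eats = [math.pi / 2, math.pi]   # one rotation angle per dimension
vole = [0 + 1j, 0 - 1j]         # exactly `fox` rotated by `eats`
oak = [1 + 0j, 1 + 0j]          # an unrelated point

d_true = rotate_distance(fox, eats, vole)   # close to zero
d_false = rotate_distance(fox, eats, oak)   # clearly larger
```

Ranking all candidate tails by this distance is what turns the map into a link predictor.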
"You shall know a species by the company it keeps." (after J. R. Firth, 1957)
On top of the ordinary ecological links — predation, pollination, mycorrhizae — Runa adds sign-relations: species treated as producers and readers of ecological signs within an Umwelt, the perceived world of an organism.
The novel move is small but consequential. These sign-relations are trained as ordinary typed edges, so the model places them in the same geometry as everything else. Two species that read the same environmental sign are pulled together — even if they never directly interact. That “relatedness by shared sign” is what the project is really about.
The intellectual lineage is explicit: Peirce’s index, Uexküll’s function-circle, Hoffmeyer’s semiotic scaffolding. Runa doesn’t try to decode any organism’s inner meaning; it models the relational structure of ecological sign-making, and leaves the inner world untouched.
Three snapshots exist — GLOBI-only, then +Mangal, then +biosemiotic layer. The figures above describe the third. Everything below details how it was built and, just as importantly, what is and isn’t yet proven.
Biological entities: taxa, keyed by scientific name (canonical binomials from GLOBI and Mangal; not yet reconciled to the GBIF backbone).
State & community nodes: the discrete non-biological targets that the sign-relations point at. These are environmental states such as soilNitrogen_b4 or thermalIndicator_b3, and detected communities community_0 … community_89.
Around 40 relation types, mapped to the OBO Relations Ontology where possible (PURLs taken from GLOBI’s authoritative ro.tsv): trophic (eats, preysOn), antagonistic (parasiteOf, pathogenOf), mutualistic (pollinates, ectomycorrhizalHostOf), structural (createsHabitatFor, epiphyteOf) and generic (interactsWith, coOccursWith).
| Relation | Meaning | Tail node |
|---|---|---|
| indicatorOf | species is a sign of an environmental state (Peircean index) | env-state |
| keystoneSignProducerIn | keystone sign-producer within a community (Hoffmeyer scaffolding) | community |
| perceivesSignal | perceptual boundary of the Umwelt (Uexküll) — not built | signal |
| phenologicalIndicatorOf | phenological sign — defined, not populated | phenophase |
One canonical direction is stored per relation; PyKEEN’s create_inverse_triples synthesises inverses at training time, so both directions score without double-counting. Inverse GLOBI labels (eatenBy, pollinatedBy) are flipped to canonical during ingestion.
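The flip to canonical direction can be sketched as follows. This is an illustration, not the project's ingestion code: the mapping lists only the two inverse labels named above, the function name is hypothetical, and the species pairs are illustrative examples.

```python
# Inverse label -> canonical label. Only the two pairs mentioned in the text;
# a real mapping would cover every inverse label in the source data.
INVERSE_TO_CANONICAL = {
    "eatenBy": "eats",
    "pollinatedBy": "pollinates",
}

def canonicalise(head, relation, tail):
    """Return the triple in canonical direction, swapping head and tail if needed."""
    canonical = INVERSE_TO_CANONICAL.get(relation)
    if canonical is None:
        return head, relation, tail      # already canonical
    return tail, canonical, head         # flip direction and relabel

raw = [
    ("Microtus arvalis", "eatenBy", "Vulpes vulpes"),
    ("Vulpes vulpes", "eats", "Microtus arvalis"),   # same fact, other direction
    ("Trifolium pratense", "pollinatedBy", "Bombus pascuorum"),
]

# A set removes the duplicate that appears once both rows share a direction.
canonical_triples = sorted({canonicalise(*triple) for triple in raw})
```

Storing one direction and letting the trainer synthesise inverses keeps the same fact from being counted twice in the training and test splits.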
| Source | Role | License | In-scope triples |
|---|---|---|---|
| GLOBI 2026-05-29 | typed interaction triples (the spine) | CC0 / CC-BY per dataset | 368,983 |
| Mangal v2 API | whole-network structure | open · per-network DOI | 79,021 |
| EIVE 1.0 | Ellenberg values → indicatorOf | CC-BY 4.0 | 44,590 |
| graph centrality (internal) | → keystoneSignProducerIn | derived | 302 |
| GBIF | occurrence + taxonomy backbone | per-record CC0 / CC-BY | streamed |
EIVE is obtained via Zenodo (DOI 10.5281/zenodo.7427088). The GBIF download is a 69 GB zip (~300M+ occurrence rows; 10 countries, records with coordinates, CC0 / CC-BY) and is now streamed for the independent thermal-niche validation (§09). Full co-occurrence and environmental-layer use is still deferred.
All nine ENVAI agents are covered, France/Ondine included. GLOBI is filtered by a coarse W-Europe/Nordic bounding box (lat 41–71.5, lon −25–32), to be refined by reverse-geocoding.
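The coarse bounding-box filter amounts to a two-range check. The sketch below uses the bounds quoted above; the function name is hypothetical and treating the edges as inclusive is an assumption.

```python
# Bounds from the text: lat 41 to 71.5, lon -25 to 32 (decimal degrees).
LAT_MIN, LAT_MAX = 41.0, 71.5
LON_MIN, LON_MAX = -25.0, 32.0

def in_scope(lat, lon):
    """True when a coordinate falls inside the W-Europe/Nordic box (edges inclusive)."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

paris_ok = in_scope(48.86, 2.35)        # inside the box
reykjavik_ok = in_scope(64.15, -21.94)  # inside the box
cairo_ok = in_scope(30.04, 31.24)       # latitude too far south
```

A rectangle like this inevitably includes open ocean and neighbouring regions, which is why reverse-geocoding is listed as the refinement.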
Each stage is a module under src/envai/; every triple file carries [head, relation, tail, source, confidence].
Deferred but coded: stage4_env (CHELSA / LUCAS → respondsTo* nodes), and serving. stage2_gbif extraction had a read-all-into-RAM bug that OOM-killed on the 100 GB CSV — fixed to stream with shutil.copyfileobj.
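A streaming extraction along the lines of that fix might look like the sketch below. It is not the project's stage2_gbif code: the function name, the chunk size and the member name in the usage note are assumptions. The key point is that `shutil.copyfileobj` copies in fixed-size chunks, so memory use stays bounded regardless of file size.

```python
import os
import shutil
import zipfile

def extract_member_streaming(zip_path, member, out_path, chunk_bytes=16 * 1024 * 1024):
    """Copy one archive member to disk in chunks instead of reading it whole.

    Returns the number of bytes written.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # archive.open() yields a file-like object that decompresses lazily.
        with archive.open(member) as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst, chunk_bytes)
    return os.path.getsize(out_path)
```

A call such as `extract_member_streaming("gbif.zip", "occurrence.txt", "occurrence.txt")` would use placeholder names; the real archive layout is not described here.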
Independent validation path (new): scripts/extract_gbif_niche.py streams the GBIF zip into per-species realized thermal niches via CHELSA bio1, then envai.validate_thermal computes Spearman of the model thermal-indicator placement against real occurrence temperature. envai.bio_smoke_eval adds a textbook-directional smoke test.
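The Spearman check can be illustrated with a small standard-library implementation. This is a sketch of the statistic, not the envai.validate_thermal module: it assumes the model's placement can be summarised as one ordinal bin index per species, and the numbers are invented for the example.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented example: model thermal bin per species vs. mean occurrence temperature (deg C).
model_bin = [1, 2, 3, 4, 5]
observed_temp = [4.1, 6.0, 5.5, 9.2, 11.0]
rho = spearman(model_bin, observed_temp)
```

Because only rank order matters, the test does not require the model's bins to be on the same scale as degrees Celsius, only that warmer-placed species really occur in warmer places.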
| Setting | Value |
|---|---|
| Model | RotatE, embedding_dim 128 (complex → 256 real dims/entity) |
| Loss | NSSALoss (self-adversarial), margin 9.0, adv. temperature 1.0 |
| Optimizer | Adam, lr 1e-3 |
| Loop | sLCWA, basic negative sampler, 64 negatives/positive, batch 2048 |
| Epochs | 300, early stopping (freq 20, patience 5, Δ 0.002, metric = MRR) |
| Split | 80 / 10 / 10, random_state 42, inverse triples on |
| Eval | RankBasedEvaluator, bounded (batch 128, slice 2048) to dodge the WDDM watchdog |
| Comparator | ComplEx (trained) — underperformed RotatE overall (MRR 0.177 vs 0.283) and on every relation including the 1-to-N biosemiotic ones (see §09). Box / hyperbolic remains the open comparator. |
Training the ~490K-triple / 74K-entity graph at dim 128 takes roughly 15–35 minutes on a single modern GPU (~2–5 s/epoch).
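For readers who want to reproduce a comparable run, the table maps fairly directly onto keyword arguments of PyKEEN's `pipeline` function. The dictionary below is a sketch of that mapping, not the project's training script: reading "Δ 0.002" as `relative_delta` is an assumption, the file path in the comments is a placeholder, and some argument names (notably `slice_size`) may differ between PyKEEN versions.

```python
# Keyword arguments mirroring the hyperparameter table.
pipeline_kwargs = dict(
    model="RotatE",
    model_kwargs=dict(embedding_dim=128),
    loss="NSSALoss",
    loss_kwargs=dict(margin=9.0, adversarial_temperature=1.0),
    optimizer="Adam",
    optimizer_kwargs=dict(lr=1e-3),
    training_loop="sLCWA",
    negative_sampler="basic",
    negative_sampler_kwargs=dict(num_negs_per_pos=64),
    training_kwargs=dict(num_epochs=300, batch_size=2048),
    stopper="early",
    # Assumption: the table's delta is the stopper's relative improvement threshold.
    stopper_kwargs=dict(frequency=20, patience=5, relative_delta=0.002, metric="mrr"),
    evaluator="RankBasedEvaluator",
    evaluation_kwargs=dict(batch_size=128, slice_size=2048),
)

# Each complex coordinate stores a real and an imaginary part.
real_dims_per_entity = 2 * pipeline_kwargs["model_kwargs"]["embedding_dim"]

# Hypothetical usage, assuming PyKEEN is installed and "triples.tsv" (placeholder)
# holds tab-separated head/relation/tail rows:
#   from pykeen.pipeline import pipeline
#   from pykeen.triples import TriplesFactory
#   tf = TriplesFactory.from_path("triples.tsv", create_inverse_triples=True)
#   train, valid, test = tf.split([0.8, 0.1, 0.1], random_state=42)
#   result = pipeline(training=train, validation=valid, testing=test, **pipeline_kwargs)
```

Keeping the configuration as plain data makes it easy to log alongside each snapshot's metrics.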
| Snapshot | Triples | Entities | Rel. | MRR | H@1 | H@3 | H@10 |
|---|---|---|---|---|---|---|---|
| stage1 — GLOBI only | 368,983 | 58,167 | 21 | 0.2645 | 0.167 | 0.291 | 0.465 |
| stage2 — +Mangal | 444,979 | 68,183 | 25 | 0.2824 | 0.182 | 0.314 | 0.487 |
| stage3 — +biosemiotic | 489,871 | 74,147 | 27 | 0.2827 | 0.184 | 0.315 | 0.484 |
Mangal enrichment improved every metric (MRR +6.8%, H@10 +4.7% over GLOBI-only). Adding the biosemiotic layer left the headline metrics essentially flat (MRR 0.2824 → 0.2827, H@10 0.487 → 0.484). That is expected, since the new relations are heavily 1-to-N, where RotatE is weak. Crucially, the new layer did not degrade the conventional structure.
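The quoted percentages are relative gains, which can be recomputed from the snapshot table. The helper name below is hypothetical; the input values are copied from the stage1 and stage2 rows.

```python
def relative_gain(before, after):
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100.0

stage1 = {"MRR": 0.2645, "H@10": 0.465}   # GLOBI only
stage2 = {"MRR": 0.2824, "H@10": 0.487}   # + Mangal

gains = {metric: round(relative_gain(stage1[metric], stage2[metric]), 1)
         for metric in stage1}
```

In absolute terms the same changes are about 0.018 MRR and 0.022 H@10, so it is worth stating which convention a reported gain uses.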
indicatorOf is placed in the geometry more consistently than pollination, symbiosis, mutualism and mycorrhizae, which suggests the model has genuinely learned it. keystoneSignProducerIn scores highest, but on only 17 test edges, so that result rests on a very small sample.
Narrow and defensible — not “first biosemiotic AI” in general. Here is exactly what is true today: