Biosemiotic knowledge-graph embedding · European ecology

Runa‑1

A geometric map of who relates to whom — the shared substrate beneath nine ecological agents.

What: A trained RotatE embedding over ~490,000 ecological relationships
Status: Trained & on disk · conventional layer validated · sign-layer learnable, not yet independently validated
Built on: PyKEEN · open European data
Last updated: 2026-06-09 (update-001)
§01 — What it is, plainly

Decades of ecological records, folded into a single space.

Runa-1 takes open European ecological data and turns it into one geometric map of who relates to whom. Every species, and every environmental state, becomes a point in a shared space. Two points sit close together when the things they describe are ecologically related.

A predator lands near its prey’s neighbourhood. A pollinator settles near the plants it visits. A nitrogen-loving plant drifts toward the marker for high-nitrogen soil. Nothing is told where to go — the positions fall out of the relationships themselves.

It learns from typed triples: plain statements of the form (head, relation, tail). Given roughly 490,000 of them, the model arranges every point so that each relation becomes a consistent geometric move. Once that’s done, you can ask questions the source records never stated outright:

One training statement: Bombus terrestris pollinates Salix caprea. ~490,000 statements like this become rotations in one shared geometry.
Nearest-neighbour: “What is ecologically closest to this species?” Read straight off proximity.
Relational: “Who might this species pollinate?” Follow the relation as a geometric operation, head ∘ relation → tail.
Gap-filling: Plausible links nobody recorded, suggested by analogy to better-documented systems.
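To make "head ∘ relation → tail" concrete, here is a minimal sketch of a RotatE-style score in plain Python. It assumes each entity is a vector of complex numbers and each relation a vector of rotation angles. The toy vectors are invented for illustration and are not Runa-1 weights, and the exact norm used by the trained model is an implementation detail not stated here.

```python
import cmath
import math

def rotate_score(head, phases, tail):
    """Negative distance between the rotated head and the tail.

    head, tail: lists of complex numbers, one per embedding dimension.
    phases: list of rotation angles in radians, one per dimension.
    Scores closer to 0 mean the triple fits the geometry better.
    """
    distance = 0.0
    for h, theta, t in zip(head, phases, tail):
        rotated = h * cmath.exp(1j * theta)  # unit-modulus rotation of the head
        distance += abs(rotated - t)         # how far the rotation lands from the tail
    return -distance

# Toy 2-dimensional embeddings (made-up numbers).
bombus = [1 + 0j, 0 + 1j]
pollinates = [math.pi / 2, math.pi]  # rotate by 90 and 180 degrees
salix = [0 + 1j, 0 - 1j]             # exactly where the rotation lands
other = [1 + 0j, 1 + 0j]             # an unrelated point

good = rotate_score(bombus, pollinates, salix)
bad = rotate_score(bombus, pollinates, other)
```

Ranking every candidate tail by this score is what turns "Who might this species pollinate?" into a geometric lookup.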

It is not a language model and not a forecaster. It is a knowledge-graph embedding (RotatE, trained with PyKEEN) — closer to a map than to a chatbot.

You shall know a species by the company it keeps. (after J. R. Firth, 1957)
§02 — Why “biosemiotic”

Species relate not only by contact, but by the signs they share.

On top of the ordinary ecological links — predation, pollination, mycorrhizae — Runa adds sign-relations: species treated as producers and readers of ecological signs within an Umwelt, the perceived world of an organism.

The novel move is small but consequential. These sign-relations are trained as ordinary typed edges, so the model places them in the same geometry as everything else. Two species that read the same environmental sign are pulled together — even if they never directly interact. That “relatedness by shared sign” is what the project is really about.

The intellectual lineage is explicit: Peirce’s index, Uexküll’s function-circle, Hoffmeyer’s semiotic scaffolding. Runa doesn’t try to decode any organism’s inner meaning; it models the relational structure of ecological sign-making, and leaves the inner world untouched.

§03 — At a glance

The artifact, in numbers.

489,871 typed triples (stage-3 master)
74,147 entities in one vector space
27 / 52 relation types carrying data
0.2827 held-out MRR (filtered)
128d RotatE embedding dimension
10 countries of geographic scope

Three snapshots exist — GLOBI-only, then +Mangal, then +biosemiotic layer. The figures above describe the third. Everything below details how it was built and, just as importantly, what is and isn’t yet proven.

§04 — Entities & relations

Two kinds of node, two layers of relation.

The nodes

Biological entities: taxa, keyed by scientific name (canonical binomials from GLOBI and Mangal; not yet reconciled to the GBIF backbone).

State & community nodes: the discrete non-biological targets the sign-relations point at, namely environmental states like soilNitrogen_b4 or thermalIndicator_b3, and detected communities community_0 … community_89.

Layer 1 — conventional ecology

Around 40 relation types, mapped to the OBO Relations Ontology where possible (PURLs taken from GLOBI’s authoritative ro.tsv): trophic (eats, preysOn), antagonistic (parasiteOf, pathogenOf), mutualistic (pollinates, ectomycorrhizalHostOf), structural (createsHabitatFor, epiphyteOf) and generic (interactsWith, coOccursWith).

Layer 2 — biosemiotic (the novel contribution)

ENVAI-namespaced sign-relations
Relation | Meaning | Tail node
indicatorOf | species is a sign of an environmental state (Peircean index) | env-state
keystoneSignProducerIn | keystone sign-producer within a community (Hoffmeyer scaffolding) | community
perceivesSignal | perceptual boundary of the Umwelt (Uexküll), not built | signal
phenologicalIndicatorOf | phenological sign, defined but not populated | phenophase

One canonical direction is stored per relation; PyKEEN’s create_inverse_triples synthesises inverses at training time, so both directions score without double-counting. Inverse GLOBI labels (eatenBy, pollinatedBy) are flipped to canonical during ingestion.
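The flip to canonical direction can be sketched as a small lookup applied at ingestion. The mapping below lists only the two inverse labels named above; the function name and the shape of the mapping are illustrative assumptions, not the project's actual code.

```python
# Inverse label -> canonical label. Only the two pairs named in the text
# are listed here; the real vocabulary is larger.
INVERSE_TO_CANONICAL = {
    "eatenBy": "eats",
    "pollinatedBy": "pollinates",
}

def canonicalise(head, relation, tail):
    """Return the triple in canonical direction, swapping head and tail if needed."""
    if relation in INVERSE_TO_CANONICAL:
        return tail, INVERSE_TO_CANONICAL[relation], head
    return head, relation, tail

# The same fact recorded in both directions collapses to one stored triple.
raw = [
    ("Bombus terrestris", "pollinates", "Salix caprea"),
    ("Salix caprea", "pollinatedBy", "Bombus terrestris"),
]
stored = {canonicalise(*t) for t in raw}
```

Storing one direction and letting the training library synthesise inverses keeps a fact from being counted twice.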

§05 — Deriving the sign-edges

The signs aren’t in any database. They’re derived — so the derivation is stated, reproducible, and hash-frozen at deposit.

indicatorOf ← EIVE / Ellenberg
Each European vascular plant’s published indicator value on six gradients — Light, Temperature, Moisture, Reaction/pH, Nutrients/N, Salinity — binned into five fixed classes over 0–10. These are published indicator values (expert-harmonised rankings), not raw site measurements: legitimate “measured signs” in the bioindication tradition. 3,055 of 8,908 EIVE plants matched existing graph species, wiring the edges into the interaction web.
keystoneSignProducerIn ← graph centrality
No usable external keystone dataset exists. Keystone-ness is operationalised as within-community degree centrality above the 98th percentile on the real interaction graph (Louvain communities). Honest framing: “topologically central within a detected community” — a defensible proxy, not an experimentally demonstrated keystone.
perceivesSignal ← deliberately deferred
Perception, not production. No species→perceived-signal dataset exists at scale — only slivers (bird UV-vision from opsin genetics; ~34 species in the Animal Audiograms Database). Xeno-Canto documents signal production, which is not a proxy for perception. Rather than fabricate edges, this relation is left for a future, honestly-scoped perception-only build.
Open methodological choice: The Notion spec describes an alternative indicatorOf route (Living Planet Index population trends × CHELSA climate shifts). EIVE is the niche / literature-attested route; the two are complementary, and the choice is held open (see §10).
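The indicatorOf binning described above can be sketched as follows. This assumes equal-width bins over 0–10 and a zero-based bin index in the node name; the text fixes five classes but does not state the bin edges or the index base, so both are assumptions. The indicator value in the example is illustrative and not taken from EIVE.

```python
def bin_index(value, n_bins=5, low=0.0, high=10.0):
    """Map an indicator value on [low, high] to a bin index 0..n_bins-1.

    Equal-width bins are an assumption; the top edge is folded into the last bin.
    """
    if not low <= value <= high:
        raise ValueError(f"indicator value {value} outside [{low}, {high}]")
    width = (high - low) / n_bins
    return min(int((value - low) // width), n_bins - 1)

def indicator_triple(plant, gradient, value):
    """Emit (plant, indicatorOf, gradient_bin) for one gradient value."""
    return (plant, "indicatorOf", f"{gradient}_b{bin_index(value)}")

# Illustrative value only: a nitrogen-loving plant with a high Nutrients/N score.
example = indicator_triple("Urtica dioica", "soilNitrogen", 8.5)
```

Six gradients with five bins each give the 30 environmental-state nodes the sign-edges can point at.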
§06 — Data & provenance

Open sources, filtered to a northern-European window.

Sources feeding the stage-3 master
Source | Role | License | In-scope
GLOBI 2026-05-29 | typed interaction triples (the spine) | CC0 / CC-BY per dataset | 368,983
Mangal v2 API | whole-network structure | open · per-network DOI | 79,021
EIVE 1.0 | Ellenberg values → indicatorOf | CC-BY 4.0 | 44,590
graph centrality (internal) | keystoneSignProducerIn | derived | 302
GBIF | occurrence + taxonomy backbone | per-record CC0 / CC-BY | streamed

EIVE via Zenodo 10.5281/zenodo.7427088. GBIF: the 69 GB zip (~300M+ occurrence rows; 10-country, coords, CC0 / CC-BY) is now streamed for the independent thermal-niche validation (§09). Full co-occurrence and environmental-layer use is still deferred.

SE · NO · FI · DK · IS · NL · DE · FR · BE · CH

All nine ENVAI agents are covered, France/Ondine included. GLOBI is filtered by a coarse W-Europe/Nordic bounding box (lat 41–71.5, lon −25–32), to be refined by reverse-geocoding.
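A minimal sketch of the bounding-box filter, using the limits quoted above. The function name is hypothetical. The last example shows why the box is called coarse: a point can pass the box test while lying outside the ten listed countries.

```python
# Bounding box from the text: lat 41 to 71.5, lon -25 to 32 (decimal degrees).
LAT_MIN, LAT_MAX = 41.0, 71.5
LON_MIN, LON_MAX = -25.0, 32.0

def in_scope(lat, lon):
    """True if the coordinate falls inside the coarse W-Europe/Nordic box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

stockholm = in_scope(59.33, 18.07)  # Sweden: inside the box
madrid = in_scope(40.42, -3.70)     # south of lat 41: excluded
warsaw = in_scope(52.23, 21.01)     # inside the box, but Poland is not in scope
```

Reverse-geocoding each record to a country would remove cases like the last one.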

§07 — Pipeline (ETL)

From raw records to one master triple file.

Each stage is a module under src/envai/; every triple file carries [head, relation, tail, source, confidence].
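The five-field record can be pictured as a small typed row. In the sketch below, the numeric type of confidence and the rule of keeping the highest-confidence duplicate are assumptions made for illustration; the text only fixes the field names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str
    source: str        # e.g. "GLOBI", "Mangal", "EIVE"
    confidence: float  # numeric confidence is an assumption

def dedupe(triples):
    """Keep one row per (head, relation, tail), preferring higher confidence.

    The tie-breaking policy is an illustrative assumption.
    """
    best = {}
    for t in triples:
        key = (t.head, t.relation, t.tail)
        if key not in best or t.confidence > best[key].confidence:
            best[key] = t
    return list(best.values())

rows = [
    Triple("Bombus terrestris", "pollinates", "Salix caprea", "GLOBI", 0.9),
    Triple("Bombus terrestris", "pollinates", "Salix caprea", "Mangal", 0.7),
]
master = dedupe(rows)
```

Carrying source and confidence on every row is what lets a later stage trace any edge back to where it came from.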

  1. stage1_globi: Stream GLOBI, filter to scope, map raw interaction names to the canonical vocabulary (collapsing inverses), dedupe.
  2. stage3_mangal: Walk the Mangal API (network / node / interaction), map types to the vocabulary.
  3. stage_indicator: Parse EIVE, bin each 0–10 gradient into 5 classes, match plants to graph entities, emit (plant, indicatorOf, gradient_bin).
  4. stage_keystone: Build the interaction graph, run Louvain community detection, take the top 2% within-community degree → (species, keystoneSignProducerIn, community).
  5. stage5_taxonomy: Concatenate and dedupe sources into a master. GBIF name reconciliation is implemented but deferred (--no-reconcile so far).
  6. train · eval_per_relation: RotatE / ComplEx via PyKEEN, then a per-relation metric breakdown.
  7. serve_neo4j: Push embeddings to a Neo4j vector index for queries (built, not yet run).
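Step 4 above can be sketched without any graph library once community labels exist (for instance from a Louvain implementation such as NetworkX's louvain_communities). The sketch assumes the 98th-percentile cut-off is taken separately inside each community using a nearest-rank percentile; the text does not say whether the threshold is per-community or global, so treat that as an assumption.

```python
import math
from collections import defaultdict

def keystone_edges(edges, community_of, percentile=98):
    """Emit keystoneSignProducerIn triples for nodes whose within-community
    degree is strictly above the given percentile of their own community."""
    degree = defaultdict(int)
    for u, v in edges:
        # Count only edges whose two endpoints share a community.
        if u != v and community_of[u] == community_of[v]:
            degree[u] += 1
            degree[v] += 1

    members = defaultdict(list)
    for node, comm in community_of.items():
        members[comm].append(node)

    triples = []
    for comm, nodes in members.items():
        degs = sorted(degree[n] for n in nodes)
        # Nearest-rank percentile: the value at rank ceil(p/100 * n), 1-based.
        cutoff = degs[math.ceil(percentile * len(degs) / 100) - 1]
        for n in nodes:
            if degree[n] > cutoff:
                triples.append((n, "keystoneSignProducerIn", f"community_{comm}"))
    return triples

# Toy community: one hub linked to 99 otherwise unconnected species.
toy_community = {"hub": 0, **{f"s{i}": 0 for i in range(1, 100)}}
toy_edges = [("hub", f"s{i}") for i in range(1, 100)]
toy_result = keystone_edges(toy_edges, toy_community)
```

In the toy community only the hub clears the cut-off, which matches the "topologically central within a detected community" framing rather than any experimental notion of keystone.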

Deferred but coded: stage4_env (CHELSA / LUCAS → respondsTo* nodes), and serving. stage2_gbif extraction had a read-all-into-RAM bug that OOM-killed on the 100 GB CSV — fixed to stream with shutil.copyfileobj.
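The streaming fix can be illustrated with the standard library alone. This is a sketch of the general technique rather than the project's stage2_gbif code; the function name, member name and chunk size are placeholders.

```python
import shutil
import zipfile

def stream_member(zip_path, member, out_path, chunk_bytes=1024 * 1024):
    """Copy one file out of a zip archive in fixed-size chunks.

    Memory use stays near chunk_bytes regardless of how large the member is,
    because the data is never read into RAM as a whole.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst, length=chunk_bytes)
```

A call such as stream_member("gbif_download.zip", "occurrence.csv", "occurrence.csv") would extract a hypothetical member without the read-all-into-RAM pattern that triggered the OOM kill.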

Independent validation path (new): scripts/extract_gbif_niche.py streams the GBIF zip into per-species realized thermal niches via CHELSA bio1, then envai.validate_thermal computes Spearman of the model thermal-indicator placement against real occurrence temperature. envai.bio_smoke_eval adds a textbook-directional smoke test.
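The statistic at the heart of the thermal validation is Spearman's rank correlation. Below is a dependency-free sketch with average ranks for ties; in practice a library routine such as scipy.stats.spearmanr would be the usual choice. The example numbers are invented, and the pairing of a model thermal bin with a mean occurrence temperature per species is an assumption about how envai.validate_thermal lines up its inputs.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Invented data: model thermal bin per species vs mean occurrence temperature (°C).
model_bin = [0, 1, 2, 3, 4]
occurrence_temp = [2.1, 5.0, 7.9, 9.4, 12.3]
rho = spearman(model_bin, occurrence_temp)
```

A rank statistic suits this check because the binned indicator is ordinal: only the ordering of species along the thermal axis is being tested, not the spacing.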

§08 — Embedding model & training

RotatE, tuned to run on one Blackwell card.

Training configuration
Model: RotatE, embedding_dim 128 (complex → 256 real dims/entity)
Loss: NSSALoss (self-adversarial), margin 9.0, adv. temperature 1.0
Optimizer: Adam, lr 1e-3
Loop: sLCWA, basic negative sampler, 64 negatives/positive, batch 2048
Epochs: 300, early stopping (freq 20, patience 5, Δ 0.002, metric = MRR)
Split: 80 / 10 / 10, random_state 42, inverse triples on
Eval: RankBasedEvaluator, bounded (batch 128, slice 2048) to dodge the WDDM watchdog
Comparator: ComplEx (trained), which underperformed RotatE overall (MRR 0.177 vs 0.283) and on every relation including the 1-to-N biosemiotic ones (see §09). Box / hyperbolic remains the open comparator.

Training the ~490K-triple / 74K-entity graph at dim 128 takes roughly 15–35 minutes on a single modern GPU (~2–5 s/epoch).
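The configuration above could be expressed as keyword arguments for PyKEEN's pipeline function roughly as follows. The values come from the table; the keyword names reflect one reading of the PyKEEN API and should be checked against the installed version, in particular the early-stopping metric name and the mapping of Δ to relative_delta. The helper name is hypothetical.

```python
def build_pipeline_kwargs():
    """Training settings from the configuration table, as pipeline keyword arguments.

    Keyword names are assumptions about the PyKEEN API; values are from the text.
    """
    return dict(
        model="RotatE",
        model_kwargs=dict(embedding_dim=128),
        loss="NSSALoss",
        loss_kwargs=dict(margin=9.0, adversarial_temperature=1.0),
        optimizer="Adam",
        optimizer_kwargs=dict(lr=1e-3),
        training_loop="sLCWA",
        negative_sampler="basic",
        negative_sampler_kwargs=dict(num_negs_per_pos=64),
        training_kwargs=dict(num_epochs=300, batch_size=2048),
        stopper="early",
        stopper_kwargs=dict(
            frequency=20,
            patience=5,
            relative_delta=0.002,
            metric="mean_reciprocal_rank",
        ),
        evaluator="RankBasedEvaluator",
        # Bounded evaluation batches, as in the table's Eval row.
        evaluation_kwargs=dict(batch_size=128, slice_size=2048),
    )

config = build_pipeline_kwargs()
```

With the training, validation and testing triples factories prepared separately (an 80 / 10 / 10 split with random_state 42 and inverse triples enabled), the call would look like pipeline(training=..., validation=..., testing=..., **config).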

§09 — Results

What the geometry learned — and where it’s honest about plateauing.

Model snapshots · held-out test, filtered ranking
Snapshot | Triples | Entities | Rel. | MRR | H@1 | H@3 | H@10
stage1, GLOBI only | 368,983 | 58,167 | 21 | 0.2645 | 0.167 | 0.291 | 0.465
stage2, +Mangal | 444,979 | 68,183 | 25 | 0.2824 | 0.182 | 0.314 | 0.487
stage3, +biosemiotic | 489,871 | 74,147 | 27 | 0.2827 | 0.184 | 0.315 | 0.484

Mangal enrichment improved every metric (MRR +6.8%, H@10 +4.7% over GLOBI-only). Adding the biosemiotic layer left the headline metrics flat — expected, since the new relations are heavily 1-to-N, where RotatE is weak. Crucially, they did not degrade the conventional structure.
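For readers less familiar with the metrics, the sketch below shows how MRR and Hits@k are computed from a list of ranks, and how the relative gains quoted above follow from the snapshot table. The rank list is invented; the MRR and H@10 figures are the ones from the table.

```python
def mrr(ranks):
    """Mean reciprocal rank: the average of 1/rank over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, k):
    """Fraction of test triples whose true answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def relative_gain(new, old):
    """Percentage change of new over old."""
    return (new - old) / old * 100.0

toy_ranks = [1, 2, 5, 20]  # invented ranks of the true entity
toy_mrr = mrr(toy_ranks)
toy_h10 = hits_at(toy_ranks, 10)

# Snapshot table: stage2 (+Mangal) against stage1 (GLOBI only).
mrr_gain = relative_gain(0.2824, 0.2645)
h10_gain = relative_gain(0.487, 0.465)
```

Rounded to one decimal, the two gains come out at 6.8 and 4.7 percent, matching the figures in the text.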

Per-relation MRR (stage-3, RotatE)

[Bar chart not reproduced. Legend: biosemiotic relation · conventional relation]

indicatorOf is placed in the geometry more consistently than pollination, symbiosis, mutualism and mycorrhizae — it is genuinely learned. keystone scores highest, but on only 17 test edges.

Read this before trusting the bars: Both biosemiotic relations have small tail vocabularies (30 env-states, 90 communities), so tail-prediction is easy and inflates their scores relative to the 74K-entity species direction. The numbers show the relations are learnable, not that they capture independent ecological truth.
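A quick way to see the size of this effect is the MRR a model would get by ranking candidates uniformly at random. The sketch below assumes the true answer is equally likely to land at any rank among N candidates and ignores filtering; it is a back-of-envelope baseline, not a number reported by the project.

```python
def random_mrr(n_candidates):
    """Expected MRR when the true answer is equally likely at any rank 1..N.

    Equals the N-th harmonic number divided by N.
    """
    return sum(1.0 / k for k in range(1, n_candidates + 1)) / n_candidates

env_state_baseline = random_mrr(30)    # tail vocabulary of indicatorOf
community_baseline = random_mrr(90)    # tail vocabulary of keystoneSignProducerIn
entity_baseline = random_mrr(74_147)   # full entity vocabulary
```

The three baselines are roughly 0.13, 0.056 and 0.00016, so a per-relation MRR on a 30-way task starts from a floor several hundred times higher than one on the full entity set.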
§10 — Validation status & honest limitations

This section is deliberately strict. It’s what separates an honest artifact from an overclaim.

The biosemiotic ablation, as run, is circular — so it is not yet validation. The held-out sign-edges were generated by the same rule (EIVE values / centrality) the model is then credited with recovering. The §09 result proves the relations are learnable and internally consistent, not that they capture independent truth. A real ablation needs an independent route: a held-out region, or literature-attested indicators the model never trained on. This is the current top priority.
RotatE is weak on 1-to-N and hierarchy. The sign-relations are 1-to-N; the ComplEx comparator has been trained and underperformed RotatE (§08), so a box / hyperbolic comparator is still needed before trusting their geometry. Taxonomy should stay a structural constraint, never scored triples.
Small tail vocabularies inflate the biosemiotic metrics — see the §09 caveat.
No GBIF taxonomy reconciliation yet. Sources are joined by raw / normalised name strings; cross-source overlap is conservative (3,055 EIVE matches; ~1,900 shared Mangal/GLOBI names). This is the biggest pending quality lever.
Coverage gaps. GBIF co-occurrence and the CHELSA / LUCAS environmental layer are coded but not built; the bounding-box filter is coarse; counts are point-in-time.
Sign-edges are interpretive by construction — hypotheses the embedding helps test, documented with provenance, not ground truth.
§11 — The priority claim

The first trained relational embedding that operationalizes biosemiotic relations for ecology.

Narrow and defensible — not “first biosemiotic AI” in general. Here is exactly what is true today:

The embedding exists, is trained, and is on disk — this resolves the “model vs. plan” question.
The conventional-ecology embedding is validated by held-out link prediction.
Two biosemiotic relations are in the trained geometry and learnable.
Their independent validation (a non-circular ablation) is not done.
perceivesSignal, one of the three named relations, is not built.
For an honest deposit: The novelty lives in the relations, not the weights, so a Zenodo / EcoEvoRxiv deposit should include a hash-frozen snapshot of the exact triple set and the ETL scripts alongside the model and schema. Label it precisely: a validated ecological KGE with a learnable, not-yet-independently-validated biosemiotic layer.
§12 — Roadmap

In priority order.

  1. Independent (non-circular) biosemiotic eval (in progress): Thermal axis under independent validation via GBIF occurrence × CHELSA bio1 (running). Remaining work: the other EIVE gradients, and an LPI × CHELSA population-trend axis. Result is the headline number that will close §09.
  2. GBIF taxonomy reconciliation: Merge sources onto the GBIF backbone; the biggest connectivity and quality gain.
  3. Box / hyperbolic comparator: For the 1-to-N biosemiotic relations. ComplEx already trained and underperformed RotatE; box / hyperbolic remains open.
  4. GBIF co-occurrence + CHELSA / LUCAS: The respondsTo* environmental layer (occurrence already downloaded).
  5. Serving: Neo4j vector index with nearest-neighbour, analogical and hybrid queries for the nine agents.
  6. HPO & a perceivesSignal build: Perception-only, from audiogram / opsin data.
  7. Deposit: Hash-frozen triples + scripts + schema + model on Zenodo.