Semantic Indexing & Vocabulary Quality
How the problem was discovered
As Indexly databases grew, users reported that:
- searches felt noisy
- relevance degraded over time
- common numbers dominated results
To diagnose this, Indexly introduced:
indexly analyze-db fts.index.db --table file_index_vocab
This exposed the actual vocabulary used by FTS, not assumptions.
The results were clear:
- ~75% of indexed terms appeared in only 1–2 documents
- numeric-only tokens dominated frequency
- ranking behavior became statistically unstable
The root cause
FTS was not malfunctioning. Indexly was indexing everything equally, including:
- timestamps
- EXIF data
- dimensions
- IDs and counters
FTS cannot distinguish meaning — it only indexes what it receives.
The semantic indexing model
Indexly now classifies all text into three semantic tiers:
Tier 1 — Human text
paragraphs, sentences, documents
Tier 2 — Semantic metadata
title, author, subject, camera, format
Tier 3 — Technical metadata
timestamps, GPS, dimensions, hashes
Only Tier 1 and Tier 2 are allowed into full-text search. Tier 3 is stored, queryable, but never indexed as text.
Where filtering happens
Semantic filtering is applied once, immediately after extraction:
extract_text_from_file()
↓
semantic pre-filter
├─ Tier 1 → clean_content
├─ Tier 2 → filtered semantic text
└─ Tier 3 → structured metadata only
This guarantees:
- consistent behavior across file types
- no duplicated logic
- predictable relevance
Database impact (and why an update is required)
The database schema remains intentionally stable:
| Field | Responsibility |
|---|---|
content |
Tier 1 + Tier 2 |
clean_content |
Tier 1 only |
file_metadata |
Tier 2 + Tier 3 |
However, existing databases contain polluted vocabularies created before semantic filtering.
From v1.0.6 onward, users must run:
indexly update-db --db fts.index.db
This migrates the database to support clean semantic indexing. For more Information on migration, see Update-db Utility
Without updating, legacy databases retain noisy vocabularies and cannot fully benefit from semantic indexing.
Measured results (real data)
| Metric | With filtering | Without |
|---|---|---|
| Unique terms | ↓ drastically | inflated |
| Numeric dominance | ↓ ~10× | extreme |
| Ranking stability | high | unstable |
Across different database sizes, the distribution shape consistently improves.
Why this works
Semantic indexing ensures that:
- search terms represent intent
- metadata enhances results instead of polluting them
- performance scales predictably
FTS now reflects human meaning — not file internals.
What this unlocks next
- precision search via
clean_content - hybrid text + metadata queries
- relevance tuning
- long-term index health analytics