Semantic Indexing & Vocabulary Quality

How the problem was discovered

As Indexly databases grew, users reported that:

searches felt noisy
relevance degraded over time
common numbers dominated results

To diagnose this, Indexly introduced:

indexly analyze-db fts.index.db --table file_index_vocab

This exposed the actual vocabulary used by FTS, not assumptions.

The results were clear:

~75% of indexed terms appeared in only 1–2 documents
numeric-only tokens dominated frequency
ranking behavior became statistically unstable

The root cause

FTS was not malfunctioning. Indexly was indexing everything equally, including:

timestamps
EXIF data
dimensions
IDs and counters

FTS cannot distinguish meaning — it only indexes what it receives.

The semantic indexing model

Indexly now classifies all text into three semantic tiers:

Tier 1 — Human text
  paragraphs, sentences, documents

Tier 2 — Semantic metadata
  title, author, subject, camera, format

Tier 3 — Technical metadata
  timestamps, GPS, dimensions, hashes

Only Tier 1 and Tier 2 are allowed into full-text search. Tier 3 is stored, queryable, but never indexed as text.

Where filtering happens

Semantic filtering is applied once, immediately after extraction:

extract_text_from_file()
        ↓
semantic pre-filter
        ├─ Tier 1 → clean_content
        ├─ Tier 2 → filtered semantic text
        └─ Tier 3 → structured metadata only

This guarantees:

consistent behavior across file types
no duplicated logic
predictable relevance

Database impact (and why an update is required)

The database schema remains intentionally stable:

Field	Responsibility
`content`	Tier 1 + Tier 2
`clean_content`	Tier 1 only
`file_metadata`	Tier 2 + Tier 3

However, existing databases contain polluted vocabularies created before semantic filtering.

From v1.0.6 onward, users must run:

indexly update-db --db fts.index.db

This migrates the database to support clean semantic indexing. For more Information on migration, see Update-db Utility

Without updating, legacy databases retain noisy vocabularies and cannot fully benefit from semantic indexing.

Measured results (real data)

Metric	With filtering	Without
Unique terms	↓ drastically	inflated
Numeric dominance	↓ ~10×	extreme
Ranking stability	high	unstable

Across different database sizes, the distribution shape consistently improves.

Why this works

Semantic indexing ensures that:

search terms represent intent
metadata enhances results instead of polluting them
performance scales predictably

FTS now reflects human meaning — not file internals.

What this unlocks next

precision search via clean_content
hybrid text + metadata queries
relevance tuning
long-term index health analytics

👉 Ignore Rules & Index Hygiene