# Cleaning CSV Data with Indexly
Indexly's Auto Clean pipeline transforms messy CSV files into analysis-ready datasets with type inference, missing-value imputation, and datetime normalization, all seamlessly integrated with `analyze-csv`.
## 🎯 Overview
The `--auto-clean` flag in `indexly analyze-csv` enables a robust preprocessing pipeline that:
- Detects and normalizes mixed datetime formats
- Infers **data types** (numeric, categorical, datetime)
- Fills missing values using statistical imputation
- Summarizes the cleaning results in a rich table
- Optionally **saves cleaned data** for later reuse
> 💡 Clean data can be visualized before or after cleaning; see Visual Exploration.
## ✨ Key Highlights
| Capability | Description |
|---|---|
| 🧠 Type Inference | Automatically detects numeric, string, and datetime columns |
| 📅 Datetime Parsing | Dynamically parses mixed formats using user-provided patterns |
| 🧮 NaN Imputation | Fills missing numeric/categorical data using mean/median/mode |
| ⚖️ Threshold Validation | Skips unreliable columns based on valid ratio thresholds |
| 📊 Summary Reporting | Renders terminal tables showing actions taken on each column |
| 💾 Persistence | Saves cleaned datasets for future analysis with `--use-cleaned` |
## Quick Start Example
### Example CSV (`mixed_dates.csv`)

```csv
User,Start_Date,End_Timestamp,Notes
Alice,12/05/2021,2021-05-20 14:00:00,Normal entry
Bob,2021/06/01,2021-06-02T09:30:00,Manual import
Charlie,05-07-2021,2021.07.10,Missing format
David,13.08.2021,,Invalid time
Eva,,2021-08-25 17:15:00,Skipped row
```
### Run Command

```bash
indexly analyze-csv mixed_dates.csv \
  --auto-clean \
  --datetime-formats "%d/%m/%Y" "%Y-%m-%d %H:%M:%S" "%Y/%m/%d" "%m-%d-%Y" "%d.%m.%Y" "%Y-%m-%dT%H:%M:%S" \
  --date-threshold 0.1 \
  --show-summary \
  --save-data
```
### Example Output

```text
CSV Analysis ⚙️ Running robust cleaning pipeline using MEAN fill method...
⚠️ Skipped 'Start_Date' — less than 60% valid dates (20.0%)
⚠️ Skipped 'End_Timestamp' — less than 60% valid dates (20.0%)
✅ Cleaning complete: 5 rows remain (0 duplicates removed)
                   🧼 Cleaning Summary
┏━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Column        ┃ Type   ┃ Action ┃ NaNs Filled ┃ Fill Strategy ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ User          │ string │ none   │ 0           │ -             │
│ Start_Date    │ string │ none   │ 0           │ -             │
│ End_Timestamp │ string │ none   │ 0           │ -             │
│ Notes         │ string │ none   │ 0           │ -             │
└───────────────┴────────┴────────┴─────────────┴───────────────┘
💾 Cleaned data saved for future reuse
⚠️ No numeric or datetime-derived columns found.
```
## Cleaning Logic Explained
The pipeline works in three primary stages:
### 1. Type Inference

Infers likely column types using heuristics and pandas' dtype inference.

```python
df = _infer_types(df)
```

- Converts numeric strings to floats/ints where possible
- Flags likely date columns before parsing
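As a rough, self-contained sketch of what this inference step might do (simplified for illustration; not Indexly's exact implementation):

```python
import pandas as pd

# Toy illustration: numeric-looking string columns are promoted,
# mixed columns are left untouched for later stages.
df = pd.DataFrame({"a": ["1", "2", "3"], "b": ["x", "2", "3"]})
for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col])
    except (ValueError, TypeError):
        pass  # non-numeric column: keep original values
```

After this pass, column `a` is numeric while column `b` remains a string column.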
### 2. Datetime Normalization

Automatically parses mixed date formats:

```python
df, date_summaries = _auto_parse_dates(
    df,
    date_formats=date_formats,
    min_valid_ratio=0.3,
)
```

If a column has fewer than 30% valid datetimes (controlled via `--date-threshold`), it is skipped and retained as string.

Derived columns can also be generated:

- `<col>_year`
- `<col>_month`
- `<col>_day`
- `<col>_weekday`
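For illustration, derived columns of this shape can be produced with pandas' `.dt` accessor; `add_date_parts` below is a hypothetical helper, not part of Indexly:

```python
import pandas as pd

def add_date_parts(df, col):
    """Derive <col>_year/_month/_day/_weekday from a parsed datetime column."""
    dt = pd.to_datetime(df[col], errors="coerce")
    df[f"{col}_year"] = dt.dt.year
    df[f"{col}_month"] = dt.dt.month
    df[f"{col}_day"] = dt.dt.day
    df[f"{col}_weekday"] = dt.dt.day_name()
    return df

df = add_date_parts(pd.DataFrame({"Start_Date": ["2021-05-12", "2021-06-01"]}), "Start_Date")
```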
### 3. Missing Value Imputation
Fills missing numeric/categorical data:
| Strategy | Description | Example |
|---|---|---|
| mean | replaces NaNs with column mean | height = 172.4 |
| median | replaces NaNs with column median | salary = 52,000 |
| mode | replaces NaNs with most frequent value | country = “DE” |
CLI control:

```bash
--fill-method mean|median|mode
```
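The three strategies map directly onto pandas aggregations. A minimal stand-in (a hypothetical helper, not Indexly's internal code) could look like:

```python
import pandas as pd

def fill_missing(s, method="mean"):
    """Impute NaNs in one column: mean/median for numeric, mode otherwise."""
    if method in ("mean", "median") and pd.api.types.is_numeric_dtype(s):
        fill = s.mean() if method == "mean" else s.median()
        return s.fillna(fill)
    return s.fillna(s.mode().iloc[0])  # mode: most frequent value

heights = fill_missing(pd.Series([170.0, None, 175.0]), "mean")  # gap filled with 172.5
countries = fill_missing(pd.Series(["DE", None, "DE"]), "mode")  # gap filled with "DE"
```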
## Validation & Thresholds

Columns are parsed as datetime or numeric only if the ratio of valid values exceeds the threshold (default 0.3):

```bash
--date-threshold 0.1
```

For instance, if fewer than 10% of entries in `Start_Date` are valid, the column is skipped with a warning.
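The validity check itself is straightforward to express in pandas; the sample values and the 0.1 threshold below mirror the flag above:

```python
import pandas as pd

values = pd.Series(["2021-06-01", "not a date", "", "12/05/2021", "N/A"])
parsed = pd.to_datetime(values, format="%Y-%m-%d", errors="coerce")

valid_ratio = parsed.notna().mean()    # here: 1 of 5 values parses -> 0.2
keep_as_datetime = valid_ratio >= 0.1  # passes a --date-threshold of 0.1
```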
## Visual Feedback
The cleaning process generates real-time feedback in the console:
- ASCII summary table (as shown above)
- Warnings for skipped or invalid columns
- Counts of NaNs filled and duplicates removed
- Optional histograms in visualization mode (see Visual Exploration)
## Export & Reuse

After cleaning, the processed dataset can optionally be saved (with `--save-data`) to Indexly's SQLite store. Pass `--use-cleaned` on later runs to load it directly:

```bash
indexly analyze-csv dataset.csv --use-cleaned
```

This avoids reprocessing the same dataset repeatedly.
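Conceptually, this kind of persistence can be as simple as a pandas/SQLite round-trip. The sketch below is illustrative only; the table name and in-memory database are assumptions, not Indexly's actual storage layout:

```python
import sqlite3
import pandas as pd

cleaned = pd.DataFrame({"User": ["Alice", "Bob"], "Score": [1.0, 2.0]})

con = sqlite3.connect(":memory:")  # Indexly would use its on-disk store
cleaned.to_sql("cleaned_dataset", con, index=False, if_exists="replace")

# A later run loads the table instead of re-parsing and re-cleaning the CSV
reloaded = pd.read_sql("SELECT * FROM cleaned_dataset", con)
```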
## Behind the Scenes (Developer Notes)
### `_handle_datetime_columns()`

Parses dates with flexible formats and skips columns below the validity threshold.

```python
import pandas as pd
from rich.console import Console

console = Console()

def _handle_datetime_columns(df, date_formats, min_valid_ratio=0.3):
    for col in df.columns:
        # Try each user-supplied format and keep the most successful parse
        best = None
        for fmt in date_formats:
            parsed = pd.to_datetime(df[col], format=fmt, errors="coerce")
            if best is None or parsed.notna().mean() > best.notna().mean():
                best = parsed
        valid_ratio = best.notna().mean() if best is not None else 0.0
        if valid_ratio >= min_valid_ratio:
            df[col] = best
        else:
            console.print(
                f"⚠️ Skipped '{col}' — less than {min_valid_ratio * 100:.0f}% valid dates"
            )
    return df
```
### `_summarize_cleaning_results()`

Builds structured summary tables with cleaning actions and statistics.

```python
from rich.table import Table

def _summarize_cleaning_results(df, summary_records):
    table = Table(title="🧼 Cleaning Summary")
    for header in ("Column", "Type", "Action", "NaNs Filled", "Fill Strategy"):
        table.add_column(header)
    for col, dtype, action, nan_count, strategy in summary_records:
        table.add_row(col, dtype, action, str(nan_count), strategy)
    console.print(table)
```
### `_infer_types()`

Lightweight dtype inference helper for early normalization.

```python
def _infer_types(df):
    for col in df.columns:
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # leave non-numeric columns unchanged
    return df
```
### `auto_clean_csv()`

Top-level orchestrator coordinating the entire cleaning workflow.

```python
def auto_clean_csv(df, file_path, method="mean", save_cleaned=False, date_formats=None):
    df = _infer_types(df)
    df, date_summaries = _auto_parse_dates(df, date_formats, min_valid_ratio=0.3)
    df = _fill_missing_values(df, method)
    _summarize_cleaning_results(df, date_summaries)
    if save_cleaned:
        _save_cleaned_dataset(df, file_path)
    return df
```
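Putting the stages together, a compact, self-contained approximation of the workflow might look like the following. `toy_auto_clean` is purely illustrative and omits datetime handling, reporting, and persistence:

```python
import pandas as pd

def toy_auto_clean(df, fill_method="mean"):
    """Minimal stand-in for auto_clean_csv: infer numerics, then impute."""
    for col in df.columns:
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # keep non-numeric columns as-is
    for col in df.select_dtypes("number").columns:
        fill = df[col].median() if fill_method == "median" else df[col].mean()
        df[col] = df[col].fillna(fill)
    return df

cleaned = toy_auto_clean(pd.DataFrame({"x": ["1", None, "3"], "y": ["a", "b", None]}))
```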
## Pro Tips

- Combine `--auto-clean` with `--visualize` to instantly inspect cleaned distributions
- Use `--date-threshold 0.1` for tolerant datetime detection on mixed sources
- Reuse cleaned data with `--use-cleaned` to skip repetitive parsing
- Adjust fill strategy for skewed data: `--fill-method median`
## Next Steps
- Continue with Analyze CSV Visualization
- Explore Statistical Transformation & Scaling
- Learn about Data Tagging & Metadata Indexing
## Summary
The Indexly Auto Clean module provides a statistically grounded, extensible preprocessing pipeline designed for both command-line use and programmatic workflows. Whether you're preparing raw exports, sensor logs, or mixed-format spreadsheets, `--auto-clean` ensures your data is ready for immediate visualization and analysis.