# Cleaning CSV Data with Indexly
Indexly's Auto Clean pipeline transforms messy CSV files into analysis-ready datasets with type inference, missing-value imputation, and datetime normalization, all seamlessly integrated with `analyze-csv`.
## 🎯 Overview
The `--auto-clean` flag in `indexly analyze-csv` enables a robust preprocessing pipeline that:
- Detects and normalizes mixed datetime formats
- Infers **data types** (numeric, categorical, datetime)
- Fills missing values using statistical imputation
- Summarizes the cleaning results in a rich table
- Optionally **saves cleaned data** for later reuse
> 💡 Clean data can be visualized before or after cleaning; see Visual Exploration.
## ✨ Key Highlights
| Capability | Description |
|---|---|
| 🧠 Type Inference | Automatically detects numeric, string, and datetime columns |
| 📅 Datetime Parsing | Dynamically parses mixed formats using user-provided patterns |
| 🧮 NaN Imputation | Fills missing numeric/categorical data using mean/median/mode |
| ⚖️ Threshold Validation | Skips unreliable columns based on valid ratio thresholds |
| 📊 Summary Reporting | Renders terminal tables showing actions taken on each column |
| 💾 Persistence | Saves cleaned datasets for future analysis with `--use-cleaned` |
## Quick Start Example
### Example CSV (`mixed_dates.csv`)

```csv
User,Start_Date,End_Timestamp,Notes
Alice,12/05/2021,2021-05-20 14:00:00,Normal entry
Bob,2021/06/01,2021-06-02T09:30:00,Manual import
Charlie,05-07-2021,2021.07.10,Missing format
David,13.08.2021,,Invalid time
Eva,,2021-08-25 17:15:00,Skipped row
```
### Run Command

```bash
indexly analyze-csv mixed_dates.csv \
  --auto-clean \
  --datetime-formats "%d/%m/%Y" "%Y-%m-%d %H:%M:%S" "%Y/%m/%d" "%m-%d-%Y" "%d.%m.%Y" "%Y-%m-%dT%H:%M:%S" \
  --date-threshold 0.1 \
  --show-summary \
  --save-data
```
### Example Output

```text
CSV Analysis ⚙️ Running robust cleaning pipeline using MEAN fill method...
⚠️ Skipped 'Start_Date' — less than 60% valid dates (20.0%)
⚠️ Skipped 'End_Timestamp' — less than 60% valid dates (20.0%)
✅ Cleaning complete: 5 rows remain (0 duplicates removed)
                   🧼 Cleaning Summary
┏━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Column        ┃ Type   ┃ Action ┃ NaNs Filled ┃ Fill Strategy ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ User          │ string │ none   │ 0           │ -             │
│ Start_Date    │ string │ none   │ 0           │ -             │
│ End_Timestamp │ string │ none   │ 0           │ -             │
│ Notes         │ string │ none   │ 0           │ -             │
└───────────────┴────────┴────────┴─────────────┴───────────────┘
💾 Cleaned data saved for future reuse
⚠️ No numeric or datetime-derived columns found.
```
## Cleaning Logic Explained
The pipeline works in three primary stages:
### 1. Type Inference

Infers likely column types using heuristics and pandas' dtype inference.

```python
df = _infer_types(df)
```

- Converts numeric strings to floats/ints where possible
- Flags likely date columns before parsing
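As a rough, self-contained sketch of what this inference step might do (simplified for illustration; not Indexly's exact implementation):

```python
import pandas as pd

# Toy illustration: numeric-looking string columns are promoted,
# mixed columns are left untouched for later stages.
df = pd.DataFrame({"a": ["1", "2", "3"], "b": ["x", "2", "3"]})
for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col])
    except (ValueError, TypeError):
        pass  # non-numeric column: keep original values
```

After this pass, column `a` is numeric while column `b` remains a string column.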
### 2. Datetime Normalization

Automatically parses mixed date formats:

```python
df, date_summaries = _auto_parse_dates(
    df,
    date_formats=date_formats,
    min_valid_ratio=0.3,
)
```

If a column has fewer than 30% valid datetimes (controlled via `--date-threshold`), it is skipped and retained as string.

Derived columns can also be generated:

- `<col>_year`
- `<col>_month`
- `<col>_day`
- `<col>_weekday`
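For illustration, derived columns of this shape can be produced with pandas' `.dt` accessor; `add_date_parts` below is a hypothetical helper, not part of Indexly:

```python
import pandas as pd

def add_date_parts(df, col):
    """Derive <col>_year/_month/_day/_weekday from a parsed datetime column."""
    dt = pd.to_datetime(df[col], errors="coerce")
    df[f"{col}_year"] = dt.dt.year
    df[f"{col}_month"] = dt.dt.month
    df[f"{col}_day"] = dt.dt.day
    df[f"{col}_weekday"] = dt.dt.day_name()
    return df

df = add_date_parts(pd.DataFrame({"Start_Date": ["2021-05-12", "2021-06-01"]}), "Start_Date")
```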
### 3. Missing Value Imputation
Fills missing numeric/categorical data:
| Strategy | Description | Example |
|---|---|---|
| mean | replaces NaNs with column mean | height = 172.4 |
| median | replaces NaNs with column median | salary = 52,000 |
| mode | replaces NaNs with most frequent value | country = “DE” |
CLI control:

```bash
--fill-method mean|median|mode
```
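The three strategies map directly onto pandas aggregations. A minimal stand-in (a hypothetical helper, not Indexly's internal code) could look like:

```python
import pandas as pd

def fill_missing(s, method="mean"):
    """Impute NaNs in one column: mean/median for numeric, mode otherwise."""
    if method in ("mean", "median") and pd.api.types.is_numeric_dtype(s):
        fill = s.mean() if method == "mean" else s.median()
        return s.fillna(fill)
    return s.fillna(s.mode().iloc[0])  # mode: most frequent value

heights = fill_missing(pd.Series([170.0, None, 175.0]), "mean")  # gap filled with 172.5
countries = fill_missing(pd.Series(["DE", None, "DE"]), "mode")  # gap filled with "DE"
```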
## Validation & Thresholds

Columns are parsed as datetime or numeric only if the ratio of valid values exceeds the threshold (default 0.3):

```bash
--date-threshold 0.1
```

For instance, if fewer than 10% of entries in `Start_Date` are valid, the column is skipped with a warning.
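The validity check itself is straightforward to express in pandas; the sample values and the 0.1 threshold below mirror the flag above:

```python
import pandas as pd

values = pd.Series(["2021-06-01", "not a date", "", "12/05/2021", "N/A"])
parsed = pd.to_datetime(values, format="%Y-%m-%d", errors="coerce")

valid_ratio = parsed.notna().mean()    # here: 1 of 5 values parses -> 0.2
keep_as_datetime = valid_ratio >= 0.1  # passes a --date-threshold of 0.1
```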
## Visual Feedback
The cleaning process generates real-time feedback in the console:
- ASCII summary table (as shown above)
- Warnings for skipped or invalid columns
- Counts of NaNs filled and duplicates removed
- Optional histograms in visualization mode (see Visual Exploration)
## Export & Reuse

After cleaning, the processed dataset can optionally be saved (with `--save-data`) to Indexly's SQLite store. Pass `--use-cleaned` on later runs to load it directly:

```bash
indexly analyze-csv dataset.csv --use-cleaned
```

This avoids reprocessing the same dataset repeatedly.
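Conceptually, this kind of persistence can be as simple as a pandas/SQLite round-trip. The sketch below is illustrative only; the table name and in-memory database are assumptions, not Indexly's actual storage layout:

```python
import sqlite3
import pandas as pd

cleaned = pd.DataFrame({"User": ["Alice", "Bob"], "Score": [1.0, 2.0]})

con = sqlite3.connect(":memory:")  # Indexly would use its on-disk store
cleaned.to_sql("cleaned_dataset", con, index=False, if_exists="replace")

# A later run loads the table instead of re-parsing and re-cleaning the CSV
reloaded = pd.read_sql("SELECT * FROM cleaned_dataset", con)
```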
## Behind the Scenes (Developer Notes)
### `_handle_datetime_columns()`

Parses dates with flexible formats and skips columns below the validity threshold.

```python
import pandas as pd
from rich.console import Console

console = Console()

def _handle_datetime_columns(df, date_formats, min_valid_ratio=0.3):
    for col in df.columns:
        # Try each user-supplied format and keep the most successful parse
        best = None
        for fmt in date_formats:
            parsed = pd.to_datetime(df[col], format=fmt, errors="coerce")
            if best is None or parsed.notna().mean() > best.notna().mean():
                best = parsed
        valid_ratio = best.notna().mean() if best is not None else 0.0
        if valid_ratio >= min_valid_ratio:
            df[col] = best
        else:
            console.print(
                f"⚠️ Skipped '{col}' — less than {min_valid_ratio * 100:.0f}% valid dates"
            )
    return df
```
### `_summarize_cleaning_results()`

Builds structured summary tables with cleaning actions and statistics.

```python
from rich.table import Table

def _summarize_cleaning_results(df, summary_records):
    table = Table(title="🧼 Cleaning Summary")
    for header in ("Column", "Type", "Action", "NaNs Filled", "Fill Strategy"):
        table.add_column(header)
    for col, dtype, action, nan_count, strategy in summary_records:
        table.add_row(col, dtype, action, str(nan_count), strategy)
    console.print(table)
```
### `_infer_types()`

Lightweight dtype inference helper for early normalization.

```python
def _infer_types(df):
    for col in df.columns:
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # leave non-numeric columns unchanged
    return df
```
### `auto_clean_csv()`

Top-level orchestrator coordinating the entire cleaning workflow.

```python
def auto_clean_csv(df, file_path, method="mean", save_cleaned=False, date_formats=None):
    df = _infer_types(df)
    df, date_summaries = _auto_parse_dates(df, date_formats, min_valid_ratio=0.3)
    df = _fill_missing_values(df, method)
    _summarize_cleaning_results(df, date_summaries)
    if save_cleaned:
        _save_cleaned_dataset(df, file_path)
    return df
```
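Putting the stages together, a compact, self-contained approximation of the workflow might look like the following. `toy_auto_clean` is purely illustrative and omits datetime handling, reporting, and persistence:

```python
import pandas as pd

def toy_auto_clean(df, fill_method="mean"):
    """Minimal stand-in for auto_clean_csv: infer numerics, then impute."""
    for col in df.columns:
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # keep non-numeric columns as-is
    for col in df.select_dtypes("number").columns:
        fill = df[col].median() if fill_method == "median" else df[col].mean()
        df[col] = df[col].fillna(fill)
    return df

cleaned = toy_auto_clean(pd.DataFrame({"x": ["1", None, "3"], "y": ["a", "b", None]}))
```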
## Pro Tips

- Combine `--auto-clean` with `--visualize` to instantly inspect cleaned distributions
- Use `--date-threshold 0.1` for tolerant datetime detection on mixed sources
- Reuse cleaned data with `--use-cleaned` to skip repetitive parsing
- Adjust fill strategy for skewed data: `--fill-method median`
## Next Steps
- Continue with Analyze CSV Visualization
- Explore Statistical Transformation & Scaling
- Learn about Data Tagging & Metadata Indexing
## Summary
The Indexly Auto Clean module provides a statistically grounded, extensible preprocessing pipeline designed for both command-line use and programmatic workflows. Whether you're preparing raw exports, sensor logs, or mixed-format spreadsheets, `--auto-clean` ensures your data is ready for immediate visualization and analysis.