Cleaning CSV Data with Indexly

Automate CSV data cleaning in Indexly with intelligent type inference, datetime normalization, missing value imputation, and persistence. Ideal for data analysts and Python developers.

Indexlyโ€™s Auto Clean pipeline transforms messy CSV files into analysis-ready datasets with type inference, missing value imputation, and datetime normalization โ€” all seamlessly integrated with analyze-csv.

๐ŸŽฏ Overview

The --auto-clean flag in indexly analyze-csv enables a robust preprocessing pipeline that:

*Detects and normalizes mixed datetime formats *Infers *data types *(numeric, categorical, datetime) *Fills missing values using statistical imputation *Summarizes the cleaning results in a rich table *Optionally *saves cleaned data *for later reuse

๐Ÿ’ก Clean data can be visualized before or after cleaning โ€” see Visual Exploration.


โœจ Key Highlights

Capability Description
๐Ÿง  Type Inference Automatically detects numeric, string, and datetime columns
๐Ÿ“… Datetime Parsing Dynamically parses mixed formats using user-provided patterns
๐Ÿงฎ NaN Imputation Fills missing numeric/categorical data using mean/median/mode
โš–๏ธ Threshold Validation Skips unreliable columns based on valid ratio thresholds
๐Ÿ“Š Summary Reporting Renders terminal tables showing actions taken on each column
๐Ÿ’พ Persistence Saves cleaned datasets for future analysis with --use-cleaned

Quick Start Example

Example CSV (mixed _dates.csv)

User,Start _Date,End _Timestamp,Notes
Alice,12/05/2021,2021-05-20 14:00:00,Normal entry
Bob,2021/06/01,2021-06-02T09:30:00,Manual import
Charlie,05-07-2021,2021.07.10,Missing format
David,13.08.2021,,Invalid time
Eva,,2021-08-25 17:15:00,Skipped row

Run Command

indexly analyze-csv mixed _dates.csv
  --auto-clean
  --datetime-formats "%d/%m/%Y" "%Y-%m-%d %H:%M:%S" "%Y/%m/%d" "%m-%d-%Y" "%d.%m.%Y" "%Y-%m-%dT%H:%M:%S"
  --date-threshold 0.1
  --show-summary
  --save-data

Example Output

CSV Analysis โš™๏ธ Running robust cleaning pipeline using MEAN fill method...

โš ๏ธ Skipped 'Start _Date' โ€” less than 60% valid dates (20.0%)
โš ๏ธ Skipped 'End _Timestamp' โ€” less than 60% valid dates (20.0%)
โœ… Cleaning complete: 5 rows remain (0 duplicates removed)

                       ๐Ÿงผ Cleaning Summary
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Column        โ”ƒ Type   โ”ƒ Action โ”ƒ NaNs Filled โ”ƒ Fill Strategy โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ User          โ”‚ string โ”‚ none   โ”‚           0 โ”‚ -             โ”‚
โ”‚ Start_Date    โ”‚ string โ”‚ none   โ”‚           0 โ”‚ -             โ”‚
โ”‚ End_Timestamp โ”‚ string โ”‚ none   โ”‚           0 โ”‚ -             โ”‚
โ”‚ Notes         โ”‚ string โ”‚ none   โ”‚           0 โ”‚ -             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿ’พ Cleaned data saved for future reuse
โš ๏ธ No numeric or datetime-derived columns found.

Cleaning Logic Explained

The pipeline works in three primary stages:

1 . Type Inference

Infers likely column types using heuristics and pandasโ€™ dtype inference.

df =  _infer _types(df)

*Converts numeric strings to floats/ints where possible *Flags likely date columns before parsing


2 . Datetime Normalization

Automatically parses mixed date formats:

df, date _summaries =  _auto _parse _dates(
    df,
    date _formats=date _formats,
    min _valid _ratio=0.3
)

If a column has fewer than 30% valid datetimes (controlled via --date-threshold), it is skipped and retained as string.

Derived columns can also be generated:

  • <col> _year
  • <col> _month
  • <col> _day
  • <col> _weekday

3. Missing Value Imputation

Fills missing numeric/categorical data:

Strategy Description Example
mean replaces NaNs with column mean height = 172.4
median replaces NaNs with column median salary = 52,000
mode replaces NaNs with most frequent value country = “DE”

CLI control:

--fill-method mean|median|mode

Validation & Thresholds

Columns are only parsed as datetime or numeric if the ratio of valid values exceeds the threshold (default 0.3).

--date-threshold 0.1

For instance, if fewer than 10% of entries in Start _Date are valid, the column is skipped with a warning.


Visual Feedback

The cleaning process generates real-time feedback in the console:

  • ASCII summary table(as shown above)
  • Warnings for skipped or invalid columns
  • Counts of NaNs filled, duplicate removal
  • Optional histograms in visualization mode (see Visual Exploration)

Export & Reuse

After cleaning, the processed dataset is saved optionally with (–save-data) to Indexlyโ€™s SQLite store:

indexly analyze-csv dataset.csv --use-cleaned

This avoids reprocessing the same dataset repeatedly.


Behind the Scenes (Developer Notes)

_handle _datetime _columns()

Parses dates with flexible formats and skips those below validity threshold.

def  _handle _datetime _columns(df, date _formats, min _valid _ratio=0.3):
    for col in df.columns:
        parsed = pd.to _datetime(df [col], format=fmt, errors="coerce")
        valid _ratio = parsed.notna().mean()
        if valid _ratio >= min _valid _ratio:
            df [col] = parsed
        else:
            console.print(f"โš ๏ธ Skipped '{col}' โ€” less than {min _valid _ratio *100:.0f}% valid dates")
    return df

_summarize _cleaning _results()

Builds structured summary tables with cleaning actions and statistics.

def  _summarize _cleaning _results(df, summary _records):
    table = Table(title="๐Ÿงผ Cleaning Summary")
    for col, dtype, action, nan _count, strategy in summary _records:
        table.add _row(col, dtype, action, str(nan _count), strategy)
    console.print(table)

_infer _types()

Lightweight dtype inference helper for early normalization.

def  _infer _types(df):
    for col in df.columns:
        try:
            df [col] = pd.to _numeric(df [col])
        except Exception:
            pass
    return df

auto _clean _csv()

Top-level orchestrator coordinating the entire cleaning workflow.

def auto _clean _csv(df, file _path, method="mean", save _cleaned=False, date _formats=None):
    df =  _infer _types(df)
    df, date _summaries =  _auto _parse _dates(df, date _formats, min _valid _ratio=0.3)
    df =  _fill _missing _values(df, method)
     _summarize _cleaning _results(df, date _summaries)
    if save _cleaned:
         _save _cleaned _dataset(df, file _path)
    return df

Pro Tips

  • Combine --auto-clean with --visualize to instantly inspect cleaned distributions
  • Use --date-threshold 0.1 for tolerant datetime detection on mixed sources
  • Reuse cleaned data with --use-cleaned to skip repetitive parsing
  • Adjust fill strategy for skewed data: --fill-method median

Next Steps


Summary

The Indexly Auto Clean module provides a statistically grounded, extensible preprocessing pipeline designed for both command-line use and programmatic workflows. Whether youโ€™re preparing raw exports, sensor logs, or mixed-format spreadsheets โ€” --auto-clean ensures your data is ready for immediate visualization and analysis.