API Reference
This section provides the auto-generated API reference documentation for the ETLForge library.
ETLForge - A Python library for generating test data and validating ETL outputs.
- class etl_forge.DataGenerator(schema_path: str | Path | dict | None = None)
Generates synthetic test data based on a declarative schema.
This class reads a YAML or JSON schema, generates data according to the specified types and constraints, and can save the output to CSV or Excel.
- load_schema(schema_path: str | Path | dict)
Loads and validates a schema from a file path or a dictionary.
This method supports both file paths (YAML or JSON) and direct dictionary objects as input. It orchestrates the loading and subsequent validation of the schema.
The schema can be in ETLForge native format, Frictionless Table Schema, or JSON Schema format. The format is auto-detected and converted to ETLForge format if necessary.
- Parameters:
schema_path – The path to a YAML/JSON schema file or a dictionary containing the schema definition. Supports ETLForge native format, Frictionless Table Schema, and JSON Schema.
- Raises:
ETLForgeError – If schema_path points to a file that is not found, has an unsupported extension, or if the file cannot be parsed due to syntax errors or I/O issues. Also raised if the loaded schema fails any validation checks.
- generate_data(num_rows: int) -> DataFrame
Generates a pandas DataFrame with synthetic data.
This is the main method for data generation. It iterates through the fields defined in the schema and generates data for each column.
- Parameters:
num_rows – The number of rows of data to generate.
- Returns:
A pandas DataFrame containing the synthetic data.
- Raises:
ETLForgeError – If no schema has been loaded, if an unsupported field type is encountered, or if data generation fails for a specific column (e.g., unable to generate enough unique values for the given constraints).
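The unique-value failure mentioned above comes down to counting: a column cannot hold more distinct values than its constraints allow. The sketch below is not ETLForge's implementation. `generate_unique_ints` is a hypothetical helper that uses only the standard library, and it raises `ValueError` where the library would be expected to raise ETLForgeError.

```python
import random


def generate_unique_ints(num_rows, min_value, max_value, seed=None):
    """Return num_rows distinct integers in [min_value, max_value].

    Hypothetical helper for illustration; not part of ETLForge.
    """
    population = range(min_value, max_value + 1)
    if num_rows > len(population):
        # Only len(population) distinct values exist, so the request
        # cannot be satisfied no matter how the values are drawn.
        raise ValueError(
            f"cannot generate {num_rows} unique values from a range "
            f"of size {len(population)}"
        )
    rng = random.Random(seed)
    # random.sample draws without replacement, so all values are distinct.
    return rng.sample(population, num_rows)


values = generate_unique_ints(5, 1, 10, seed=0)
```

Under this reading, a unique integer field with a range of 1 to 10 can supply at most 10 rows, so a request for more rows would be expected to fail.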
- save_data(df: DataFrame, output_path: str | Path, file_format: str | None = None)
Saves the generated DataFrame to a file (CSV or Excel).
- Parameters:
df – The pandas DataFrame to save.
output_path – The destination file path.
file_format – The output format (‘csv’ or ‘excel’). If not provided, it is inferred from the file extension of output_path.
- Raises:
ETLForgeError – If the file format is unsupported or if an error occurs during file writing.
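The fallback from file_format to the file extension can be pictured with a small standard-library sketch. Everything here is an assumption made for illustration: `infer_file_format` is a hypothetical helper, the extension table is a guess at what a CSV/Excel writer would accept, and `ValueError` stands in for ETLForgeError.

```python
from pathlib import Path

# Assumed extension-to-format table; the extensions ETLForge accepts may differ.
EXTENSION_TO_FORMAT = {".csv": "csv", ".xlsx": "excel", ".xls": "excel"}


def infer_file_format(output_path, file_format=None):
    """Return 'csv' or 'excel' (hypothetical helper, not part of ETLForge)."""
    if file_format is not None:
        # An explicit format wins over whatever the extension suggests.
        fmt = file_format.lower()
        if fmt not in ("csv", "excel"):
            raise ValueError(f"unsupported file format: {file_format!r}")
        return fmt
    # No explicit format: fall back to the (case-insensitive) extension.
    suffix = Path(output_path).suffix.lower()
    if suffix not in EXTENSION_TO_FORMAT:
        raise ValueError(f"cannot infer format from extension {suffix!r}")
    return EXTENSION_TO_FORMAT[suffix]
```

In this sketch an explicit file_format always takes precedence, so a path such as `data.txt` is still accepted when the format is given.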
- generate_and_save(num_rows: int, output_path: str | Path, file_format: str | None = None) -> DataFrame
Generates data and saves it to a file in a single step.
- Parameters:
num_rows – The number of rows of data to generate.
output_path – The destination file path.
file_format – The output format (‘csv’ or ‘excel’). If not provided, it is inferred from the file extension.
- Returns:
The generated pandas DataFrame.
- class etl_forge.DataValidator(schema_path: str | Path | dict | None = None)
Validates tabular data (pandas DataFrames) against a declarative schema.
This class reads a schema and validates pandas DataFrames, performing a series of validation checks to ensure the data conforms to the schema’s specifications.
- load_schema(schema_path: str | Path | dict)
Loads a schema from a file path or a dictionary.
The schema can be in ETLForge native format, Frictionless Table Schema, or JSON Schema format. The format is auto-detected and converted to ETLForge format if necessary.
- Parameters:
schema_path – The path to a YAML/JSON schema file or a dictionary containing the schema definition. Supports ETLForge native format, Frictionless Table Schema, and JSON Schema.
- Raises:
ETLForgeError – If the schema file is not found, has an unsupported format, or cannot be parsed.
- validate(df: DataFrame) -> ValidationResult
Validates a pandas DataFrame against the loaded schema.
This is the main validation method. It runs all configured validation checks.
- Parameters:
df – A pandas DataFrame to validate.
- Returns:
A ValidationResult object containing the detailed results of the validation run.
- Raises:
ETLForgeError – If no schema has been loaded or if df is not a DataFrame.
- validate_and_report(df: DataFrame, report_path: str | None = None) -> ValidationResult
Validates a pandas DataFrame and optionally saves a report of invalid rows.
- Parameters:
df – A pandas DataFrame to validate.
report_path – The destination file path for the invalid rows report (CSV format). If None, no report is saved.
- Returns:
A ValidationResult object containing the detailed results.
- Raises:
ETLForgeError – If an error occurs while writing the report file.
- print_validation_summary(result: ValidationResult)
Print a summary of validation results.
- class etl_forge.SchemaAdapter
Base class for schema adapters.
Schema adapters convert schemas from established standards to ETLForge’s internal format.
- static detect_schema_type(schema: Dict[str, Any]) -> str
Detect the type of schema based on its structure.
- Parameters:
schema – A dictionary containing the schema definition.
- Returns:
One of ‘etlforge’, ‘frictionless’, ‘jsonschema’, or ‘unknown’.
- Return type:
str
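This reference does not say how detection is carried out, so the following is only a sketch of structure-based detection under stated assumptions. `guess_schema_type` and its heuristics are hypothetical: it assumes that JSON Schemas carry `$schema` or `properties`, that both Frictionless and ETLForge schemas carry a top-level `fields` list, and that Frictionless fields can be recognized by a `constraints` key or by type names that ETLForge does not use.

```python
from typing import Any, Dict

# Frictionless type names that are not among the ETLForge types named in
# this reference (int, float, string, date, category).
FRICTIONLESS_ONLY_TYPES = {"integer", "number", "boolean", "datetime", "time"}


def guess_schema_type(schema: Dict[str, Any]) -> str:
    """Guess a schema's format (hypothetical heuristics, not ETLForge's)."""
    # JSON Schema documents usually declare "$schema" or list "properties".
    if "$schema" in schema or isinstance(schema.get("properties"), dict):
        return "jsonschema"
    fields = schema.get("fields")
    if isinstance(fields, list):
        for field in fields:
            if not isinstance(field, dict):
                return "unknown"
            # Assumption: a "constraints" key or a Frictionless-only type
            # name marks the schema as Frictionless Table Schema.
            if "constraints" in field or field.get("type") in FRICTIONLESS_ONLY_TYPES:
                return "frictionless"
        # Assumption: any other "fields" list is ETLForge native format.
        return "etlforge"
    return "unknown"
```

A schema whose fields use only type names shared by both formats (for example, string) is ambiguous under these heuristics and falls through to ‘etlforge’; the library's own rules may resolve such cases differently.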
- static load_and_convert(schema_path: str | Path | dict) -> Dict[str, Any]
Load a schema from a file or dict and convert to ETLForge format.
This method auto-detects the schema type and applies the appropriate conversion.
- Parameters:
schema_path – The path to a schema file or a dictionary.
- Returns:
A schema dictionary in ETLForge format.
- Raises:
ETLForgeError – If the schema cannot be loaded or converted.
- class etl_forge.FrictionlessAdapter
Adapter for Frictionless Table Schema.
Frictionless Table Schema is a standard for describing tabular data. Spec: https://specs.frictionlessdata.io/table-schema/
Supported Frictionless types and their ETLForge mappings:
- integer -> int
- number -> float
- string -> string
- date/datetime/time -> date
- boolean -> category (with values [True, False])
- array/object -> Not supported (raises error)
Supported constraints:
- required -> nullable (inverted)
- unique -> unique
- minimum/maximum -> range.min/range.max
- minLength/maxLength -> length.min/length.max
- enum -> values (for category type)
- pattern -> Not directly supported (logged as warning)
- classmethod convert(schema: Dict[str, Any]) -> Dict[str, Any]
Convert a Frictionless Table Schema to ETLForge format.
- Parameters:
schema – A Frictionless Table Schema dictionary.
- Returns:
An ETLForge-compatible schema dictionary.
- Raises:
ETLForgeError – If the schema contains unsupported types.
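The type and constraint mappings listed for this adapter can be applied to a single field descriptor as in the sketch below. This is an illustration of the mapping table, not the adapter's code: `convert_frictionless_field` is a hypothetical helper, the output keys (`name`, `type`, `nullable`, `range`, `length`, `values`) are an assumed layout for ETLForge's native format, and `ValueError` stands in for ETLForgeError. Switching a field to category when `enum` is present is also an assumption.

```python
from typing import Any, Dict

# Type mapping taken from the list in this reference.
TYPE_MAP = {
    "integer": "int",
    "number": "float",
    "string": "string",
    "date": "date",
    "datetime": "date",
    "time": "date",
    "boolean": "category",
}


def convert_frictionless_field(field: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one Frictionless field descriptor (illustrative sketch only)."""
    # Table Schema treats a missing "type" as string.
    source_type = field.get("type", "string")
    if source_type not in TYPE_MAP:
        # array, object, and anything else unmapped are rejected.
        raise ValueError(f"unsupported Frictionless type: {source_type!r}")
    constraints = field.get("constraints", {})
    out: Dict[str, Any] = {"name": field["name"], "type": TYPE_MAP[source_type]}
    # required -> nullable (inverted)
    out["nullable"] = not constraints.get("required", False)
    if "unique" in constraints:
        out["unique"] = constraints["unique"]
    # minimum/maximum -> range.min/range.max
    bounds = {
        key: constraints[src]
        for key, src in (("min", "minimum"), ("max", "maximum"))
        if src in constraints
    }
    if bounds:
        out["range"] = bounds
    # minLength/maxLength -> length.min/length.max
    lengths = {
        key: constraints[src]
        for key, src in (("min", "minLength"), ("max", "maxLength"))
        if src in constraints
    }
    if lengths:
        out["length"] = lengths
    if source_type == "boolean":
        out["values"] = [True, False]
    elif "enum" in constraints:
        # Assumption: an enum turns the field into a category.
        out["type"] = "category"
        out["values"] = list(constraints["enum"])
    return out
```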
- class etl_forge.JsonSchemaAdapter
Adapter for JSON Schema.
JSON Schema is a widely-adopted standard for describing JSON data. Spec: https://json-schema.org/
This adapter supports JSON Schema Draft-07 and later for describing tabular data where each row is an object with properties.
Supported JSON Schema types and their ETLForge mappings:
- integer -> int
- number -> float
- string -> string (or date if format is date/date-time)
- boolean -> category (with values [true, false])
- array/object -> Not supported as field types
Supported keywords:
- required -> nullable (inverted)
- minimum/maximum -> range.min/range.max
- exclusiveMinimum/exclusiveMaximum -> adjusted range
- minLength/maxLength -> length.min/length.max
- enum -> category type with values
- format (date, date-time, email, etc.) -> type hints
- classmethod convert(schema: Dict[str, Any]) -> Dict[str, Any]
Convert a JSON Schema to ETLForge format.
- Parameters:
schema – A JSON Schema dictionary describing a tabular row structure.
- Returns:
An ETLForge-compatible schema dictionary.
- Raises:
ETLForgeError – If the schema cannot be converted.
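As a rough picture of how a row-describing JSON Schema could map onto fields, the sketch below walks `properties` and applies the keyword mappings listed for this adapter. It is not the adapter's implementation. `convert_json_schema` is hypothetical, the output layout (`fields` with `name`, `type`, `nullable`, `range`, `length`, `values`) is assumed, and `ValueError` stands in for ETLForgeError. The reference only says exclusive bounds yield an "adjusted range"; tightening integer bounds by one is this sketch's own interpretation.

```python
from typing import Any, Dict, List


def convert_json_schema(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a row-describing JSON Schema (illustrative sketch only)."""
    required = set(schema.get("required", []))
    fields: List[Dict[str, Any]] = []
    for name, prop in schema.get("properties", {}).items():
        json_type = prop.get("type")
        if json_type == "integer":
            etl_type = "int"
        elif json_type == "number":
            etl_type = "float"
        elif json_type == "string":
            is_date = prop.get("format") in ("date", "date-time")
            etl_type = "date" if is_date else "string"
        elif json_type == "boolean":
            etl_type = "category"
        else:
            raise ValueError(f"unsupported type for field {name!r}: {json_type!r}")

        # required -> nullable (inverted)
        field: Dict[str, Any] = {
            "name": name,
            "type": etl_type,
            "nullable": name not in required,
        }

        bounds: Dict[str, Any] = {}
        if "minimum" in prop:
            bounds["min"] = prop["minimum"]
        if "maximum" in prop:
            bounds["max"] = prop["maximum"]
        if json_type == "integer":
            # Assumption: an exclusive integer bound n becomes n + 1 or n - 1.
            if "exclusiveMinimum" in prop:
                bounds["min"] = prop["exclusiveMinimum"] + 1
            if "exclusiveMaximum" in prop:
                bounds["max"] = prop["exclusiveMaximum"] - 1
        if bounds:
            field["range"] = bounds

        # minLength/maxLength -> length.min/length.max
        lengths = {
            key: prop[src]
            for key, src in (("min", "minLength"), ("max", "maxLength"))
            if src in prop
        }
        if lengths:
            field["length"] = lengths

        if json_type == "boolean":
            field["values"] = [True, False]
        elif "enum" in prop:
            # enum -> category type with values
            field["type"] = "category"
            field["values"] = list(prop["enum"])
        fields.append(field)
    return {"fields": fields}
```

Exclusive bounds on number fields are left untouched here because the adjustment for floats is not specified in this reference.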
Generator
Data generator module for creating synthetic test data based on schema definitions.
- class etl_forge.generator.DataGenerator(schema_path: str | Path | dict | None = None)
Bases: object
Generates synthetic test data based on a declarative schema.
This class reads a YAML or JSON schema, generates data according to the specified types and constraints, and can save the output to CSV or Excel.
- load_schema(schema_path: str | Path | dict)
Loads and validates a schema from a file path or a dictionary.
This method supports both file paths (YAML or JSON) and direct dictionary objects as input. It orchestrates the loading and subsequent validation of the schema.
The schema can be in ETLForge native format, Frictionless Table Schema, or JSON Schema format. The format is auto-detected and converted to ETLForge format if necessary.
- Parameters:
schema_path – The path to a YAML/JSON schema file or a dictionary containing the schema definition. Supports ETLForge native format, Frictionless Table Schema, and JSON Schema.
- Raises:
ETLForgeError – If schema_path points to a file that is not found, has an unsupported extension, or if the file cannot be parsed due to syntax errors or I/O issues. Also raised if the loaded schema fails any validation checks.
- generate_data(num_rows: int) -> DataFrame
Generates a pandas DataFrame with synthetic data.
This is the main method for data generation. It iterates through the fields defined in the schema and generates data for each column.
- Parameters:
num_rows – The number of rows of data to generate.
- Returns:
A pandas DataFrame containing the synthetic data.
- Raises:
ETLForgeError – If no schema has been loaded, if an unsupported field type is encountered, or if data generation fails for a specific column (e.g., unable to generate enough unique values for the given constraints).
- save_data(df: DataFrame, output_path: str | Path, file_format: str | None = None)
Saves the generated DataFrame to a file (CSV or Excel).
- Parameters:
df – The pandas DataFrame to save.
output_path – The destination file path.
file_format – The output format (‘csv’ or ‘excel’). If not provided, it is inferred from the file extension of output_path.
- Raises:
ETLForgeError – If the file format is unsupported or if an error occurs during file writing.
- generate_and_save(num_rows: int, output_path: str | Path, file_format: str | None = None) -> DataFrame
Generates data and saves it to a file in a single step.
- Parameters:
num_rows – The number of rows of data to generate.
output_path – The destination file path.
file_format – The output format (‘csv’ or ‘excel’). If not provided, it is inferred from the file extension.
- Returns:
The generated pandas DataFrame.
Validator
Data validator module for validating tabular data (pandas DataFrames) against schema definitions.
- class etl_forge.validator.DataValidator(schema_path: str | Path | dict | None = None)
Bases: object
Validates tabular data (pandas DataFrames) against a declarative schema.
This class reads a schema and validates pandas DataFrames, performing a series of validation checks to ensure the data conforms to the schema’s specifications.
- load_schema(schema_path: str | Path | dict)
Loads a schema from a file path or a dictionary.
The schema can be in ETLForge native format, Frictionless Table Schema, or JSON Schema format. The format is auto-detected and converted to ETLForge format if necessary.
- Parameters:
schema_path – The path to a YAML/JSON schema file or a dictionary containing the schema definition. Supports ETLForge native format, Frictionless Table Schema, and JSON Schema.
- Raises:
ETLForgeError – If the schema file is not found, has an unsupported format, or cannot be parsed.
- validate(df: DataFrame) -> ValidationResult
Validates a pandas DataFrame against the loaded schema.
This is the main validation method. It runs all configured validation checks.
- Parameters:
df – A pandas DataFrame to validate.
- Returns:
A ValidationResult object containing the detailed results of the validation run.
- Raises:
ETLForgeError – If no schema has been loaded or if df is not a DataFrame.
- validate_and_report(df: DataFrame, report_path: str | None = None) -> ValidationResult
Validates a pandas DataFrame and optionally saves a report of invalid rows.
- Parameters:
df – A pandas DataFrame to validate.
report_path – The destination file path for the invalid rows report (CSV format). If None, no report is saved.
- Returns:
A ValidationResult object containing the detailed results.
- Raises:
ETLForgeError – If an error occurs while writing the report file.
- print_validation_summary(result: ValidationResult)
Print a summary of validation results.
Command-Line Interface
Command-line interface for ETLForge.