DataFrame
DataFrame module: columnar, lazy-evaluated tabular data.
Design: typed columns (IntCol/FloatCol/StrCol/BoolCol + nullable variants), a LazyFrame / Plan ADT for deferred execution, and a ColExpr language for vectorized filter predicates (no row materialization).
Phase 1: types, construction, column/row access, structural ops
Phase 2: CSV and JSON I/O
Phase 3: ColExpr, LazyFrame, Plan interpreter
Phase 4: GroupBy + aggregation (uses Stats module)
Phase 5: inner_join, left_join, right_join, outer_join
Phase 6: describe, value_counts, col_z_score, col_normalize
Types
Functions
Construct a DataFrame directly from a list of columns. No length check.
Returns true if the column is a nullable variant.
Returns the number of null entries in a nullable column (0 for non-nullable).
Wrap a non-nullable column in a nullable variant (all-false bitmap). No-op on already-nullable columns.
Constructs a DataFrame from a list of pre-built columns. All columns must have equal length. Returns Err if they don't.
Constructs a DataFrame from a list of Row values. Type inference is strict: if a column starts as Int, every subsequent value must also be IntVal. Errors on type mismatch or NullVal in Phase 1. For widening inference (Int→Float→String), use from_rows_widen.
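The difference between strict and widening inference can be illustrated outside March. The following Python sketch is a model of the two rules described above, not the module's implementation; `infer_strict` and `infer_widen` are hypothetical names, and only int, float, and str values are considered.

```python
# Model of the two inference rules on one column's values.
# Hypothetical helpers; not part of the DataFrame module.

# Widening order described in the text: Int -> Float -> String.
_RANK = {int: 0, float: 1, str: 2}


def infer_strict(values):
    """The first value fixes the type; any later mismatch is an error."""
    kind = type(values[0])
    for v in values:
        if type(v) is not kind:
            raise TypeError(f"expected {kind.__name__}, got {type(v).__name__}")
    return kind.__name__


def infer_widen(values):
    """The column takes the widest type seen among its values."""
    widest = max((type(v) for v in values), key=lambda t: _RANK[t])
    return widest.__name__
```

Under the strict rule a column such as `[1, 2.5]` is rejected, while the widening rule would type it as float.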
Returns the number of rows. O(1) for single-column frames; O(cols) otherwise.
Returns the column with the given name, or Err if not found.
Returns the Int data list for a named column, or Err.
Returns the Float data list for a named column, or Err.
Returns the String data list for a named column, or Err.
Returns the Bool data list for a named column, or Err.
Returns a column as List(Float) for use with Stats functions. Int columns are promoted.
Materializes row i as a Row snapshot. O(col_count × i). Slow path.
Returns all rows as a list of Row values. O(rows × cols). Slow path.
Look up a field in a Row by name. Returns None if not found.
Look up a Float field in a Row. Int values are promoted to Float.
Look up a String field in a Row.
Adds a column to a DataFrame. Errors if name already exists or length mismatches.
Removes a column by name. No-op if not found.
Renames a column. Errors if old_name not found or new_name already exists.
Returns a DataFrame with columns reordered/selected by name list. Errors on missing names.
Returns rows [start, start+len). Clamps to available rows.
Appends a Filter node that uses a vectorized ColExpr predicate.
Adds or replaces a column computed row-by-row using a March closure.
Appends a sort node. Keys is a list of (column_name, SortDir) pairs.
Renames a column in the pipeline.
Materializes the lazy plan into a DataFrame.
Only rows where every key column matches in both left and right are included in the output. Right key columns are not duplicated.
Example:
    let lf = DataFrame.lazy(orders_df)
      |> DataFrame.inner_join(products_df, ["product_id"])
    let result = DataFrame.collect(lf)
    -- Only orders that reference a known product_id are included
Rows in the left frame that have no match in the right frame are still included, with all right-only columns set to Null (NullableXxxCol). Right columns that are join keys are not duplicated.
Example:
    let lf = DataFrame.lazy(orders_df)
      |> DataFrame.left_join(customers_df, ["customer_id"])
    let result = DataFrame.collect(lf)
    -- Every order row is present; customer_name is Null for unknown customers
The mirror image of left_join. Every row in right appears in the output; rows with no match in the left frame get Null for every left-only column.
Example:
    let lf = DataFrame.lazy(transactions_df)
      |> DataFrame.right_join(reference_df, ["code"])
    let result = DataFrame.collect(lf)
    -- Every reference row is included; transaction cols are Null for unmatched codes
The union of left_join and right_join: every row from both the left and right frames appears in the output. Rows with no match on the other side get Null for all columns from that side.
Example:
    let lf = DataFrame.lazy(employees_df)
      |> DataFrame.outer_join(departments_df, ["dept_id"])
    let result = DataFrame.collect(lf)
    -- All employees and all departments appear; Nulls where there is no match
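The four join variants differ only in which unmatched rows they keep. The Python sketch below models that on lists of dicts; it is an illustration of the semantics described above, not the module's columnar implementation. The function name `join`, the `how` parameter, and the output row order are assumptions, both inputs are assumed non-empty, and non-key columns are assumed to have distinct names on each side.

```python
# Row-oriented model of inner/left/right/outer join semantics.
# Hypothetical helper; None stands in for Null.

def join(left, right, keys, how="inner"):
    left_only = [c for c in left[0] if c not in keys]
    right_only = [c for c in right[0] if c not in keys]

    def key_of(row):
        return tuple(row[k] for k in keys)

    # Index the right rows by key so each left row is matched in one lookup.
    index = {}
    for r in right:
        index.setdefault(key_of(r), []).append(r)

    out, matched = [], set()
    for l in left:
        hits = index.get(key_of(l), [])
        if hits:
            matched.add(key_of(l))
        for r in hits:
            # Key columns come from the left row, so they are not duplicated.
            out.append({**l, **{c: r[c] for c in right_only}})
        if not hits and how in ("left", "outer"):
            out.append({**l, **{c: None for c in right_only}})

    if how in ("right", "outer"):
        for r in right:
            if key_of(r) not in matched:
                out.append({
                    **{k: r[k] for k in keys},
                    **{c: None for c in left_only},
                    **{c: r[c] for c in right_only},
                })
    return out
```

An inner join is the shared core; left, right, and outer joins add the unmatched rows from one or both sides with nulls in the columns that side cannot supply.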
Default CSV write options: comma as the delimiter, double quote as the quote character, and a header row.
Parse a CSV string into a DataFrame. Infers column types with widening.
Serialize a DataFrame to a CSV string. Uses default options.
Serialize a DataFrame to a CSV string with custom options.
Parse a JSON string (array of objects) into a DataFrame.
Serialize a DataFrame to a JSON string (array of objects).
Groups a DataFrame by the specified columns.
Aggregates a GroupedFrame using the given expressions. Returns a new DataFrame.
Counts the frequency of each distinct value in a column. Returns a DataFrame with columns [col_name, 'count'], sorted by count in descending order.
Returns summary statistics for each column.
Z-score normalize an IntCol or FloatCol. Returns FloatCol.
Min-max normalize a column to [0, 1]. Returns FloatCol.
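For reference, the two normalizations can be written out on a plain list of numbers. This Python sketch assumes the population standard deviation (dividing by n) for the z-score; the text does not say whether the module uses the population or sample form. The function names are hypothetical, and a constant column (zero spread) is not handled.

```python
import math

# Models of z-score and min-max normalization on a list of numbers.
# Hypothetical helpers; population standard deviation is an assumption.

def z_score(xs):
    mean = sum(xs) / len(xs)
    variance = sum((x - mean) ** 2 for x in xs) / len(xs)
    std = math.sqrt(variance)
    return [(x - mean) / std for x in xs]


def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]
```

Both return floats even for integer input, which matches the FloatCol result type described above.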
Columns in the result: "column", "type", "count", "mean", "std", "min", "p25", "median", "p75", "max". Non-numeric columns get NullVal for every numeric stat.
Example:
    let df = DataFrame.make_df([IntCol("x", [1,2,3,4,5])])
    let d = DataFrame.summarize(df)
    -- d has 1 row: column="x", type="Int", count=5, mean=3.0, ...
Selects n rows at evenly-spaced positions across the DataFrame. Returns the full DataFrame unchanged when n >= row_count(df).
Example:
    let df = DataFrame.make_df([IntCol("v", [0,1,2,3,4,5,6,7,8,9])])
    let s = DataFrame.sample(df, 3)
    -- picks rows 0, 3, 6 -> IntCol("v", [0, 3, 6])
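One stride rule that reproduces the example above is to take position floor(i × row_count / n) for i = 0 .. n-1. The Python sketch below uses that rule; it is an assumption chosen to agree with the documented example, and the module may compute positions differently for other inputs. `sample_positions` is a hypothetical name.

```python
# Model of evenly spaced sampling: which row indices would be picked.
# Hypothetical helper; the exact stride rule is an assumption.

def sample_positions(row_count, n):
    if n >= row_count:
        # The whole frame is returned unchanged.
        return list(range(row_count))
    return [(i * row_count) // n for i in range(n)]
```

Because the positions depend only on `row_count` and `n`, sampling under this rule is deterministic.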
ratio is the fraction of rows placed in the training set (0.0 < ratio < 1.0). The first floor(row_count * ratio) rows become the training set; the remainder become the test set. Row order is preserved.
Example:
    let (train, test) = DataFrame.train_test_split(df, 0.8)
    -- 80% of rows → train, 20% → test
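The split rule is a single cut at floor(row_count × ratio) with no shuffling. A minimal Python model on a list of rows follows; the function mirrors the documented name for readability but is only an illustration, and raising on an out-of-range ratio is an assumption about error handling.

```python
import math

# Model of the ordered split: one cut point, row order preserved.

def train_test_split(rows, ratio):
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must satisfy 0.0 < ratio < 1.0")
    cut = math.floor(len(rows) * ratio)
    return rows[:cut], rows[cut:]
```

Since the rows are not shuffled, data that is sorted by time or by label should be reordered beforehand if a random split is wanted.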
Returns a FloatCol. Errors on other column types.
Example:
    let c = FloatCol("price", [1.0, 2.0, 3.0])
    let c2 = DataFrame.col_add_float(c, 10.0)
    -- Ok(FloatCol("price", [11.0, 12.0, 13.0]))
Returns a FloatCol; an IntCol input is promoted to Float, as the example shows. Errors on other column types.
Example:
    let c = IntCol("qty", [1, 2, 3])
    let c2 = DataFrame.col_mul_float(c, 2.5)
    -- Ok(FloatCol("qty", [2.5, 5.0, 7.5]))
- Int + Int → IntCol (named after col1)
- Float + Float, Int + Float, Float + Int → FloatCol (named after col1)
Errors if the columns have different lengths or are non-numeric.
Example:
    let a = IntCol("a", [1, 2, 3])
    let b = IntCol("b", [10, 20, 30])
    let sum = DataFrame.col_add_col(a, b)
    -- Ok(IntCol("a", [11, 22, 33]))
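The promotion rules listed above can be modelled in Python with columns represented as (kind, name, data) tuples. The tuple representation and the use of exceptions in place of Err are assumptions made for illustration only.

```python
# Model of element-wise column addition with Int/Float promotion.
# A column is a (kind, name, data) tuple; this layout is hypothetical.

def col_add_col(col1, col2):
    kind1, name1, data1 = col1
    kind2, _name2, data2 = col2
    if kind1 not in ("Int", "Float") or kind2 not in ("Int", "Float"):
        raise TypeError("both columns must be numeric")
    if len(data1) != len(data2):
        raise ValueError("columns must have the same length")
    # Int + Int stays Int; any Float operand promotes the result to Float.
    kind = "Int" if kind1 == kind2 == "Int" else "Float"
    cast = int if kind == "Int" else float
    # The result is named after the first column.
    return (kind, name1, [cast(a + b) for a, b in zip(data1, data2)])
```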
Remove rows that have a null in any of the specified columns.
Replace null values in a nullable column with fill_val. Errors on type mismatch.
Apply fill_null to a named column in a DataFrame, replacing it in-place.
Forward-fill nulls in a nullable column: propagate the last non-null value downward.
Backward-fill nulls in a nullable column: propagate the next non-null value upward.
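The two fill directions are mirror images of each other. In the Python model below, None stands in for a null entry. It assumes that nulls with no earlier (or later) non-null value to copy stay null; the text does not state what the module does in that case. The function names are hypothetical.

```python
# Models of forward and backward fill on a list with None as null.

def forward_fill(values):
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        # Leading nulls have nothing to copy and remain None.
        out.append(last)
    return out


def backward_fill(values):
    # Backward fill is forward fill applied to the reversed list.
    return forward_fill(values[::-1])[::-1]
```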
a new column named out_col. No partitioning (operates over all rows).
id_vars: columns kept as-is in every output row.
value_vars: columns whose names become values of var_col and whose values become values of val_col.
One output row is produced per (input row × value_var).
Example: melt(df, ["id"], ["jan","feb","mar"], "month", "value") turns
    id | jan | feb | mar
into
    id | month | value
(3 output rows per original row)
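The wide-to-long reshaping can be modelled on a list of dict rows. The Python sketch below follows the parameter description above; it is a row-oriented illustration, not the module's columnar implementation, and the row-major output order is an assumption.

```python
# Model of melt: one output row per (input row, value_var) pair.

def melt(rows, id_vars, value_vars, var_col, val_col):
    out = []
    for row in rows:
        for name in value_vars:
            record = {k: row[k] for k in id_vars}  # id columns copied as-is
            record[var_col] = name                 # column name becomes a value
            record[val_col] = row[name]            # cell becomes the value
            out.append(record)
    return out
```

A single input row with three value columns therefore produces three output rows, as in the example above.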
index_col: column whose distinct values become output rows.
cols_col: column whose distinct values become new output column names.
vals_col: column whose values fill the output cells.
Missing (index, col) combinations get NullVal.
Example: pivot(df, "region", "product", "revenue") turns
    region | product | revenue
into
    region | A | B | ...
(one column per distinct product)
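The long-to-wide reshaping can be modelled on a list of dict rows, with None standing in for NullVal. In this Python sketch, output rows and columns appear in first-seen order and a repeated (index, col) pair keeps the last value; both choices are assumptions, since the text does not specify ordering or duplicate handling.

```python
# Model of pivot: distinct index values become rows,
# distinct cols_col values become columns.

def pivot(rows, index_col, cols_col, vals_col):
    col_names, table = [], {}
    for row in rows:
        if row[cols_col] not in col_names:
            col_names.append(row[cols_col])
        table.setdefault(row[index_col], {})[row[cols_col]] = row[vals_col]
    # Missing (index, col) combinations are filled with None.
    return [
        {index_col: idx, **{c: cells.get(c) for c in col_names}}
        for idx, cells in table.items()
    ]
```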
Render a DataFrame as an HTML table string. The output is detected automatically by the March notebook and displayed as a styled table rather than raw text.