Interface parity: PlanFrame spark skin vs PySpark DataFrame
This page compares PlanFrame’s PySpark-like interface skin (planframe.spark.SparkFrame) to pyspark.sql.DataFrame.
PlanFrame is always lazy and executes through adapters; there is no Spark execution engine in core.
Legend
- Exact parity: same name + same argument shape + same behavior (modulo “always lazy”).
- Typed-parity: same shape, but restricted to typed-safe subsets.
- Divergence: present but intentionally differs (documented).
- Unsupported: not implemented; should fail loudly.
High-value mapping
| PySpark | PlanFrame spark skin | Status | Notes |
|---|---|---|---|
df["x"] |
__getitem__ |
Typed-parity | Returns a typed Column wrapper over PlanFrame Expr. |
df.x |
__getattr__ |
Typed-parity | Only when x is a schema column. |
select(...) |
select(...) |
Typed-parity | Accepts str, Spark Column, or Expr (with restrictions). |
where(...) |
where(...) |
Typed-parity | Predicate must be Column[bool] or Expr[bool]. |
filter(...) |
filter(...) |
Divergence | Delegates to where (typed). |
withColumn(name, col) |
withColumn(...) |
Typed-parity | Lowered to core with_columns(exprs={...}). |
withColumns(map) |
withColumns(...) |
Typed-parity | Lowered to repeated with_columns. |
withColumnRenamed(a,b) |
withColumnRenamed(...) |
Typed-parity | Lowered to core rename. |
orderBy(...) |
orderBy(...) |
Typed-parity | Lowered to core sort; supports per-key ascending. |
distinct() |
distinct() |
Typed-parity | Lowered to unique(). |
dropDuplicates(subset=...) |
dropDuplicates(...) |
Typed-parity | Lowered to drop_duplicates. |
groupBy(...).agg(...) |
groupBy(...).agg(...) |
Typed-parity | Typed subset of aggregations (functions.sum/mean/min/max/count/n_unique). |
union / unionAll |
union / unionAll |
Typed-parity | Lowered to vstack (UNION ALL semantics). |
unionByName(...) |
unionByName(...) |
Divergence | allowMissingColumns=True is unsupported. |
join(...) |
join(...) |
Typed-parity | Normalizes on/left_on/right_on; supports Spark-ish how aliases. |
hint(...) |
hint(...) |
Divergence | Plan-level hint node; adapters may ignore. |
Unsupported (expected)
- Partition and caching semantics:
repartition,coalesce,cache,persist,unpersist - SQL-string API:
selectExpr - Partition-level sorting:
sortWithinPartitions - Set ops not implemented in core:
intersect,subtract
Generated inventories (method-by-method)
For an exhaustive, generated inventory:
SparkFramesurface:planframe/design/_generated/interface-inventory-spark.md- PySpark methods missing from PlanFrame spark skin (empty if PySpark isn’t installed):
planframe/design/_generated/spark-missing.md
Missing / divergent areas (categorized)
- Partitioning & caching semantics (unsupported):
cache,persist,unpersist,repartition,coalesce,sortWithinPartitions - String-based DSL (typed restriction):
selectExpris only safe to support as a very small subset (column names + optionalASalias). Arbitrary SQL-like expressions remain unsupported. - Set operations:
intersect/subtractcan be provided as typed wrappers only if core has plan-level nodes with well-defined schema behavior; otherwise they should fail loudly. - Spark session/catalog/stateful APIs: expected to remain unsupported in core PlanFrame.