Struct-Bench

A Benchmark for Evaluating Differentially Private Synthetic Data for Structured Datasets Containing Natural Language

Introduction

Differentially private (DP) synthetic data generation is a promising technique for utilizing private datasets that otherwise cannot be exposed for model training or other analytics. While much of the research literature has focused on generating private unstructured text and image data, structured data (e.g., tabular data) is more common in enterprise settings and often includes natural language fields or components. Existing synthetic data evaluation techniques (e.g., FID) struggle to capture the structural properties and correlations of such datasets. In this work, we propose Struct-Bench, a framework and benchmark for evaluating synthetic data derived from structured datasets that contain natural language. The Struct-Bench framework requires users to provide a representation of their dataset structure as a Context-Free Grammar (CFG). Our benchmark comprises five real-world and two synthetically generated datasets, each annotated with a CFG. We show that these datasets pose a substantial challenge even to state-of-the-art DP synthetic data generation methods. Struct-Bench also includes reference implementations of the different metrics and a leaderboard, providing researchers with a standardized evaluation platform to benchmark and investigate privacy-preserving synthetic data generation methods. Further, we present a case study showing how to use Struct-Bench to improve the quality of synthetic data generated by Private Evolution (PE) on structured data. The benchmark and leaderboard will be publicly available at https://struct-bench.github.io.
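
To make the CFG requirement concrete, here is a minimal sketch of what a grammar annotation for a toy structured dataset might look like, written for the Lark parsing library. The record format, field names, and grammar below are purely illustrative assumptions, not one of Struct-Bench's actual dataset annotations.

```python
# Illustrative CFG for a hypothetical dataset of (rating, review) records,
# written for the Lark parsing library. The schema is an assumption for this
# sketch, not one of Struct-Bench's actual grammar annotations.
from lark import Lark

RECORD_GRAMMAR = r"""
    record: "{" rating "," review "}"
    rating: "\"rating\":" INT
    review: "\"review\":" ESCAPED_STRING

    %import common.INT
    %import common.ESCAPED_STRING
    %import common.WS
    %ignore WS
"""

parser = Lark(RECORD_GRAMMAR, start="record")

sample = '{"rating": 4, "review": "Battery life is great."}'
tree = parser.parse(sample)  # raises a LarkError if the sample violates the CFG
print(tree.pretty())
```

A record that breaks the schema (e.g., a missing field or a non-integer rating) fails to parse, which is exactly the structural signal the CFG annotations make available to the metrics.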

Struct-Bench Average Benchmarking Results

Baseline          Structural Metrics               Non-Structural Metrics            Downstream
                  CFG-PR ↑    KND ↓     AM ↓       KNN-Precision ↑   KNN-Recall ↑    Acc ↑
IF    (ε = 0)     0.8633      0.2614    35.8261    0.2354            0.0573          0.5412
FT    (ε = ∞)     0.0768      0.0315    52.6984    0.2141            0.1898          0.4426
DP-FT (ε = 4)     0.0000      0.0020    0.0060     0.0092            0.0279          0.3802
PE    (ε = 4)     0.8648      0.1595    40.0860    0.2639            0.0491          0.5442

Arrows indicate whether higher (↑) or lower (↓) values are better.
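
Struct-Bench ships reference implementations of each metric, and those are authoritative. As a rough illustration only, the sketch below shows one plausible reading of CFG-PR as a CFG parse rate, i.e., the fraction of synthetic samples that parse under the dataset's grammar; the function name and this interpretation are our assumptions, not the benchmark's definition.

```python
# Hedged sketch: CFG-PR read as the fraction of synthetic samples accepted by
# the dataset's CFG. See the Struct-Bench reference implementations for the
# authoritative definition; this interpretation is an assumption.
from lark import Lark
from lark.exceptions import LarkError

def cfg_parse_rate(samples, grammar, start="record"):
    """Return the fraction of samples that parse under the CFG (higher is better)."""
    parser = Lark(grammar, start=start)
    accepted = 0
    for sample in samples:
        try:
            parser.parse(sample)
            accepted += 1
        except LarkError:  # the sample violates the grammar
            pass
    return accepted / len(samples) if samples else 0.0
```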

Analysis

Based on an analysis of the results on Struct-Bench, we present the following observations:
  • No single metric fully describes synthetic data quality. For a single algorithm and dataset, some metrics can be high while others remain low. This motivates Struct-Bench, which aggregates many diverse metrics.
  • Existing DP synthetic data generators struggle to learn complicated data structures. All baselines achieve a CFG-PR below 0.2 on the ICLR dataset, which features more node types and a significantly more intricate graph structure than ShareGPT.
  • DP fine-tuning alone cannot learn structure. At ε = 4, it achieves a CFG-PR of 0 on all of our datasets. Even at ε = ∞, its best CFG-PR is 0.53 (on ShareGPT), seemingly because ShareGPT contains fewer formatting tokens than the other datasets (e.g., the JSON tags in the tabular datasets).
  • PE and IF learn structure at the expense of semantic coverage. Although PE and IF reliably capture the data structure, they suffer from poor semantic performance: their KNN-Recall is near 0, meaning the synthetic data covers only a small fraction of the real distribution (see the kNN sketch after this list).
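
For reference, the sketch below shows one standard way to compute kNN-based precision and recall on embedding vectors, in the spirit of the improved precision/recall metric of Kynkäänniemi et al. (2019). Struct-Bench's reference implementations are authoritative; details such as the embedding model, the choice of k, and tie handling are assumptions here.

```python
# Hedged sketch of kNN precision/recall over real and synthetic embeddings.
# A synthetic point counts toward precision if it lies inside the k-NN ball
# of at least one real point; recall swaps the roles of the two sets.
import numpy as np
from scipy.spatial.distance import cdist

def kth_neighbor_radii(points: np.ndarray, k: int) -> np.ndarray:
    """Distance from each point to its k-th nearest *other* point in the set."""
    d = cdist(points, points)
    d.sort(axis=1)
    return d[:, k]  # column 0 holds the zero self-distance

def knn_precision_recall(real: np.ndarray, synth: np.ndarray, k: int = 5):
    """real, synth: (n, d) embedding matrices. Returns (precision, recall)."""
    real_radii = kth_neighbor_radii(real, k)
    synth_radii = kth_neighbor_radii(synth, k)
    d = cdist(synth, real)  # (n_synth, n_real) pairwise distances
    precision = float((d <= real_radii[None, :]).any(axis=1).mean())
    recall = float((d.T <= synth_radii[None, :]).any(axis=1).mean())
    return precision, recall
```

Low KNN-Recall alongside a healthier KNN-Precision, the pattern PE and IF exhibit above, means the synthetic samples sit near the real manifold but cover only a small part of it.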