A data ingestion task requires a 1 TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because the sink is plain Parquet rather than Delta Lake, built-in file-sizing features such as Auto Optimize and Auto Compaction are unavailable.
Which strategy will yield the best performance without shuffling data?
A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to Parquet.
B. Set spark.sql.shuffle.partitions to 2,048 partitions (1 TB * 1024 * 1024 / 512 MB), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to Parquet.
C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1 TB * 1024 * 1024 / 512 MB), and then write to Parquet.
D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1 TB * 1024 * 1024 / 512 MB), and then write to Parquet.
E. Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to Parquet.
Correct Answer: A
Explanation:
The requirement is to achieve ~512 MB Parquet files with the best performance *without shuffling*. Any use of repartition, sort, or shuffle-related settings violates that constraint. Option A works because spark.sql.files.maxPartitionBytes caps how many bytes Spark packs into a single input partition when reading file-based sources such as JSON, so setting it to 512 MB produces read partitions of roughly that size. Because only narrow transformations follow, Spark preserves that partitioning through to the write, and the number and size of the output Parquet files closely track the read partitions (Parquet's columnar encoding and compression typically make the output files somewhat smaller than the 512 MB input splits). The other options either explicitly trigger a shuffle (repartition, sort) or rely on settings that only take effect during a shuffle (spark.sql.shuffle.partitions, the AQE advisory partition size); coalesce avoids a shuffle but can only reduce the partition count, not set it to a target size. Therefore, A is the only strategy that satisfies the no-shuffle requirement while hitting the target file size.
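As a sanity check, the arithmetic behind the 2,048-partition figure and the exact byte value for the config can be worked out in a few lines of plain Python (the spark.conf.set call in the comment is a sketch of how the setting would typically be applied before the read, not a tested pipeline):

```python
# Back-of-the-envelope math for option A (pure Python, no Spark required).

ONE_TB = 1024 ** 4             # 1 TB (binary) in bytes
TARGET_BYTES = 512 * 1024 ** 2 # 512 MB target part-file size, in bytes

# Value to pass to spark.sql.files.maxPartitionBytes, e.g.:
#   spark.conf.set("spark.sql.files.maxPartitionBytes", str(TARGET_BYTES))
print(TARGET_BYTES)            # 536870912

# Expected number of ~512 MB read partitions, and hence output files,
# assuming narrow transformations preserve the read partitioning.
print(ONE_TB // TARGET_BYTES)  # 2048
```

This matches the 2,048 figure that options B–D arrive at via shuffle-based mechanisms; option A reaches the same file count at read time, with no shuffle.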