DropDuplicateRows / Manipulation Layer

Remove duplicate rows based on specified columns. Similar to pandas' drop_duplicates() or R's distinct().

Common applications:

  • Data deduplication (removing identical records)
  • Transaction uniqueness (eliminating double entries)
  • Event log cleaning (removing duplicate logs)
  • Contact list deduplication (unique entries)
  • Time series cleaning (removing repeated measurements)
  • Survey response validation (unique submissions)
  • Data quality assurance (ensuring unique records)
  • Dataset normalization (removing redundancy)

Example: From records [A,1], [B,2], [A,1], [C,3], keeping the first occurrence yields [A,1], [B,2], [C,3].
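For reference, here is the same example expressed with pandas' drop_duplicates(), which the node is compared to above. This is only an illustrative sketch; the column names key and value are made up for the example.

    import pandas as pd

    # The records from the example: [A,1], [B,2], [A,1], [C,3]
    df = pd.DataFrame({"key": ["A", "B", "A", "C"], "value": [1, 2, 1, 3]})

    # Keep the first occurrence of each fully identical row
    deduped = df.drop_duplicates(keep="first")
    print(deduped)
    #   key  value
    # 0   A      1
    # 1   B      2
    # 3   C      3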

Input: Table
Output: Table
Select: [column, ...]

Columns to check for duplicates. Examples:

  • Key identifiers for unique entities
  • Combination of fields defining uniqueness
  • Relevant attributes for deduplication
  • Matching criteria columns

If empty, compares all columns (see the sketch below).
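A minimal sketch of how the column selection changes the result, using pandas' subset argument as a stand-in for this parameter (the column names are hypothetical):

    import pandas as pd

    orders = pd.DataFrame({
        "customer_id": [101, 101, 102],
        "order_id":    [1, 1, 2],
        "channel":     ["web", "mobile", "web"],  # differs between the two 101/1 rows
    })

    # Duplicates are judged on the selected key columns only;
    # the differing "channel" values are ignored.
    by_keys = orders.drop_duplicates(subset=["customer_id", "order_id"])

    # With no columns selected, all columns are compared;
    # here every row is distinct, so nothing is dropped.
    by_all = orders.drop_duplicates()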
First (default)

Strategy for handling multiple occurrences of identical rows. Determines which duplicate record to retain (see the code sketch after the option descriptions).

First

Retain earliest occurrence of duplicate rows. Useful for:

  • Preserving original entries
  • First-touch attribution
  • Initial occurrence tracking

Last

Retain most recent occurrence of duplicate rows. Suitable for:

  • Latest state retention
  • Most recent update keeping
  • Final value preservation

None

Remove every row that appears more than once, keeping no occurrence. Used for:

  • Strict uniqueness enforcement
  • Complete duplicate elimination
  • Unique-only analysis

Any

Keep an arbitrary occurrence from each group of duplicate rows. Appropriate for:

  • Performance optimization
  • When order doesn't matter
  • Cases where any representative row from a duplicate group suffices
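The four strategies mirror the keep options of Polars' DataFrame.unique(). Whether this node is backed by Polars is an assumption, but the semantics sketched below match the descriptions above (data reuses the earlier example):

    import polars as pl

    df = pl.DataFrame({"key": ["A", "B", "A", "C"], "value": [1, 2, 1, 3]})

    first = df.unique(keep="first", maintain_order=True)  # [A,1], [B,2], [C,3]
    last  = df.unique(keep="last",  maintain_order=True)  # [B,2], [A,1], [C,3]
    none  = df.unique(keep="none",  maintain_order=True)  # [B,2], [C,3]  (both [A,1] rows removed)
    any_  = df.unique(keep="any")                         # one arbitrary row per duplicate group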
false (default)

Preserve original row order after deduplication:

  • false (default): Faster processing, may reorder rows
  • true: Maintains original sequence, slower performance

Note: Not supported for DataFrames containing List type columns.
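A short sketch of the trade-off, again written against a Polars-style unique() (an assumption; column names are illustrative):

    import polars as pl

    df = pl.DataFrame({"city": ["C", "A", "C", "B"], "sales": [3, 1, 3, 2]})

    # Default: faster, but surviving rows may come back in any order.
    fast = df.unique(keep="first")

    # maintain_order=True keeps the original relative order (C, A, B here)
    # at the cost of extra processing time.
    ordered = df.unique(keep="first", maintain_order=True)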