DropDuplicateRows / Manipulation Layer
Remove duplicate rows based on specified columns. Similar to pandas' drop_duplicates() or dplyr's distinct() in R.
Common applications:
- Data deduplication (removing identical records)
- Transaction uniqueness (eliminating double entries)
- Event log cleaning (removing duplicate logs)
- Contact list deduplication (unique entries)
- Time series cleaning (removing repeated measurements)
- Survey response validation (unique submissions)
- Data quality assurance (ensuring unique records)
- Dataset normalization (removing redundancy)
Example: From records [A,1], [B,2], [A,1], [C,3], keeping first occurrence results in [A,1], [B,2], [C,3]
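The example above can be sketched with pandas' drop_duplicates(), which the node's behavior resembles (an analogy, not this tool's API):

```python
import pandas as pd

# Records [A,1], [B,2], [A,1], [C,3] from the example above.
df = pd.DataFrame({"key": ["A", "B", "A", "C"], "value": [1, 2, 1, 3]})

# Keeping the first occurrence leaves [A,1], [B,2], [C,3].
deduped = df.drop_duplicates(keep="first")
print(deduped.to_dict("records"))
# [{'key': 'A', 'value': 1}, {'key': 'B', 'value': 2}, {'key': 'C', 'value': 3}]
```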
Input: Table
Output: Table
Select
[column, ...]: Columns to check for duplicates. Examples:
- Key identifiers for unique entities
- Combination of fields defining uniqueness
- Relevant attributes for deduplication
- Matching criteria columns
If empty, all columns are compared.
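A sketch of column-scoped deduplication using pandas, where the `subset` argument plays the role this node's column selection plays (an assumed analogy, not this tool's API):

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "note": ["first", "resubmit", "first"],
})

# Uniqueness defined by (customer_id, email) only; "note" may differ.
unique_orders = orders.drop_duplicates(subset=["customer_id", "email"])

# With no column selection, all columns are compared,
# so these rows all count as distinct (they differ on "note").
all_cols = orders.drop_duplicates()
```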
KeepStrategy
enum: Strategy for handling multiple occurrences of identical rows. Determines which duplicate record to retain.
Options: First ~ Last ~ None ~ Any

First: Retain the earliest occurrence of duplicate rows. Useful for:
- Preserving original entries
- First-touch attribution
- Initial occurrence tracking

Last: Retain the most recent occurrence of duplicate rows. Suitable for:
- Latest state retention
- Most recent update keeping
- Final value preservation

None: Remove all duplicate rows, including the originals. Used for:
- Strict uniqueness enforcement
- Complete duplicate elimination
- Unique-only analysis

Any: Keep an arbitrary occurrence of duplicate rows. Appropriate for:
- Performance optimization
- When order doesn't matter
- Cases where any representative row suffices
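The keep strategies can be sketched with pandas' drop_duplicates(). Note that pandas has no direct "Any" equivalent; under that strategy any kept occurrence is acceptable, so "first" would be one valid outcome (an assumption for illustration, not this tool's API):

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "A", "C"], "value": [1, 2, 1, 3]})

first = df.drop_duplicates(keep="first")  # keeps row 0 of the A-duplicates
last = df.drop_duplicates(keep="last")    # keeps row 2 of the A-duplicates
none = df.drop_duplicates(keep=False)     # drops every duplicated row entirely
```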
MaintainOrder
bool: Preserve the original row order after deduplication:
- false (default): Faster processing; may reorder rows
- true: Maintains the original sequence, at the cost of slower performance

Note: Not supported for DataFrames containing List-type columns.
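A pure-Python sketch of why order preservation costs extra: an order-preserving pass must remember first occurrences in sequence, while an order-free pass can use any set-like structure, which is analogous to the shortcut a faster engine path can take. This is an illustration of the concept, not this tool's engine:

```python
rows = [("A", 1), ("B", 2), ("A", 1), ("C", 3)]

# Order-preserving: dict keys remember insertion order (Python 3.7+),
# so the first occurrence of each row is kept, in original sequence.
ordered = list(dict.fromkeys(rows))

# Order-free: a set yields the same unique rows, but its iteration
# order is arbitrary, which is what leaves room for a faster path.
unordered = set(rows)
```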