ZinkML Graph Dataflow Studio

Overview

The Graph Dataflow Studio is ZinkML's integrated environment for data preparation, processing, and modeling. It lets users build data pipelines visually, as graphs of connected nodes on a canvas.

Video Tutorial

For a visual guide to Dataflows, watch our step-by-step video tutorial:

Table of Contents

  1. Getting Started
  2. Loading Tables
  3. Data Processing
  4. Modeling Operations
  5. Execution Management
  6. Dataflow Operations
  7. Deployment
  8. Dataflow Management

Getting Started

Initial Setup

  1. Navigate to 'Graph Dataflow Studio'
  2. Create New Dataflow:
    a. Enter unique dataflow name
    b. Click 'Create New Dataflow'
    c. Access blank canvas workspace
    

Interface Overview

The studio interface consists of:

  • Left Panel: Data, Core, and Model tabs
  • Central Canvas: Dataflow workspace
  • Top Panel: Execution logs and results

Loading Tables

Data Tab Navigation

  1. Access left vertical tab structure:

    - Data Tab: Available datasets and tables
    - Core Tab: Processing operators
    - Model Tab: Machine learning algorithms
    
  2. Table Access Options:

    • Owned datasets
    • Shared datasets
    • Public datasets
    • Search functionality
    • Filter options

Table Loading Process

  1. Locate desired table in Data tab

  2. Implementation:

    a. Drag and drop selected table node to canvas
    b. Click 'Execute' to load data
    c. Wait for execution to complete
    
  3. Data Visualization:

    • Table View: Raw data examination
    • Plots View: Visual data analysis
    • Sample data schema and statistics preview
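The schema and statistics preview shown after loading a table can be pictured with a small sketch (plain Python, purely illustrative; ZinkML computes this for you in the Table and Plots views):

```python
from statistics import mean

def preview(rows):
    """Summarize a loaded table the way a schema/statistics preview might:
    column names, inferred types, and basic stats for numeric columns."""
    summary = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        if all(isinstance(v, (int, float)) for v in values):
            summary[col] = {"type": "numeric", "min": min(values),
                            "max": max(values), "mean": mean(values)}
        else:
            summary[col] = {"type": "text", "distinct": len(set(values))}
    return summary

rows = [
    {"city": "Oslo", "temp": 4.0},
    {"city": "Rome", "temp": 16.0},
    {"city": "Oslo", "temp": 6.0},
]
print(preview(rows))
```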

Data Processing

Creating Data Pipelines

  1. Operator Selection:

    a. Access Core tab
    b. Browse available operators
    c. Search for specific functions
    
  2. Node Connection Process:

    a. Add processing nodes to canvas
    b. Connect output sockets to input sockets
    c. Create logical data flow
    d. Verify connections
    
  3. Data Flow Architecture:

    • Source node → Edge → Target nodes
    • Multiple target connections possible
    • Branching workflows supported
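The source → edge → target architecture above can be sketched as an edge list, with branching shown as one source output feeding multiple targets (node names here are made up for illustration):

```python
# A dataflow is a directed graph: each edge runs from a source node's
# output socket to a target node's input socket, and one output may
# feed several targets (branching workflows).
edges = [
    ("load_table", "clean"),
    ("clean", "train_model"),   # branch 1
    ("clean", "plot_stats"),    # branch 2
]

def targets_of(node, edges):
    """All nodes fed directly by `node` (multiple targets allowed)."""
    return [dst for src, dst in edges if src == node]

print(targets_of("clean", edges))  # both branches of the workflow
```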

Execution and Validation

  1. Processing Steps:

    a. Click 'Execute' button
    b. Monitor processing status
    c. Review execution logs
    d. Verify output results
    
  2. Parameter Management:

    • Adjust node parameters
    • Re-run dataflow after parameter changes
    • A new version is created for each completed execution
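The adjust-parameters / re-run / new-version cycle can be sketched as follows (a minimal illustration of the behavior, not ZinkML's actual implementation):

```python
class Dataflow:
    """Sketch: each completed execution produces a new version."""
    def __init__(self):
        self.version = 0
        self.params = {}

    def run(self, **params):
        self.params.update(params)   # adjusted node parameters
        # ... node execution would happen here ...
        self.version += 1            # one new version per completed run
        return self.version

flow = Dataflow()
flow.run(threshold=0.5)
flow.run(threshold=0.7)   # re-run after a parameter change
print(flow.version)
```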

Modeling Operations

Model Implementation

  1. Model Node Addition:

    a. Select from Model tab
    b. Add to canvas
    c. Connect to data pipeline
    
  2. Parameter Configuration:

    • Default Settings
    • Custom Parameters, for example:
      - Cross-validation settings
      - Train-test split settings
      - Learning rate
      - Batch size
      - Optimization settings
      
    • Grid Search Options:
      - Parameter ranges
      - Cross-validation settings
      - Train-test split settings
      - Search strategies
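Grid search over parameter ranges amounts to enumerating every combination of the configured values; a stdlib-only sketch (the parameter names and ranges below are illustrative, not ZinkML defaults):

```python
from itertools import product

# Hypothetical grid: the parameter ranges a grid search would sweep.
grid = {
    "learning_rate": [0.01, 0.1],
    "batch_size": [32, 64],
}

# Exhaustive search strategy: one candidate per combination of ranges.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(candidates))  # 2 x 2 = 4 combinations to evaluate
```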
      

Execution Management

Execution Table Details

| Column             | Description          | Example                              |
|--------------------|----------------------|--------------------------------------|
| Version            | Dataflow iteration   | v1.2.3                               |
| Nodes              | Active node count    | 15 nodes                             |
| Rows Processed     | Data volume          | 1M rows                              |
| User               | Execution initiator  | john@zinkml.com                      |
| Status             | Current state        | Running / Complete / Failed / Queued |
| Deployment Actions | Available operations | Deploy / Predict                     |

Dataflow Operations

See the video tutorial for detailed information on the following operations:

  • Add nodes
  • Select graph portions
  • Delete nodes and edges
  • Replicate selected portion of graph
  • Change parameters for respective nodes
  • Check all nodes on the left tab
  • See collaborators (users) with access to this dataflow
  • Download the dataflow
  • Reposition the dataflow in 'Pretty format' on the canvas
  • Refresh to pull the latest version of this dataflow
  • Fork the dataflow: copy it as a new dataflow and make edits in the copy
  • Node Operations
    • Visualize input tables
    • Visualize output tables
    • Right click to:
      • Check error reason (if the execution failed)
      • Execute this node only (runs exactly the nodes required to execute this node, and nothing else)
      • Replicate this node on the canvas
    • Delete the node
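"Execute this node only" runs the selected node plus everything upstream that it depends on; conceptually, that is the set of ancestors in the graph (a sketch with made-up node names):

```python
def required_nodes(target, edges):
    """Nodes needed to execute `target`: the target plus all upstream
    ancestors, found by walking the edge list backwards."""
    needed, stack = {target}, [target]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if dst == node and src not in needed:
                needed.add(src)
                stack.append(src)
    return needed

edges = [("load", "clean"), ("clean", "train"), ("clean", "plot")]
print(required_nodes("train", edges))  # 'plot' is not required
```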

Deployment

Deployment Process

a. Go to the Execution ('Runs') table
b. Select an executed version with 'Completed' status
c. Click the 'Deploy' action button
d. Once deployed, click 'Predict' to use the deployed dataflow to process new tables.
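If a deployed dataflow is also reachable programmatically, a 'Predict' call might be assembled as below. Note that the endpoint path, base URL, and payload shape are hypothetical, since ZinkML's deployment API is not documented in this guide:

```python
def build_predict_request(base_url, dataflow_id, version, table_name):
    """Assemble the request one might send to a deployed dataflow
    to process a new table. Hypothetical URL scheme and payload."""
    url = f"{base_url}/dataflows/{dataflow_id}/versions/{version}/predict"
    payload = {"input_table": table_name}
    return url, payload

url, payload = build_predict_request(
    "https://api.example.com", "df-123", "v1.2.3", "new_customers"
)
print(url)
```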

Dataflow Management

Dataflow Table Information

| Feature         | Description             |
|-----------------|-------------------------|
| Name            | Dataflow identifier     |
| Version         | Current version         |
| Nodes & Edges   | Structure details       |
| Latest Run      | Most recent execution   |
| Status          | Current state           |
| Updated/Created | Timestamps              |
| Access Status   | Private/Public/Shared   |
| Actions         | Collaboration tools     |

Management Options

  • Download dataflow
  • Delete dataflow
  • Collaboration tools
  • Access control

Best Practices

Development Guidelines

  1. Design Principles:

    - Modular design
    - Clear documentation
    - Regular testing
    - Version control
    
  2. Performance Optimization:

    - Resource efficiency
    - Pipeline optimization
    - Caching strategies
    - Error handling
    

Additional Resources