ZinkML Graph Dataflow Studio

Overview

The Graph Dataflow Studio is ZinkML's integrated environment for data preparation, processing, and modeling. It lets users build data pipelines visually, as graphs of connected nodes on a canvas.

Video Tutorial

For a visual guide to Dataflows, watch our step-by-step video tutorial:

Table of Contents

  1. Getting Started
  2. Loading Tables
  3. Data Processing
  4. Modeling Operations
  5. Execution Management
  6. Dataflow Operations
  7. Deployment
  8. Dataflow Management

Getting Started

Initial Setup

  1. Navigate to 'Graph Dataflow Studio'
  2. Create New Dataflow:
    a. Enter unique dataflow name
    b. Click 'Create New Dataflow'
    c. Access blank canvas workspace
    

Interface Overview

The studio interface consists of:

  • Left Panel: Data, Core, and Model tabs
  • Central Canvas: Dataflow workspace
  • Top Panel: Execution logs and results

Loading Tables

Data Tab Navigation

  1. Access left vertical tab structure:

    - Data Tab: Available datasets and tables
    - Core Tab: Processing operators
    - Model Tab: Machine learning algorithms
    
  2. Table Access Options:

    • Owned datasets
    • Shared datasets
    • Public datasets
    • Search functionality
    • Filter options

Table Loading Process

  1. Locate desired table in Data tab

  2. Implementation:

    a. Drag and drop selected table node to canvas
    b. Click 'Execute' to load data
    c. Wait for execution to complete
    
  3. Data Visualization:

    • Table View: Raw data examination
    • Plots View: Visual data analysis
    • Sample data schema and statistics preview
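The schema and statistics preview shown after loading a table can be pictured with a small sketch (plain Python, purely illustrative; ZinkML computes this for you in the Table and Plots views):

```python
from statistics import mean

def preview(rows):
    """Summarize a loaded table the way a schema/statistics preview might:
    column names, inferred types, and basic stats for numeric columns."""
    summary = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        if all(isinstance(v, (int, float)) for v in values):
            summary[col] = {"type": "numeric", "min": min(values),
                            "max": max(values), "mean": mean(values)}
        else:
            summary[col] = {"type": "text", "distinct": len(set(values))}
    return summary

rows = [
    {"city": "Oslo", "temp": 4.0},
    {"city": "Rome", "temp": 16.0},
    {"city": "Oslo", "temp": 6.0},
]
print(preview(rows))
```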

Data Processing

Creating Data Pipelines

  1. Operator Selection:

    a. Access Core tab
    b. Browse available operators
    c. Search for specific functions
    
  2. Node Connection Process:

    a. Add processing nodes to canvas
    b. Connect output sockets to input sockets
    c. Create logical data flow
    d. Verify connections
    
  3. Data Flow Architecture:

    • Source node → Edge → Target nodes
    • Multiple target connections possible
    • Branching workflows supported
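The source → edge → target architecture above can be sketched as an edge list, with branching shown as one source output feeding multiple targets (node names here are made up for illustration):

```python
# A dataflow is a directed graph: each edge runs from a source node's
# output socket to a target node's input socket, and one output may
# feed several targets (branching workflows).
edges = [
    ("load_table", "clean"),
    ("clean", "train_model"),   # branch 1
    ("clean", "plot_stats"),    # branch 2
]

def targets_of(node, edges):
    """All nodes fed directly by `node` (multiple targets allowed)."""
    return [dst for src, dst in edges if src == node]

print(targets_of("clean", edges))  # both branches of the workflow
```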

Execution and Validation

  1. Processing Steps:

    a. Click 'Execute' button
    b. Monitor processing status
    c. Review execution logs
    d. Verify output results
    
  2. Parameter Management:

    • Adjust node parameters
    • Re-run dataflow after parameter changes
    • A new version is created for each completed execution
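The adjust-parameters / re-run / new-version cycle can be sketched as follows (a minimal illustration of the behavior, not ZinkML's actual implementation):

```python
class Dataflow:
    """Sketch: each completed execution produces a new version."""
    def __init__(self):
        self.version = 0
        self.params = {}

    def run(self, **params):
        self.params.update(params)   # adjusted node parameters
        # ... node execution would happen here ...
        self.version += 1            # one new version per completed run
        return self.version

flow = Dataflow()
flow.run(threshold=0.5)
flow.run(threshold=0.7)   # re-run after a parameter change
print(flow.version)
```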

Modeling Operations

Model Implementation

  1. Model Node Addition:

    a. Select from Model tab
    b. Add to canvas
    c. Connect to data pipeline
    
  2. Parameter Configuration:

    • Default Settings
    • Custom Parameters, for example:
      - Cross-validation settings
      - Train-test split settings
      - Learning rate
      - Batch size
      - Optimization settings
      
    • Grid Search Options:
      - Parameter ranges
      - Cross-validation settings
      - Train-test split settings
      - Search strategies
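Grid search over parameter ranges amounts to enumerating every combination of the configured values; a stdlib-only sketch (the parameter names and ranges below are illustrative, not ZinkML defaults):

```python
from itertools import product

# Hypothetical grid: the parameter ranges a grid search would sweep.
grid = {
    "learning_rate": [0.01, 0.1],
    "batch_size": [32, 64],
}

# Exhaustive search strategy: one candidate per combination of ranges.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(candidates))  # 2 x 2 = 4 combinations to evaluate
```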
      

Execution Management

Execution Table Details

| Column             | Description          | Example                              |
|--------------------|----------------------|--------------------------------------|
| Version            | Dataflow iteration   | v1.2.3                               |
| Nodes              | Active node count    | 15 nodes                             |
| Rows Processed     | Data volume          | 1M rows                              |
| User               | Execution initiator  | john@zinkml.com                      |
| Status             | Current state        | Running / Complete / Failed / Queued |
| Deployment Actions | Available operations | Deploy / Predict                     |

Dataflow Operations

See the video tutorial for detailed information on the following operations:

  • Add nodes
  • Select graph portions
  • Delete nodes and edges
  • Replicate selected portion of graph
  • Change parameters for respective nodes
  • Check all nodes on the left tab
  • See collaborators (users) with access to this dataflow
  • Download the dataflow
  • Reposition the dataflow in 'Pretty format' on the canvas
  • Refresh to pull the latest version of this dataflow
  • Fork the dataflow: copy it as a new dataflow and make edits in the copy
  • Node Operations
    • Visualize input tables
    • Visualize output tables
    • Right click to:
      • Check error reason (if the execution failed)
      • Execute this node only (runs exactly the nodes required to execute this node, and nothing else)
      • Replicate this node on the canvas
    • Delete the node
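"Execute this node only" runs the selected node plus everything upstream that it depends on; conceptually, that is the set of ancestors in the graph (a sketch with made-up node names):

```python
def required_nodes(target, edges):
    """Nodes needed to execute `target`: the target plus all upstream
    ancestors, found by walking the edge list backwards."""
    needed, stack = {target}, [target]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if dst == node and src not in needed:
                needed.add(src)
                stack.append(src)
    return needed

edges = [("load", "clean"), ("clean", "train"), ("clean", "plot")]
print(required_nodes("train", edges))  # 'plot' is not required
```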

Deployment

Deployment Process

a. Go to the Execution ('Runs') table
b. Select an executed version with 'Completed' status
c. Click the 'Deploy' action button
d. Once deployed, click 'Predict' to use the deployed dataflow to process new tables.
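If a deployed dataflow is also reachable programmatically, a 'Predict' call might be assembled as below. Note that the endpoint path, base URL, and payload shape are hypothetical, since ZinkML's deployment API is not documented in this guide:

```python
def build_predict_request(base_url, dataflow_id, version, table_name):
    """Assemble the request one might send to a deployed dataflow
    to process a new table. Hypothetical URL scheme and payload."""
    url = f"{base_url}/dataflows/{dataflow_id}/versions/{version}/predict"
    payload = {"input_table": table_name}
    return url, payload

url, payload = build_predict_request(
    "https://api.example.com", "df-123", "v1.2.3", "new_customers"
)
print(url)
```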

Dataflow Management

Dataflow Table Information

| Feature         | Description             |
|-----------------|-------------------------|
| Name            | Dataflow identifier     |
| Version         | Current version         |
| Nodes & Edges   | Structure details       |
| Latest Run      | Most recent execution   |
| Status          | Current state           |
| Updated/Created | Timestamps              |
| Access Status   | Private/Public/Shared   |
| Actions         | Collaboration tools     |

Management Options

  • Download dataflow
  • Delete dataflow
  • Collaboration tools
  • Access control

Best Practices

Development Guidelines

  1. Design Principles:

    - Modular design
    - Clear documentation
    - Regular testing
    - Version control
    
  2. Performance Optimization:

    - Resource efficiency
    - Pipeline optimization
    - Caching strategies
    - Error handling
    

Additional Resources