Dataiku Data Science Studio (DSS) is a platform that tries to span the needs of data scientists, data engineers, business analysts, and AI consumers. It mostly succeeds. In addition, Dataiku DSS tries to span the machine learning process from end to end, i.e. from data preparation through MLOps and application support. Again, it mostly succeeds.
The Dataiku DSS user interface is a combination of graphical elements, notebooks, and code, as we’ll see later on in the review. As a user, you often have a choice of how you’d like to proceed, and you’re usually not locked into your initial choice, given that graphical choices can generate editable notebooks and scripts.
During my initial discussion with Dataiku, their senior product marketing manager asked me point blank whether I preferred a GUI or writing code for data science. I said “I usually wind up writing code, but I’ll use a GUI whenever it’s faster and easier.” This met with approval: Many of their customers have the same pragmatic attitude.
Dataiku competes with pretty much every data science and machine learning platform, but also partners with several of them, including Microsoft Azure, Databricks, AWS, and Google Cloud. I consider KNIME similar to DSS in its use of flow diagrams, and at least half a dozen platforms similar to DSS in their use of Jupyter notebooks, including the four partners I mentioned. DSS is similar to DataRobot, H2O.ai, and others in its implementation of AutoML.
Dataiku DSS features
Dataiku says that its key capabilities are data preparation, visualization, machine learning, DataOps, MLOps, analytic apps, collaboration, governance, explainability, and architecture. It supports additional capabilities through plug-ins.
Dataiku data preparation features a visual flow where users can build data pipelines with datasets, recipes to join and transform datasets, plus code and reusable plug-in elements.
Dataiku does quick visual analysis of columns, including the distribution of values, top values, outliers, invalids, and overall statistics. For categorical data, the visual analysis includes the distribution by value, with the count and percentage of rows for each value. The visualization capabilities let you perform exploratory data analysis without resorting to Tableau, although Dataiku and Tableau are partners.
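To make the categorical summary concrete, here is a minimal sketch of the same calculation in plain Python. It is not how DSS computes it internally; the function name and the sample column are hypothetical.

```python
from collections import Counter


def value_distribution(values):
    """Return (value, count, percent) tuples, most frequent value first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (value, n, round(100.0 * n / total, 1))
        for value, n in counts.most_common()
    ]


# Hypothetical categorical column, not data from the tutorial.
campaign = ["email", "web", "email", "store", "email", "web", "email", "web"]

for value, n, pct in value_distribution(campaign):
    print(f"{value:<6} {n:>3} {pct:>5.1f}%")
```

The DSS column analysis shows the same kind of table graphically, alongside the top values and invalid counts.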
Dataiku machine learning includes AutoML and feature engineering, as shown in the figure below. Each Dataiku project has a DataOps visual flow, including the pipeline of datasets and recipes associated with the project.
For MLOps, the Dataiku unified deployer manages project files’ movement between Dataiku design nodes and production nodes for batch and real-time scoring. Project bundles package everything a project needs from the design environment to run on the production environment.
Dataiku makes it easy to create project dashboards and share them with business users. The Dataiku visual flow is the canvas where teams collaborate on data projects; it also represents the DataOps pipeline and provides an easy way to access the details of individual steps. Dataiku permissions control who on the team can access, read, and change a project.
Dataiku provides critical capabilities for explainable AI, including reports on feature importance, partial dependence plots, subpopulation analysis, and individual prediction explanations. These are in addition to providing interpretable models.
DSS has a large collection of plug-ins and connectors. For example, time series prediction models come as a plug-in; so do interfaces to the AI and machine learning services of AWS and Google Cloud, such as Amazon Rekognition APIs for Computer Vision, Amazon SageMaker machine learning, Google Cloud Translation, and Google Cloud Vision. Not all plug-ins and connectors are available to all plans.
Dataiku targets data scientists, data engineers, business analysts, and AI consumers. I went through the Dataiku Data Scientist tutorial, which seems to be the closest match to my skills, and took screenshots as I went.
Dataiku data preparation and visualization
The initial state of the flows in this tutorial reflects having some of the setup, data finding, data cleaning, and joining done by someone else, presumably a data analyst or data engineer. In a team effort, that’s likely. For a solo practitioner, it’s not. Dataiku may support both use cases, but has made a considerable effort to support teams in enterprises.
Clicking into a dataset’s icon in a flow brings it up in a sheet.
Showing the data is useful, but exploratory data analysis is even more useful. Here we are generating a Jupyter notebook for a single dataset, which was in turn created by joining two prepared datasets.
I have to complain a little at this point. All of the prebuilt or generated notebooks I used were written in Python 2, but that’s no longer a valid DSS environment, since Python 2 has (at long last) been deprecated by the Python Software Foundation. I had to edit many notebook cells for Python 3, which was annoying and time-consuming. Fortunately, it was fairly simple: The most frequent fix was to add parentheses around the arguments of print, which became a function in Python 3.
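The best-known Python 2 to Python 3 change of this kind is print becoming a function. A minimal sketch of that fix follows; the notebook line in the comment and the helper name are hypothetical, not copied from the DSS notebooks.

```python
# Python 2 form (a syntax error in Python 3), shown as a hypothetical notebook line:
#     print "rows:", n_rows
#
# Python 3 form: print is a function, so its arguments go in parentheses.


def describe_shape(n_rows, n_cols):
    """Build the message separately so it can be printed or logged."""
    return "rows: {}, columns: {}".format(n_rows, n_cols)


print(describe_shape(1000, 12))
```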
The generated notebook uses standard Python libraries such as Pandas, Matplotlib, Seaborn, and SciPy to handle data, generate plots, and compute descriptive statistics.
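For readers who have not seen this style of notebook, the following sketch shows the kind of descriptive statistics involved, using Pandas only and leaving out the plotting. The data frame and its column names are hypothetical stand-ins, not the tutorial's dataset or the code DSS generates.

```python
import pandas as pd

# Hypothetical stand-in for the joined dataset; column names are illustrative.
df = pd.DataFrame({
    "revenue": [120.0, 85.5, 310.0, 42.0, 199.9, 75.0],
    "pages_visited": [5, 3, 12, 2, 8, 4],
    "campaign": ["yes", "no", "yes", "no", "yes", "no"],
})

# Overall statistics for the numeric columns (count, mean, std, quartiles).
summary = df.describe()
print(summary)

# Distribution of a categorical column, expressed as shares of all rows.
campaign_share = df["campaign"].value_counts(normalize=True)
print(campaign_share)

# Pearson correlation between two numeric columns.
corr = df["revenue"].corr(df["pages_visited"])
print(f"revenue vs. pages_visited correlation: {corr:.2f}")
```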
Dataiku machine learning and model assessment
Before I could do anything with the Model Assessment flow zone, I had to add a recipe to check whether a customer’s revenue is over or under a specific barrier variable, which is defined globally. The recipe created the high_value dataset, which has an additional column for the classification. In general, recipes in a flow (other than data preparation steps that remove rows or columns) do add a column with the new computed values. Then I had to build all the flow outputs reachable from the split step.
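The logic of that recipe can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the recipe from the tutorial: the barrier value, the function name, and the sample rows are all hypothetical, and a real DSS recipe would read the barrier from the project's global variables.

```python
# Hypothetical barrier; in DSS this would come from a globally defined variable.
REVENUE_BARRIER = 170.0


def add_high_value_flag(rows, barrier=REVENUE_BARRIER):
    """Return new rows with a boolean 'high_value' column appended."""
    return [dict(row, high_value=row["revenue"] > barrier) for row in rows]


# Hypothetical input rows, not the tutorial's customer data.
customers = [
    {"customer_id": "c1", "revenue": 250.0},
    {"customer_id": "c2", "revenue": 90.0},
]

labeled = add_high_value_flag(customers)
print(labeled)
```

The input rows are left untouched and the output gains one computed column, which mirrors how a recipe in a flow produces a new dataset instead of modifying its input.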
Dataiku AutoML, interpretable models, and high-performance models
This tutorial moves on to creating and running an AutoML session with interpretable models, such as Random Forest, rather than high-performance models (just a different initial selection of model choices) or deep learning models (Keras/TensorFlow, using Python code). As it turns out, my Booster Plan Dataiku cloud instance didn’t have a Python environment that could support deep learning, and didn’t have GPUs. Both could be added using a more expensive Orbit plan, which also adds distributed Spark support.
I was restricted to in-memory training with Scikit-learn and custom models on two CPUs, which was fine for exploratory purposes. Most of the feature engineering options in the DSS AutoML model were turned off for the purposes of the tutorial. That was fine for learning purposes, but I would have used them for a real data science project.
Dataiku deployment and MLOps
After finding a winning model in the AutoML session, I deployed it and explored some of the MLOps features of DSS, using Scenarios. The scenario supplied with the flow for this tutorial uses a Python script to rebuild the model, and replace the deployed model if the new model has a higher ROC AUC value. The exercise to test this capability uses an external variable to change the definition of a high-value customer, which isn’t all that interesting, but does make the point about MLOps automation.
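The decision rule in that scenario is simple enough to sketch without DSS. The code below is not the tutorial's script; it is a minimal illustration of "replace only if ROC AUC improves," with hypothetical function names and made-up scores, and with ROC AUC computed directly from its pairwise-ranking definition.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count a win for each positive scored above a negative; ties count half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def should_replace(deployed_auc, candidate_auc):
    """Promote the retrained model only if it strictly improves ROC AUC."""
    return candidate_auc > deployed_auc


# Made-up evaluation data: true labels and each model's predicted scores.
labels = [1, 0, 1, 1, 0, 0]
deployed_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.5]
candidate_scores = [0.9, 0.2, 0.6, 0.7, 0.3, 0.5]

deployed_auc = roc_auc(labels, deployed_scores)
candidate_auc = roc_auc(labels, candidate_scores)
print(f"deployed={deployed_auc:.3f} candidate={candidate_auc:.3f}")
print("replace deployed model:", should_replace(deployed_auc, candidate_auc))
```

Using a strict comparison means a retrained model that merely ties the deployed one is not promoted, which avoids churn in production when nothing has improved.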
Overall, Dataiku DSS is a very good, end-to-end platform for data analysis, data engineering, data science, MLOps, and AI browsing. Its self-service cloud pricing is reasonable, but not cheap; the basis for enterprise pricing is reasonable, although I have no concrete information about its actual enterprise pricing.
Dataiku tries hard to support non-programmers in DSS with a graphical UI and visual machine learning. The visual aspects of the product do generate notebooks with code a programmer can customize, which saves a lot of time.
I’m not totally convinced, however, that non-programming “citizen data scientists” can perform data engineering and data science effectively, even with all of the tools and training that Dataiku supplies. Data science teams need at least one member who can program and at least one member with an intuition for feature engineering and model building, not necessarily the same person. In the worst case, you might have to rely on Dataiku’s consultants for guidance.
It’s certainly worth doing a free evaluation of Dataiku DSS. You can use either the downloaded Community Edition (free forever, three users, files or open source databases) or the 14-day hosted cloud trial (five users, two CPUs, 16 GB RAM, 100 GB plus BYO cloud storage).
Hosted self-service cloud plans:
Ignition plan: $348/month, 1 CPU, 8 GB RAM, 100 GB cloud storage, file uploads, DSS plus Python, one user.
Booster plan: $1,128/month, 2 CPUs, 16 GB RAM, 100 GB plus BYO cloud storage, files plus databases plus apps, DSS plus Python plus Snowflake, five users.
Orbit plan: $1,700/month and up, adds Spark, scalable resources, 10 users.
On-premises/own cloud plans:
Community Edition: free, up to three users.
Discover Edition (up to five users), Business Edition (up to 20 users), Enterprise Edition: Subscription-based pricing depends on the license type, the number of users, and the type of users (designers vs. explorers).
Dataiku Cloud; Linux x86-x64, 16 GB RAM; macOS 10.12+ (evaluation only); Amazon EC2, Google Cloud, Microsoft Azure, VirtualBox, VMware. 64-bit JDK or JRE, Python, R. Supported browsers: latest Chrome, Firefox, and Edge.
Copyright © 2021 IDG Communications, Inc.