TOP 6 DATA SCIENCE AND ANALYTICS TRENDS FOR 2021 - How the Data Cloud accelerates machine learning
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
TOP 6 DATA SCIENCE AND ANALYTICS TRENDS FOR 2021 How the Data Cloud accelerates machine learning EBOOK
3 Introduction 4 Shift towards predictive and prescriptive analytics continues 5 Trend #1: Easy-to-use ML tools empower data analysts and data scientists AutoML AI Services via an API 6 Trend #2: A consolidated platform closes the gap between analytics and ML 7 Trend #3: Snowflake’s Data Cloud expands access to new data 8 Trend #4: Data engineering tools decrease drain on data scientist’s time 9 Trend #5: New generation of distributed training frameworks offers compelling alternative to Spark 10 Trend #6: Continuous releases provide new options for ML libraries, tools, and frameworks 11 Accelerate your machine learning in 2021 12 About Snowflake
INTRODUCTION Data science has evolved dramatically over the last 10 years. However, In this ebook, you will learn how: very few organizations have experienced the full business impact • E asy-to-use ML tools and consolidated data platforms empower or competitive advantage from their advanced analytics, despite data analysts and bridge the gap between analytics and ML significant investments in data science and machine learning (ML). The • S nowflake’s Data Cloud can expand data access and data reason? Many of the tools needed to scale ML are too complicated, sharing through a secure ecosystem with access to ready-to-use and necessary skill sets are in short supply. But change is now afoot. third-party data Technology advancements in 2021 will significantly impact the way • D ata engineering tools remove the burden of data prep for data in which data scientists and data analysts work. In 2021, six trends scientists and make repurposing existing work easy have the potential to accelerate ML and move organizations from descriptive and diagnostic analytics (explaining what happened and • N ew distributed training frameworks offer an alternative superior to Spark while delivering up to 2,000x faster why) towards predictive and prescriptive analytics that forecast what performance will happen and also provide powerful pointers on how to change the future. • R apid advancements in ML libraries, tools, and frameworks demonstrate the need for a solution that future-proofs data science and ML investments 3
CHAMPION GUIDES SHIFT TOWARDS PREDICTIVE AND PRESCRIPTIVE ANALYTICS CONTINUES In 2021, the field of data science is poised to Recent studies lend credence to this common Remarkably, advancements made in 2020 point to six finally live up to the high expectations that sentiment. A report from MIT Sloan Management exciting trends for data science and ML in 2021. New many organizations have held for years. Over Review and Boston Consulting Group found that tools and technologies have emerged—and continue only 10% of organizations are seeing significant to be released every month—that accelerate the work the last 10 years, huge investments have financial benefits from their investments in AI,1 and of data scientists while also empowering data analysts been made in data science and ML, guided VentureBeat AI reported that 87% of data science to move beyond descriptive analytics and conduct by the hope that it would transform the way projects never make it into production.2 And in 2019, light data science and ML. companies do business. However, many Gartner pointed to one of the bigger challenges Underlying this acceleration is the cloud. Data organizations continue to feel challenged to organizations face: “Through 2020, 80% of AI scientists and data analysts benefit from cloud drive real business impact with analytics. projects will remain alchemy, run by wizards whose technologies that provide virtually unlimited amounts talents will not scale in the organization.”3 of compute resources. In addition, the cloud enables Organizations invest in data science because it the elimination of data silos by consolidating data promises to bring competitive advantages, but many lakes, data warehouses, and data marts for fast, of the tools and skill sets needed to scale ML have secure, and easy data sharing and analysis in a been missing or in short supply. Data scientists single location. continue to be a sought-after and expensive resource, In short, data is transforming into an actionable asset, and their valuable efforts tend to be relegated to and new tools are using that reality to move the needle time-consuming tasks such as data selection and data with ML. As a result, organizations are on the brink preparation. Conversely, data analysts are in abundant of mobilizing data to not only predict the future but supply in most companies and already know how to to also increase the likelihood of certain outcomes address business problems directly, but they lack the through prescriptive analytics. technical background required to make the jump from analytics to data science to build their own ML models. Here are six trends that will shape data science in 2021 and continue the evolution of analytics towards ML. 4
CHAMPION GUIDES TREND #1: EASY-TO-USE ML TOOLS EMPOWER DATA ANALYSTS AND DATA SCIENTISTST Most organizations employ an abundance of But AutoML isn’t just for analysts. Thanks to its power, or image recognition, AI services simply plug-and-play data analysts and a limited number of data AutoML is making a huge difference for data scientists into an application through an API, which requires no scientists, due in large part to the limited by addressing the busy work that can take up to 80% involvement from a data scientist. of their time, according to Harvard Business Review: Amazon provides a variety of fully managed AI supply and high costs associated with data loading, selecting, preparing, and cleaning data.4 scientists. Since analysts lack the data science services, including Amazon Lex, Polly, Rekognition, By eliminating these time-consuming data chores, Forecast, and Translate.5 To illustrate the value, expertise required to build ML models, data AutoML increases data scientists’ productivity and Rekognition allows an image to be sent from an app scientists have become the de facto bottleneck provides more time to conduct analyses. Additionally, through the API to Amazon; the AI service then for broadening the use of ML. human errors found in manual modeling processes returns a classification and description of what the are eliminated, which improves accuracy. image is. These types of utilities not only save time However, new and improved ML tools are In addition, the one historical flaw of AutoML was and effort, but they also free up data scientists to opening the floodgates on ML by automating that it was seen as a black box, but that challenge focus on building and training models that are highly the technical aspects of data science. Data has been solved. AutoML services now provide customized to their business, rather than re-creating analysts are empowered with access to transparency and explanations for their models, commonly used services. powerful models without needing to manually which is key for auditing and detecting bias. For data AutoML tools and AI services lower the barrier to build them. Specifically, automated machine scientists, AutoML transforms how quickly they can entry for ML, so almost anyone can now access learning (AutoML) and AI services via APIs are build and test multiple models simultaneously. and use data science without requiring an academic removing the need to manually prepare data In 2020, AutoML tools from providers such as background. However, the true power of these tools and then build and train models. DataRobot, Dataiku, and H2O saw significant is unleashed when they are integrated seamlessly advancements, and new solutions were introduced with your existing technologies. With Snowflake such as Amazon SageMaker Autopilot. Partner Connect, organizations can receive faster AUTOML insights from their data through pre-built integrations AutoML tools are aptly named: They automate the AI SERVICES VIA AN API between Snowflake and technology partners’ tasks associated with developing and deploying ML Another approach growing in popularity is AI services, products. Snowflake Partner Connect makes it fast models. Their development is game changing for both which are ready-made models available through APIs. and easy to try new ML tools and services and then data scientists and data analysts. By automating tasks Rather than use your own data to build and train adopt those that best meet your business needs. traditionally done by data scientists, AutoML tools models for common activities, organizations access enable data analysts to access models through an pre-trained models that accomplish specific tasks. entirely graphical interface—without requiring a data Whether an organization needs natural language scientist’s involvement or the need to write code. processing (NLP), automatic speech recognition (ASR), 5
CHAMPION GUIDES TREND #2: A CONSOLIDATED PLATFORM CLOSES THE GAP BETWEEN ANALYTICS AND ML Everyone knows data silos exist within and Much like organizational silos, analytics silos thwart Conversely, data scientists have the ability to build across organizations. However, few realize that collaboration and integration opportunities between ML models that not only predict but also influence these siloes also take the form of “analytics data scientists and data analysts. This situation results business outcomes. However, they are not as well in organizations missing out on the combined power of versed in the dynamic and fluctuating business siloes,” particularly between data scientists these two teams, which is exponentially stronger than environment as data analysts are. Sisu describes data and data analysts. These analytics silos have simply the sum of the two parts. scientists as “narrow-and-deep workers,” and their formed as a result of the different ways the focus frequently results in organizations trying to For example, data analysts leverage data to provide two roles work and their respective skill sets. focus data science efforts at known problems (often key business metrics and answer questions around Data silos are just one part of the difference: why something happened. According to Sisu, data uncovered by data analysts) to maximize their value Data scientists and data analysts use different and contributions rather than potentially wasting time analysts’ superpower is speed, and they use it to data (raw versus processed), data sources (data and effort on the unknown.7 analyze data sets quickly and work with business lakes versus warehouses and marts), languages stakeholders to uncover potential insights.6 While Snowflake’s Data Cloud provides the tools to (Python and Spark versus SQL), and tools (ML their goals are to help companies monetize market help deliver stronger outcomes and scale. Through versus BI). opportunities and improve competitive advantage, Snowflake, analytics silos are eliminated. The same most of the work data analysts do is backwards consistent, governed metrics and data are available for looking because they lack the data science skills both analytics and ML through a shared feature store necessary to build predictive ML models. Instead, data and reuse of data engineering pipelines. When data analysts rely on BI tools whose dashboards have built- science insights are shared in Snowflake’s platform, in limitations. While they can use data to understand data analysts can access and incorporate them into what has already happened, it’s challenging for data dashboards and analysis, thus broadening the scope analysts to be proactive and explore data deeply to of impact of the models the data science team builds. figure out what will happen and how to influence it. 6
CHAMPION GUIDES TREND #3: SNOWFLAKE’S DATA CLOUD EXPANDS ACCESS TO NEW DATA In an IDC report sponsored by Seagate, IDC 1 Snowflake’s Data Cloud is an ecosystem where of business partners, suppliers, and customers, or reports that by 2025, it expects 175 zettabytes Snowflake customers, partners, data providers, from third-party data providers and data service of data to be created worldwide each year,8 and data service providers connect to their own providers. Snowflake Data Marketplace removes data and seamlessly share and consume data and the arduous processes involved in locating the which is approximately four times the amount data services shared by other users. Underpinned right data sets, signing contracts with vendors, produced in 2020 according to the World by Snowflake’s platform, the Data Cloud eliminates and managing the data to make it compatible with Economic Forum.9 The explosion of data barriers presented by siloed data and enables internal data. Instead, data scientists and data produced by technologies such as IoT, social organizations to unify and connect to a single analysts can source new data with ease. In addition media, and mobile devices is opening up vast copy of data. In addition, the Data Cloud is a to Snowflake Data Marketplace, organizations opportunities for data-driven insights into seamless way to derive value from rapidly growing can use private data exchanges to share data with every area of inquiry. commercialized data sets with fast, easy, and trusted partners, suppliers, vendors, and customers governed access. through Snowflake Secure Data Sharing. Of course, it’s virtually impossible for any organization to produce or collect all the data needed to uncover 2 Empowering the Data Cloud is Snowflake Secure External data is available and accessible to all Data business and competitive trends. Increasingly, the Data Sharing, which removes traditional data Cloud users with just a few clicks. Once it’s in the ability to share and join data sets, both within and transfer barriers. With Snowflake, data is generally Data Cloud, data is ready to be shared and consumed. across organizations, is viewed as the best way never copied and transmitted. Instead, users There’s no sending of CSV files or manual version to derive more value from data. That’s why data can share live data from its original location. control. Data scientists can enrich models with scientists and data analysts are continually on the Those granted access simply reference data in a seamless access to almost-unlimited data on any topic, hunt for more data to supplement their ML models controlled and secure manner without latency including real-time and evolving circumstances. For and analyses with external data to improve the or contention from concurrent users. Because example, in 2020, COVID data sets were universally in accuracy of results. changes to data are made to a single version, high demand across all sectors and industries because data remains up to date for all consumers, which organizations needed to analyze the impact of the Snowflake enables secure, governed, compliant, and ensures data models are always using the latest virus on an hour-by-hour basis. seamless access to third-party data in three ways. version. 3 Snowflake Secure Data Sharing is the technology foundation for Snowflake Data Marketplace, which serves as a single location to access live, ready-to-query data. Secure, governed data can be shared with, and received from, an ecosystem 7
CHAMPION GUIDES TREND #4: DATA ENGINEERING TOOLS DECREASE DRAIN ON DATA SCIENTISTS’ TIME Data engineering tasks require an inordinately Finally, direct integrations between data engineering Data engineering tools not only enable reuse of large amount of a data scientist’s attention. tools and ML tools are starting to bring these two prepared data by anyone within an organization, The time commitment for data preparation worlds together. For example, DataRobot acquired but they can leverage the scalability and efficiency Paxata to add data preparation tools alongside its of Snowflake to provide the processing. Snowflake varies. It can range from 45%, according to the AutoML offering,12 and Alteryx is shifting its focus works with various partners that provide feature store Anaconda “2020 State of Data Science” survey from data prep towards an automated, assisted ML solutions to ensure data scientists and data analysts reported by Datanami,10 to 80%, according modeling offering.13 In addition, Amazon SageMaker can reuse consistent features. to a survey conducted by CrowdFlower launched two new services: Data Wrangler,14 to and reported by Forbes,11 but there is little accelerate data prep for ML, and Pipelines,15 as a disagreement among data scientists that continuous integration and continuous delivery (CI/ collecting, organizing, and cleaning data are CD) service for ML. the least enjoyable tasks they undertake. Data scientists are also benefiting from “feature Minimizing this burden is paramount not only stores,” which make it easy to repurpose existing work. for keeping data scientists productive and For example, once a data scientist has converted raw data into a metric (for example, “cost of goods sold”), happy but also for broadening access to ML. this universal metric can be found quickly and used by everyone else for quick analysis against that data. Not only does this practice save data scientists time and effort, but it also reinforces BI metrics, maintains data governance, and ensures there are no discrepancies across the work done by data analysts and data scientists. 8
CHAMPION GUIDES TREND #5: NEW GENERATION OF DISTRIBUTED TRAINING FRAMEWORKS OFFERS COMPELLING ALTERNATIVE TO SPARK Data scientists are always looking for strategic One approach that’s gaining attention is Dask, a The impact of these distributed training frameworks ways to inject efficiency into training and distributed training framework built in Python.16 Dask is already being seen in the real world. Walmart uses deploying models. Recently, a new generation is designed to enable data scientists to improve model RAPIDS with Dask and XGBoost (an ML algorithm) accuracy faster. Data scientists can do everything in for its data analytics and ML, and NVIDIA reports that of distributed training engines has surfaced Python end to end, which means they no longer need Walmart has found that “one GPU server requires that delivers on that goal by providing to convert their code to execute in Spark. The result only four percent of the time needed to run the same tremendous speed and performance gains is reduced complexity and increased efficiency. forecasting models vs a 20-node CPU server.”19 That over Apache Spark. translates to Walmart running models in four hours Another open source Python framework is RAPIDS, that previously took several weeks using CPUs. which is built on top of Dask.17 RAPIDS optimizes compute time and speed by providing data pipelines While organizations are thinking strategically about and executing data science code entirely on graphics training frameworks, some have run into barriers in processing units (GPUs) rather than CPUs. Saturn the past. Today, new technologies are unlocking what’s Cloud recently compared RAPIDS to Spark and possible and demonstrating how much faster things discovered that model training with RAPIDS took one can be when everything is done directly with Python. second on a 20-node GPU cluster, while Spark took By eliminating the need to convert models into Spark, 37 minutes on a similarly priced 20-node CPU cluster. organizations are reducing complexity and increasing Saturn Cloud concluded that RAPIDS enables 2,000x efficiency. And it’s easy to try different distributed faster processing using GPUs while costing a fraction training frameworks on Snowflake’s platform to find of the price.18 what works best. 9
CHAMPION GUIDES TREND #6: CONTINUOUS RELEASES PROVIDE NEW OPTIONS FOR ML LIBRARIES, TOOLS, AND FRAMEWORKS The field of data science is evolving rapidly. That’s why it’s important to select a platform that In addition to the underlying architecture, Snowflake Not only are new ML and AI developments is vendor-, framework-, and algorithm-agnostic. By supports data science in a variety of ways. released every month, but new startups, tools, choosing a future-proof platform, you ensure that • nowflake’s External Functions allow any S upcoming ML tools will continue to work seamlessly third-party, hosted, or custom ML service to be and solutions emerge regularly. With the rapid with the platform you have. After all, the last thing accessed easily using SQL. pace of innovation occurring in this space, you want to do is re-platform in order to use the next • ecognizing that various teams may prefer R it’s imperative not to get locked into using a generation of tools. languages other than SQL, Snowpark extends single tool. language support for Java, Scala, and, soon, What makes Snowflake’s platform unique is its modern architecture. Designed with separate, but logically Python. Snowpark allows data scientists to write code in their language of choice using familiar integrated, compute and storage, Snowflake eliminates programming concepts, such as DataFrames, and the manual cluster-building efforts that other systems then execute data preparation and workloads must perform to make separate layers work together. directly on Snowflake. As a result, Snowflake provides a multi-cluster, shared data architecture that provides nearly infinite • J ava user-defined functions (UDF) are supported to enable trained models to run within Snowflake. scalability, instant elasticity, and extremely high levels That means models built and trained in an ML of concurrency to power the Data Cloud. partner’s technology can be brought into and run directly on Snowflake resources. 10
CHAMPION GUIDES ACCELERATE YOUR MACHINE LEARNING IN 2021 It’s remarkable how quickly data science has Today, a modern platform is a necessity if you want With Snowflake, limitations on data science become mainstream. In the last 10 years, to analyze and share data quickly and scalably with are removed. Are you ready to accelerate your companies have shifted their focus from security and governance built in. Snowflake provides machine learning? an architecture that enables data consolidation, reporting and historical analysis to conducting efficient data preparation, and an extensive partner data science with advanced mathematical ecosystem. Your data is mobilized, which allows you to models and ML. The cloud changed everything. benefit immediately from new trends in data science With the ability to inexpensively collect and and ML. store more and more data came the need to build data models powered by ML. 11
ABOUT SNOWFLAKE Snowflake delivers the Data Cloud—a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Wherever data or users live, Snowflake delivers a single and seamless experience across multiple public clouds. Snowflake’s platform is the engine that powers and provides access to the Data Cloud, creating a solution for data warehousing, data lakes, data engineering, data science, data application development, and data sharing. Join Snowflake customers, partners, and data providers already taking their businesses to new frontiers in the Data Cloud. snowflake.com © 2021 Snowflake Inc. All rights reserved. Snowflake, the Snowflake logo, and all other Snowflake product, feature and service names mentioned herein are registered trademarks or trademarks of Snowflake Inc. in the United States and other countries. All other brand names or logos mentioned or used herein are for identification purposes only and may be the trademarks of their respective holder(s). Snowflake may not be associated with, or be sponsored or endorsed by, any such holder(s). CITATIONS 1 on.bcg.com/2Kh3grW 5 amzn.to/3sv5qFx 9 bit.ly/3ss1tld 13 zdnet.com/article/where-is-alteryx-heading 17 rapids.ai bit.ly/3qpSFKQ 2 6 bit.ly/2XJOa1u 10 bit.ly/3oQvitn 14 aws.amazon.com/sagemaker/data-wrangler 18 bit.ly/3srNbRv gtnr.it/2N4FZul 3 7 bit.ly/3swABAn 11 bit.ly/3qpT9AI 15 aws.amazon.com/sagemaker/pipelines 19 blogs.nvidia.com/blog/2019/07/11/walmart-nvidia 4 bit.ly/39Cnaq8 8 bit.ly/2LURiog 12 tcrn.ch/2PFKZDf 16 dask.org
You can also read