Leveraging Open Source in Watson X
Key Points
- IBM is extending its long‑standing open‑source heritage to Watson X, using community‑driven tools to deliver the best AI models and innovation.
- Watson X’s model‑training and validation layer is built on the open‑source CodeFlare project, which abstracts scaling, queuing and deployment by integrating Ray, Kubernetes (OpenShift) and PyTorch.
- CodeFlare automatically provisions clusters, queues jobs, scales resources up or down when needed, and tears down the environment after training, freeing data scientists from infrastructure concerns.
- The platform represents and runs models with PyTorch, leveraging its tensor operations, GPU support and distributed‑training capabilities for large foundation models.
- Complementary open‑source components also handle model tuning/inferencing and data gathering/analytics, completing an end‑to‑end AI lifecycle in Watson X.
**Source:** [https://www.youtube.com/watch?v=Cgiqx0pJuLo](https://www.youtube.com/watch?v=Cgiqx0pJuLo) **Duration:** 00:07:31

Sections
- [00:00:00](https://www.youtube.com/watch?v=Cgiqx0pJuLo&t=0s) **IBM Watson X Open‑Source Stack** - The segment explains how IBM's Watson X platform uses open-source tools, especially the CodeFlare project integrating Ray, Kubernetes (OpenShift) and PyTorch, to streamline model training, tuning/inferencing, and data analytics for large foundation models.

Full Transcript
IBM has a rich history of both contributing to open source and leveraging open source in its offerings, and IBM continues that tradition with Watson X.

What is Watson X? Well, that's our new enterprise platform for AI and data. And why do we leverage open source in Watson X? Well, open source gives us the best AI, it gives us the best innovation, and it gives us the best models.
So today we're going to look at the open source that's in Watson X, and we're going to look at it from three different aspects: model training and validation, model tuning and inferencing, and data gathering and analytics.

Okay, let's get started with model training and validation.
Training and validating models can take a large amount of cluster resources, especially when the models we're looking at are those huge multi-billion-parameter foundation models that everyone's talking about. So to efficiently use a cluster, and to make it easier for data scientists, we have an open source project called CodeFlare.

CodeFlare provides user-friendly abstractions for scaling, queuing and deploying machine learning workloads. It integrates Ray, KubeRay and PyTorch to provide these features. With Ray, it provides a job abstraction; KubeRay allows Ray to run on Kubernetes platforms like OpenShift; and we'll talk a little bit more about PyTorch in a minute.
Let's look at a typical CodeFlare use case. The first thing it allows us to do is spin up a Ray cluster. It then allows the data scientist to submit training jobs to that cluster. If the OpenShift cluster is heavily used and there aren't resources available, CodeFlare is able to queue the jobs and wait until resources free up to run them. In some cases, if the cluster is full, it's even possible to scale up the cluster from CodeFlare. And then, when all the training and validation is done, it can delete the Ray jobs and take them off the cluster.

So again, what's nice about CodeFlare is that it enables the data scientist to efficiently use a cluster, or in some cases multiple OpenShift clusters, without worrying about the infrastructure underneath.
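As a rough sketch, that spin-up/tear-down workflow looks something like the following with the open-source CodeFlare SDK. The cluster name, namespace, and worker count here are hypothetical, parameter names vary by SDK version, and a live OpenShift cluster with the CodeFlare stack installed is assumed:

```python
# Hedged sketch only: assumes the codeflare-sdk package and a live
# OpenShift cluster with the CodeFlare operator installed.
from codeflare_sdk import Cluster, ClusterConfiguration

# Hypothetical cluster spec; exact parameter names vary by SDK version.
cluster = Cluster(ClusterConfiguration(
    name="fm-train",
    namespace="default",
    num_workers=2,
))

cluster.up()    # provision the Ray cluster; jobs queue if resources are scarce
# ... the data scientist submits training jobs to the Ray cluster here ...
cluster.down()  # tear the cluster down when training and validation are done
```

The point is that the data scientist only describes the cluster they want; provisioning, queuing, and cleanup are handled for them.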
We just looked at how we run model training and validation on a cluster, but now let's look at how we actually represent those models. The open source project that we use to represent the models is PyTorch.

PyTorch provides some key features for representing models, one of which is tensor support. What's a tensor? Well, it's a huge multi-dimensional array that holds all the weighted values, or probabilities, in the model, the ones we tweak over time to get the model to predict things correctly.
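As a small illustration of tensor support (the values here are invented, and real models hold billions of them):

```python
import torch

# Illustrative only: a tiny 2x3 tensor of "weights"; a real foundation
# model holds billions of such values.
weights = torch.tensor([[0.10, 0.50, -0.20],
                        [0.70, -0.30, 0.05]])

# Training tweaks these values over time; here, a small uniform step
# stands in for what a real gradient update would do.
weights = weights - 0.01 * torch.ones_like(weights)
```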
The other key features that PyTorch provides are GPU support and distributed training. When we train models, we're doing large amounts of computation, and the GPUs that PyTorch is able to use effectively let us do that very efficiently. PyTorch also provides distributed training, so for those large foundation models that wouldn't fit on a single machine, PyTorch enables us to train across a large number of machines.
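A minimal, hedged sketch of that distributed-training API, run here as a single CPU process purely for illustration (real runs launch one process per GPU or machine, for example with torchrun; the address, port, and tiny model are hypothetical):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process illustration only; real jobs set rank/world_size per worker.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

device = "cuda" if torch.cuda.is_available() else "cpu"  # uses the GPU when present
model = torch.nn.Linear(10, 1).to(device)
ddp_model = DDP(model)  # gradients are averaged across all processes

out = ddp_model(torch.randn(4, 10, device=device)).sum()
out.backward()          # the cross-process gradient all-reduce happens here

dist.destroy_process_group()
```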
Let's look at the key features that PyTorch provides. One is neural network creation: there are different types of neural networks, and PyTorch makes it easy to create all the popular ones. PyTorch also provides easy loading of data. Another key feature of PyTorch is training loops: straightforward, easy-to-use training loops that adjust the model's weights to improve its ability to provide accurate inferencing. And finally, PyTorch provides built-in model adjustments; the key one here is automatic gradient calculation. Think back to your calculus days, when you were calculating gradients by hand: having that built in, making the minor tweaks that improve the model over time and turn it into a better predictor, is what PyTorch provides.
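Putting those features together, a toy training loop might look like the following sketch; the data, network shape, and hyperparameters are invented for illustration:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Invented toy regression problem: learn y = 2x + 1.
X = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 2 * X + 1

loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)  # easy data loading
model = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))     # easy network creation
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

first_loss, last_loss = None, None
for epoch in range(50):                 # the training loop
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()                 # autograd computes the gradients for us
        optimizer.step()                # the minor tweaks that improve the model
    if first_loss is None:
        first_loss = loss.item()
    last_loss = loss.item()
```

Over the epochs the loss drops, which is exactly the "tweak the weights over time to become a better predictor" story above.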
We just looked at how to represent models, but now let's look at model tuning and inferencing. What do we mean by this? Well, we want to be able to serve a large number of AI models, and be able to do it at scale on OpenShift.

The first key open source project is KServe ModelMesh. This is what we use to actually serve up the models. Originally there was just KServe, which would put one model in a single pod, so one pod per model. That's not very efficient at all, so KServe was merged with another open source project called ModelMesh, and ModelMesh is much better at efficiently packing thousands of models into a single pod. Between these two technologies, we're able to serve up thousands of models efficiently on an OpenShift cluster.
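For illustration, models served through KServe are typically called over its v2 (Open Inference Protocol) REST API. Here is a hedged sketch of building such a request body; the input name, shape, and values are hypothetical:

```python
import json

# Sketch of a v2 "Open Inference Protocol" request body that
# KServe-served models accept; all concrete values are hypothetical.
payload = {
    "inputs": [{
        "name": "input-0",
        "shape": [1, 4],
        "datatype": "FP32",
        "data": [[0.1, 0.2, 0.3, 0.4]],
    }]
}
body = json.dumps(payload)

# A client would POST this body to the model's inference endpoint,
# e.g. /v2/models/<model-name>/infer
```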
Now, where are we going to find all these models? Well, Hugging Face has over 200,000 open source models. It's typically referred to as the GitHub of models, and IBM has a partnership with Hugging Face, so again, it's a great place to find great models to use with our IBM Watson X offerings.

The other key open source technology we have is Caikit. Caikit is an open source project that provides APIs for prompt tuning. Typically, on the inferencing side you're serving up the models, but in some cases you also need to do a little bit of tuning to improve the results, and Caikit provides tuning APIs to do that.
The next technology is Kubeflow. Kubeflow provides orchestration of machine learning workloads, allowing you to build the machine learning pipelines that you need to make life easy. So again, we have a wonderfully large set of open source projects that provide our prompt tuning and inferencing, all running on OpenShift.
Now let's switch gears and look at data gathering and analytics. The open source project that we use for that is Presto.

What is Presto? Presto is a SQL query engine, and it's used for open data analytics and for the open data lakehouse. Let's look at the key features it provides: it's high performance, it's highly scalable, it provides federated queries, and it's able to query the data where it lives.
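As an illustration of what a federated query looks like, here is a sketch; the catalog, schema, and table names (hive.warehouse.sales, postgresql.crm.customers) are all hypothetical:

```python
# Illustrative federated SQL: one query joins a table stored in a Hive
# catalog with a table stored in a PostgreSQL catalog. All names here
# are hypothetical.
query = """
SELECT c.name, SUM(s.amount) AS total_sales
FROM hive.warehouse.sales AS s
JOIN postgresql.crm.customers AS c
  ON s.customer_id = c.id
GROUP BY c.name
"""

# With a Presto Python client one would connect to the coordinator and
# execute this query; Presto pushes work down to each data source, so
# the data is queried where it lives rather than copied out first.
```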
I hope I've convinced you that Watson X has continued IBM's long tradition of contributing to open source and leveraging open source in its offerings. If you'd like to learn more, please check out the links below.