ONNX and ONNX Runtime

>>We're ready for our 4:00 pm talk, and this is the last talk of the day. But this is going to
be really exciting. We're going to learn about ONNX from Pranav from the ONNX Runtime Team.>>Hi everyone. My name is Pranav Sharma and I work
in the ONNX Runtime Team. This is in the AI platforms
organization under Eric Boyd. So let's get started and I hope
that this gets you excited. This is the last talk, so let's see. So agenda-wise, I'll give you a brief introduction and a survey of the ONNX ecosystem as to what
is the motivation behind it, why we started developing ONNX, then we’ll see how widespread
the adoption is in Microsoft, which teams are using it, how they’re successful, how they have been able to
reduce their COGS, etc. Then we’ll go into
the technical design for both ONNX and ONNX Runtime. I’ll go through what is
the difference between ONNX and ONNX Runtime as we see. So let’s see the introduction
to the ONNX system. This is an advanced ML audience today, so this should be very basic but let me talk about it for 30 seconds. In a typical ML workflow,
you have data collection, you train to fit the model, and then you deploy your
model for inferencing. What typically happens is that
there’s an overlap between the roles along the different
jobs which you have to do. So a data scientist is mostly doing model preparation and an ML
engineer is focused on making sure the models are
running with highest speed and performance on specific
deployment targets. So many teams in Microsoft, they want to use machine learning
to delight their customers, they want to improve their products. But reality, what happens is that different teams are using different frameworks
to train their models. Somebody using PyTorch,
TensorFlow, Scikit, Keras, Learn Nodes, ML.NET,
and so on and so forth. We want to encourage actually teams to use
different training frameworks, whatever works best for them. The other part of the
story is actually deploying your model in production, and making sure it is working
at the highest efficiency. You are making sure that you’re
using the hardware to the fullest. So what really happens is that
so you have the training side, you have the deployment side, and now you have to
think about how to use your trained model
and make sure that it works for the specific
target deployment framework. So with so many different
frameworks coming up, you have to figure out, okay, does this framework support this
specific deployment target? Is it optimized for FPGAs? Is it optimized for OpenVINO? Is it optimized for Cascade
Lake CPUs from Intel? Does this framework offer that? Does that framework offer that? Then coupled with that,
you have a heterogeneous set of
tools which you have to support and maintain going forward. This is really where ONNX comes in, and it sits in the middle and it's
the demultiplexer of all this, where you’re saying that you can train your model in
whatever framework you want, and then convert it into an
interoperable AI model format. So what is ONNX,
really, in its full form? It's Open Neural Network Exchange, and what it is really
trying to represent is a uniform model
representation format. So you take your model
which is trained in any of these frameworks and you
convert it into the ONNX format, and now you can use that to deploy to any of these
deployment targets. The ONNX models are optimized for each of
these deployment targets. We’ll talk about how
these optimizations are done and how we go
about achieving that. So what is ONNX? It's the Open Neural Network Exchange. It's an interoperable
standard format. It is open source. This was started sometime in
December or about late 2017. This was initially
started by Facebook. So they had PyTorch and Caffe, and they wanted to unify these models in for their
own production usage, and then Microsoft joined
in early on with Amazon. So this is a consortium really
of these three companies, and what it is representing is
that it’s the common format for the model and also it specifies
a full operators spec. So every operator is doing one job in the neural
network for example; Convolution, ReLu, all these are the different
operators which we’ll talk about. So since then, we have about more
than 40-50 companies which have joined the consortium
and they are actively participating and investing
in the ONNX format. These are some of the
open source related stuff where now we have open
governance, so you have this. This is probably of less interest to the community here
but I’ll skip that. So now, you have trained your model in your training framework of choice, and now you need to get to an ONNX
model and that’s where we say, how do you get to the ONNX model? So first thing is that we already
have three trained ONNX models uploaded to the ONNX model
zoo and you can go and get them so you have your
famous different ONNX models. They’re the Mask, R-CNN both
from Keras and PyTorch, and various other famous ONNX models. The other way is that you can
use Azure Custom Vision or AutoML to create the model and then
convert it into the ONNX format. The third way and I think this is probably the most popular
one currently which is you take your training framework
and we have a bunch of open source conversion tools which you can use to convert the model
in the specific format into ONNX. Of these, PyTorch actually has functionality where you
can export the model as part of the training job itself, so there is no separate
conversion step required.
The other is, of course, that you can train it on Azure Machine Learning service and then convert it. So later on, I'll talk about a tool which we have recently open-sourced,
which is called Olive, where there is a central place on Azure where you can
go, take your model, convert it to ONNX, and then find out what the
best optimal strategy is. Hey.>>I think that would
[inaudible] TensorFlow? TensorFlow is the [inaudible].>>Yes, TensorFlow is also->>That one may still be
the case of TensorFlow.>>[inaudible].>>No. Okay. Okay. Fine.>>So these are the
different Open-source, every single converter
is Open-source. So LightGBM, LibSVM, Spark ML. Spark ML is in the
preview stage currently. Keras, and TensorFlow,
and Scikit-Learn, yeah. So these are some of the
various ways and the APIs, which are used to convert the models. Chainer, Keras, and
then TensorFlow itself. So we are also building support
for TensorFlow 2.0 as we speak. So that should be out as well. The next step is really influencing and that is
where ONNX Runtime comes in. So you have triangular model, you have converted into ONNX, and now you want to run it as fast as possible on the
device of your choice, or the deployment
target of your choice. ONNX Runtime is the implementation of the ONNX standard and this was also developed around the same time
when ONNX was Open-sourced. So this was Open-sourced
back in December, 2017. Just today, at about
12:00 p.m., we are 1.0. We just released the 1.0
version of ONNX Runtime. So it has the full support of the ONNX spec and every single
operator has been implemented. So given an ONNX model, you can expect the ONNX Runtime to infer the ONNX model completely. The other is that, it is extensible, it is modular, and
later on in the talk, I will talk about the specific
design details of ONNX Runtime of how we implement
this extensibility, and how it is useful for
production deployments. Along with that, this is present
in every single Windows device, it is part of the
OS, it is installed. This is the WinML. WinML is actually a wrapper on
top of the ONNX Runtime. So the core framework, the core library, which WinML
is running is ONNX Runtime. So this is the Windows or
Microsoft offering of Core ML. So we have support for a lot of architectures and
hardware accelerators. So this is cross-platform. So Windows, Linux, Mac, C#, ARM 64, ARM 32, some of these are available as off-the-shelf packages which
you can download directly. For some of the architectures, you just need to build
from the source, we have instructions on our
website, and you can follow them. So as for the hardware accelerators, we have a lot of hardware
vendors which have come forward and they have asked for how to integrate
into ONNX Runtime, so that we can take advantage of
the best hardware capabilities. So we have TensorRT, DirectML is a more recent one
which we released as part of 1.0. This is using the Windows
DirectX technology. The other is Intel's MKL-DNN, nGraph. NUPHAR is also a recent one, which is based on
compilation technology, where the model is
JIT compiled and then run. This uses LLVM and TVM as the underlying frameworks, and
then we also have OpenVINO.
you need to deploy it. You have a bunch of choices and I won’t go through
every single detail here. But we have different ways, and we have published
reference points, and reference notebooks to
deploy to various devices like, Azure Machine Learning service, Ubuntu VMs, Edge Clouds, and Edge IoT devices. We have the complete workflow
which is documented on one of these notebooks which you can run and deploy it on
any of these devices.>>Is there any work for
explainability of models->>Yes, this is coming up
in the next release, yes. We don’t have it yet. So we’re working on something
called interpretability,
you are referring to, which is explainability
of the model itself.>>Yeah, in Scikit-Learn, you’ve bunch of [inaudible] like [inaudible] but when we convert the ONNX model then
we loose this ability of->>Yes, you’re right. Yes. It’s part of our planning for manganese
what we call as an AIR platform. So yeah, it’s coming up soon. So at Microsoft, let’s see what
is the usage at Microsoft so far. Azure ML, WinML, and ML.NET
already support ONNX today. As I said, WinML is part of
every single Windows device. So we’re talking about
800, or I think, 900 million Windows devices. Different products have already
deployed them in production, and I will go through examples
of some of these products, and some of the ware, and how much of the
performance benefits we are able to get with each of these. So we have about 60, I think, it’s about 70 now but
yeah, let’s go with this. So we have about 60 plus models in production which have
been converted from the previous models which
they were running already and about 3x performance
improvement on an average. This is a very rough number
because we are doing an average of so many different
products and different runtimes. But this is a rough number about
3x performance improvements. This is the distribution of where we see what different frameworks were used to convert these
models within Microsoft. So you see 41 percent is about TF, CNTK, PyTorch, and these
different distributions. So one example is the missing
determiner model from Office. This is doing the grammar
checks and corrections. Here we saw a 14.6x performance gain by converting
the model into ONNX, and then running it
using ONNX Runtime. The second example is the Cognitive Service
extraction of text from images. Here we are seeing a 3x performance
gain using ONNX Runtime. Then we have the Bing Q&A list, and Saurabh was actively involved in integrating with us
from the Bing side. So the Q&A part, so here we see a 2.8x
performance improvement. Then we have the Cognitive
Services Computer Vision, annotating
images as you can see. So here we saw a big
reduction in COGS as well, along with latency
and throughput improvements. So this is a survey of the
ONNX and the ONNX Runtime, the ecosystem, at a very
high-level of what these are. Now, I'll go through details of the technical design of how
ONNX has been designed, what it contains, and what
the goals behind it are. So first, some of the design
principles for ONNX were: it has to be interoperable, the format has to be compact. It has to be a cross-platform
representation, it has to be backwards compatible. This is something which is very
important for production usage, especially for revenue services. It should support both
deep learning models as well as traditional
machine learning models. This is also some of the
requirements which came from Microsoft itself is that we do want to support traditional
machine learning models. So with this in background
with the design principles, the spec consists of three parts. The first part is the representation of the data-flow graph itself. So how do you represent it
and can it be extensible? So can you make it extensible enough, can you add new operators in it, and can you add new types in it? So the second part
is the definition of the standard types for each
operator's inputs, outputs, and attributes,
which are stored in the graph. The third is the schema for every single operator which
is present in the ONNX model. The representation which
we chose is protobuf. So Google Protobuf is fairly
popular and you can use any tools to inspect the model itself and see what
is contained in the model. So the model file format. So a little bit about the file
format where it starts with the model structure and the
model has a bunch of attributes. So one is the version, the metadata associated
with the model. So the producer actually can
write a bunch of metadata which can be used later on for
identification of the model. The other is the acyclic
computation graph. We also have a version in
the model but this version is mostly for the sake of the producer to keep track of which version of the model you are producing
at different points of time. So the most important part is
the acyclic dataflow graph. This consists of the inputs, and the outputs of the graph, the list of computation nodes, and the name of the graph. Every computation
node is representing one single operation
in the ONNX graph. So this is corresponding
to convolution, BatchNormalization, ReLU,
and so on and so forth. Each input and output has a
type associated with it, along with what parameters it has, and a list of attributes
which are hard-coded in the graph when you create the
model as opposed to taking inputs. So there is a tool which a
Microsoft employee developed called Netron which is a
graphical representation of what the ONNX model looks like, and it's pretty clean and beautiful; it shows you that these are the
inputs and these are the outputs.
So you don't have to manually go and inspect the [inaudible]. So we have support for tensors and the standard tensor types:
complex, uint, bool, string. Then we also support
the non-tensor types such as the Sequences and the Maps. These are mostly used in traditional
machine learning models. So for example, the ZipMap operator
would have an output which is a map of sequences after it
applies the one-hot encoding. So a bunch of these types were incorporated only to support
the machine learning models. Then we have the operator set, and this is an example
of the ReLU spec. So you have the version
of the operator. So the operator is
identified by a domain. So it belongs to a domain. It has the version of it and
then the name of the operator. So ReLU is the name, and we said that this has been
available since version 6, and then it has inputs and outputs.
There is the input. The T corresponds to the type of
the specific input and output, and the types are
specified as tensors of floats or doubles or whatever. Regarding the general guideline around adding operators to ONNX: anyone
can add an operator to ONNX by submitting a PR to the ONNX
open-source GitHub repo. The general guideline
is that it should be the most primitive
operation which you can do, because you should be able to compose different operators and
create an operator itself. So we want the operators to be such that they cannot be
meaningfully further decomposed. When it comes to
performance, actually, you do want to fuse
the ops at some point. We will talk about
fusion a bit later. This is the job of ONNX runtime. The ONNX graph itself is only
representing the data flow. So I think this is a bit stale, but we have more than about 140 operators, and
this is a growing list. As we convert and we encounter more models and more operators
which are not supported in ONNX, we keep adding them to the ONNX spec and we keep
implementing them in ONNX runtime. So today, it supports a variety of scenarios like
for example classification, recommendation engines, and
different language processing. So last year, we added
control flow operators to support LSTM operations
and so on and so forth. There’s also a capability to
add custom operations in ONNX. So not all ops which are
in ONNX might satisfy a specific model, and
in that case you should be able to write a custom op
and incorporate that as well.>>If I have a custom op
somehow registered with you, the runtime should be able to->>Yes.>>Okay. So somehow the
runtime has to know that.>>Yes. So the runtime has APIs to register the
custom op at runtime.>>Okay.>>With the specific schema
and we'll go through the API. So you should be able to
register the custom op. It gets registered in an internal registry which is
separate from the ONNX registry.>>Okay.>>Yes.>>Okay.>>So versioning is actually a very important topic, especially
for production models. You don’t always have
the liberty to keep the model and the runtime
in sync with each other. So at some points of time, you can update the model. At some points of time, you may not be able to update
the runtime and vice versa. So you want to update your models in a way that
they can continue to serve. Parts of your deployment might have an older version of the model running or an older version
of the runtime running. So ONNX versioning is
done at three levels. One is the IR version itself which is the version of the file format. So if the portable
format is changing, we update the version
to specify that. Then there is the Opset version. So there is this
concept of Opsets, where operators belong to a set
which we call the Opset, and every model has to specify: this is the Opset which I support. It says that this is
the domain which I belong to and this is the
Opset version which I have. So as new operators get added, they get added to a
new Opset version. So every single time
we add a new operator and until during the release time, the Opset version gets incremented and that
becomes the new Opset then.>>Are older Opsets strictly subsets of newer ones or
things can get [inaudible].>>They can be subsets.
They are subsets actually.>>Okay.>>Yes, they are
subsets. Let me go back. So here it says that this operator has been
available since version 6. So yes, they become part of
the new Opset automatically. So the other thing is
the Operator version. So the operator is
defined by a three-tuple. So we have the domain, the name of the operator
which we call the Op type, and then
the operator version. So together with these three
different versioning strategies, we have been pretty
successful in running an ONNX model of different Ops
types within ONNX Runtime. There are a bunch of
details around how the versioning should change when you add an ONNX operator,
when you change the spec, when you clarify an unclear spec, or when
you add a new attribute, etc.
You can read more about it on GitHub. So that's kind of the ONNX spec. Then you come into ONNX Runtime, which is the implementation
of the spec. One of the primary goals of the
ONNX runtime has been performance. So one is performance then you have backwards
and forward compatibility. You should be able to run any ONNX model which has been
created since a given version, and then it should be cross-platform. So one of the other key points of ONNX runtime is support for a
hybrid execution of models. So you want to run
the model on the GPU, but it's possible that
CUDA doesn't implement certain operators, or maybe the new operators which are inside ONNX are not
implemented by CUDA yet, and at that point of time, the default implementation inside ONNX Runtime should
be able to take over. Basically, if you give a
model to ONNX runtime, you should be able to
run it whether you’re trying to run it on
GPU or TensorRT and whether the specific
hardware accelerators are able to support these
specific operators or not. So that's what we call the
hybrid execution of models. Then other than that, we have a pluggable
architecture for adding more custom hardware
accelerators as you will. So as we go forward, we will look at the API as to how to add a custom
hardware accelerator. So this gives a brief architecture of how a
model runs inside ONNX Runtime. If you see, there are essentially
two phases of running a model. So you first create a
session with the model, and you load the model. After that, you call the Run APIs. Now, loading the model, the first step is to create an in-memory graph
representation of the Protobuf. So basically, unpacking the Protobuf, you create an efficient
graph representation in memory. After that, this is the point where we go through
the different fusions. We call it model optimization, or other graph transformations. Think of these as compiler
optimizations at various levels, like O1, O2, and O3. So similar to that, we have
graph transformation levels: L1, L2, L3, and so on. Each level corresponds to a series of transformations
which happen on the graph. So for example, L1 will
have transformations like eliminating a bunch of nodes which are not really
relevant for inferencing. So dropouts get eliminated, slice gets eliminated,
a bunch of operators, and we will see a list of
these which are there. Also, you will see fusions
which are happening. So convolution batch
normalization will get fused. Multiplication, add will get fused. So there are different rules, and you can actually plug
in your own rules as well. So if you want a different fusion or a different graph
transformation to apply, we have APIs to actually
add that to ONNX Runtime.
So once the graph has been optimized, that's the time when we begin to partition the graph across the
different hardware accelerators. So the flow generally
works as follows: the user of the API tells
ONNX Runtime that this is the list of hardware
accelerators where I would like the ONNX model to run on, and this is the preferred list. So ONNX Runtime goes through
it in the serial order, and it tries to assign the
nodes of the graph to a specific accelerator, what we call the
execution provider in ONNX Runtime parlance. So if a specific provider is able
to execute that specific node, it will mark it, and then we
move on to the next provider. At the end of this
partitioning process, what you really get is subgraphs which can be
executed by say TensorRT, or subgraphs which are executed
by OpenVINO, or something else. Anything which is not executed by these execution providers
is then executed by the base, default CPU provider, which is in the
core of ONNX Runtime. The CPU provider implements
every single operator. So this is how you guarantee that
your model will always execute if you have given it to ONNX Runtime even if the hardware accelerators
have limited support for various operators. So they may not have been updated
with the latest ONNX spec. So once the model
partitioning is over, we go through one more
level of optimization, where we give an opportunity to the hardware accelerators
to say, “Okay. Is there anything else which
you would like to fuse or you have specific
ways of fusing things?” That’s the time where we apply the execution provider’s
specific optimizations. After that, comes the execution part where we go through all
the nodes sequentially. We have actually two
modes of operation. One is sequential and parallel. If your model is parallel enough, you can enable the
parallel execution mode, where it will try to
run the graph in parallel. Or, it'll go through node
by node sequentially, and within each operator, it will try to parallelize
its operation. So this is the overall view.
So one of the other examples of graph optimization is constant folding, dropout elimination, and identity elimination. Now, I mean, the model partitioning scheme
which I spoke about was, it's mostly user-based,
where the user is actually specifying: these are the
providers which I need to run with. ONNX Runtime doesn't yet
help the user by telling them what
they should be running on. So that's the next phase of the
product which we are working on. So we’ll go through some of these. This is the internal
representation of the IR. We have a read-only
portion of the graph, and then there is the
full graph version which the transformers work on, where they optimize the graph, they rewrite the graph. The second is the graph
partitioning phase. As I said, we have this greedy
scheme of partitioning the graph. Partitioning is really, maybe
we should call it assignment. So the next part. So this is the details
about the optimization. As I said, we have level 0, level 1, and level 2. Level 0, I didn’t speak
about level 0 before, but level 0 is the
transformation of the graph, where if the nodes have been
assigned to different devices, we will insert copy nodes in
the graph such that the data is copied between the different devices without you having to worry about it.>>Just a clarification.
If I give you a big graph, and there’s CPU, there’s a TensorRT, and then there’s an extra GPU, so ONNX tries to figure out
the optimal placement between.>>Today, ONNX Runtime doesn’t
decide the optimal placement.>>Okay.>>It is up to the user. But
there is a tool which we support, which will tell you where
is the optimal placement, which is outside ONNX Runtime, but it’s an open-source tool. Yeah.>>Okay, thank you.>>Yeah. So these are the different graph
optimizations which we run, and as you can see, like Conv-Add fusion,
Conv-Mul fusion, and then Conv-BatchNorm fusion,
and then eliminations. So this is the execution provider or the hardware
accelerator interface. This is how you would write one. So we have two kinds, where we have the kernel-based, where you implement
every single operator inside ONNX Runtime and you
register the execution provider. The other scheme is where
you don’t really have the granularity of implementing a single operator at the ONNX level. So what you do is that you know how to implement a subgraph in one. So that’s the other way how you can implement
an execution provider. You can say that I can
run this subgraph, and this is how to run it. So we have what we
call the compile API for running the execution provider. So some of the providers, like
nGraph, TensorRT, and OpenVINO, use this strategy
because they don’t have operators which are implemented
at the granularity of the ONNX ops. So these are some of the
performance numbers. Obviously, if you run it
on a hardware accelerator, you're going to get
some performance benefits. This is just showing how, using
ONNX Runtime you can leverage these hardware accelerators and
get these performance benefits. So one of the key things is, how do you extend ONNX Runtime? You don’t want to keep
it all stuck into this framework and then
not be able to change it. So there are multiple
extension points, and one is adding
an execution provider. This is where new hardware vendors, Qualcomm, NXP, and so on and so forth, come
to us and they say, "Okay. We want to write
a hardware accelerator for ONNX Runtime and enable it such that you can run the ONNX
models on our hardware." This is the place where you can do it, via the execution provider APIs. The other is, the ONNX spec
operators are not enough. They are evolving, and they
take time to evolve sometimes. But you want to ship a model. You can write a custom op and
register it with the ONNX Runtime. That’s another way to do it. The third is, you can add different optimizations
and transformations. So if you’re not happy with
the current level of fusions, you want to extend level 1, you want to extend level 2, so you can do that through
the optimizations. So we have the APIs. The APIs are really like two-fold. So it’s very simple. You have the session, you create the session, and you give the model
name to the session. This creates a session and loads the model. It runs all the graph
transformations and optimizations, and it makes your model ready to run. The next step is to simply
call run on the model. We have the Python API, which conveniently takes a
NumPy array so that you don't have to convert
it into any other format.
We have it in the C API as well, which is very similar. So you have a session creation, and then you have a call to run. C# is similar: create the
session and run the thing. This is how you would
write custom operators in ONNX Runtime using the C APIs. So, create kernel. A kernel is what we call the instance of an operator in ONNX
Runtime. You specify the compute function, you specify the describe function, and the kernel creation function. So as of today, we
are at 1.0. So all our C APIs are
now ABI compatible, which means in a production
scenario you should be able to take a new version of ONNX Runtime and not worry about breaking binary
compatibility with your product. So you can take advantage of the new performance
benefits which would have come with a newer
version of ONNX Runtime. So API stability is one of the
premier goals of the 1.0 release. We also have full support
for ONNX 1.6 compatibility. A lot of our customers also asked for CentOS 6 support because this was the oldest Linux version which a lot
of enterprise customers run on. So we have support for CentOS 6. Then we have a bunch of execution
providers which were added, like NUPHAR is the JIT compilation
based execution provider. We have DirectML which takes advantage of the direct
text technology. Then we also have the
Arm Compute Library which is currently in PR
and getting reviewed. This is how you accelerate
computations on ARM 64 machines. So this takes advantage of the Arm Compute Library to
speed up operations like convolution and other stuff. So this is another
popular question which we have received: given an ONNX model and I
have a bunch of hardware, how do I decide what should be
my optimal configuration to run? So you have to worry about, for example, how many
threads should I configure? What should be my priority
order in which I should run? Should the ONNX model be
run on TensorRT first, or should I give it to [inaudible]? So we have this tool called ONNX Go Live; it's on Azure and
it's open source. It can be run in
multiple different ways. So what you do is basically, you take your model which has
been trained in any framework, you upload it to this tool and then
it will produce an ONNX model. It will ensure that the conversion is accurate using some test data and it will also do a
performance tuning for you. So given the list of hardware
which you would like to run on, it will produce an optimal
configuration as to this is the priority
order in which you should register the
execution providers. These are the different threading options which you should be using, etc. So I know most people wouldn't want to upload
their models to the Cloud, so there is a command
line version of it as well which you can run. So some of the upcoming things in ONNX is that now we are
adding the Android NNAPI support, which is to make it available
on Android devices to take advantage of the
best capabilities which Android has to offer
for deep learning operations. Then we have Qualcomm
which is coming up. The Java API is
currently in progress. It is in PR getting reviewed. We are also adding
support for training, and this is not the regular
training which we think about, it is more like tuning
the last pieces of your models so that you can use
the output for other operations. Then we continue to do different performance optimizations
to make sure that it is running as fast as possible? So this is all the differences. We are performance tuning
document, it’s very useful. Where people can see various ways to try different options
to tune their ONNX models. That’s about it. Questions.>>[inaudible] I have a
question. So [inaudible] I see most of the optimization work is focused on [inaudible].>>That's right. Yeah. As we see different performance characteristics, we keep optimizing the traditional machine learning operators. I think the fusion optimizations don't really apply so much there, right? But the internal optimizations of the implementations of these operators become very important: how can you parallelize the operations within a certain op to ensure it runs as fast as possible? We also try to compare against the original framework's inferencing options and see whether we are doing better or not. This is part of our conversion process as well. It doesn't make sense to convert a model to the ONNX format, say it is converted, and then have it not perform as well as the original framework.>>So I think you said something
very briefly about training: after I give you an ONNX model, continuing training from that, even if you don't have the original source code, right?>>Yeah.>>That's [inaudible], is that right?>>It's in progress. It is being developed, yeah.>>It's being developed, so it's something that would be supported?>>Yes.>>Obvious.>>Yes. It is being developed as we speak.>>Because that would be super useful.>>Yeah. So you want to tune it for a specific purpose, right?>>Yeah.>>It is not the complete training, but yes.>>So can you include some
vectorization inside the graph? For example, do your inputs have to be just tensors, or can I have text as an input?>>Yes. You mean strings?>>Yeah.>>Of course. So you can send in strings as well.>>What operators do they support, like tokenize?>>Yes. I think it was about six months back when we added exactly what you are speaking of, the tokenization operations. So AutoML is the product in Microsoft which is used to train models automatically given some data. Primarily they use scikit-learn models, and that's when we introduced these operations: the TF-IDF vectorizer, the tokenizer, and all of these things. So these are very well supported.
These are first-class OPs.>>So that would include
all [inaudible] stuff, traditional methodologies, right?>>Yes, it should. Where it doesn't, we send a PR to add it to the ONNX spec. Yeah.>>Do you support R models, the R language?>>You mean, do we have language bindings for R?>>Yeah. Can I convert a model which is an R script to an ONNX model?>>Not today. So what kind of format is it produced in?>>It's a script, like a .R file.>>Okay.>>It's the R programming language.>>Okay.>>Statistical models.>>Yeah. We don't have support for this yet.>>So how much of the entire
design decision is Microsoft-led? I understand this is multi-party, so this is more about the social dynamics of running an open source project. Can you comment on that?>>So the social dynamics apply more to the ONNX spec and less to ONNX Runtime. ONNX Runtime is not in open governance mode yet.>>I see.>>Right? It is completely controlled by Microsoft. We definitely welcome external contributions; we actively review them and we merge them. We have actually received a lot of very useful external contributions for ONNX Runtime. But we are definitely the primary driver for ONNX Runtime, and for the most part, for ONNX as well.>>Not just specific to
Windows or [inaudible].>>No. One of the key goals is that we definitely want to be cross-platform, right?>>Yeah.>>We develop on Mac, Windows, as well as Linux. We had only Ubuntu 16 support until the previous release, and we wanted to go all the way back to CentOS 6, which is a really old version of Linux we didn't want to support. But we had customers who wanted that. So yeah. Cool. Thank you.
