.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_convert_pipeline_vectorizer.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_plot_convert_pipeline_vectorizer.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_convert_pipeline_vectorizer.py:


Train, convert and predict with ONNX Runtime
============================================

This example demonstrates an end-to-end scenario. It starts by
training a scikit-learn pipeline whose input is not a regular
vector but a dictionary ``{ int: float }``, because its first step is a
`DictVectorizer `_.

.. contents::
    :local:

Train a pipeline
++++++++++++++++

The first step is to retrieve the Boston dataset.

.. GENERATED FROM PYTHON SOURCE LINES 22-32

.. code-block:: default

    import pandas
    from sklearn.datasets import load_boston
    boston = load_boston()
    X, y = boston.data, boston.target

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    # Drop the first column and turn every row into a dictionary
    # { column index: value }, the input format DictVectorizer expects.
    X_train_dict = pandas.DataFrame(X_train[:, 1:]).T.to_dict().values()
    X_test_dict = pandas.DataFrame(X_test[:, 1:]).T.to_dict().values()


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /home/runner/.local/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.

        The Boston housing prices dataset has an ethical problem. You can refer to
        the documentation of this function for further details.

        The scikit-learn maintainers therefore strongly discourage the use of this
        dataset unless the purpose of the code is to study and educate about
        ethical issues in data science and machine learning.

        In this special case, you can fetch the dataset from the original
        source::

            import pandas as pd
            import numpy as np

            data_url = "http://lib.stat.cmu.edu/datasets/boston"
            raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
            data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
            target = raw_df.values[1::2, 2]

        Alternative datasets include the California housing dataset (i.e.
        :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
        dataset. You can load the datasets as follows::

            from sklearn.datasets import fetch_california_housing
            housing = fetch_california_housing()

        for the California housing dataset and::

            from sklearn.datasets import fetch_openml
            housing = fetch_openml(name="house_prices", as_frame=True)

        for the Ames housing dataset.

      warnings.warn(msg, category=FutureWarning)


.. GENERATED FROM PYTHON SOURCE LINES 33-34

We create a pipeline.

.. GENERATED FROM PYTHON SOURCE LINES 34-44

.. code-block:: default

    from sklearn.pipeline import make_pipeline
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.feature_extraction import DictVectorizer
    pipe = make_pipeline(
        DictVectorizer(sparse=False),
        GradientBoostingRegressor())

    pipe.fit(X_train_dict, y_train)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Pipeline(steps=[('dictvectorizer', DictVectorizer(sparse=False)),
                    ('gradientboostingregressor', GradientBoostingRegressor())])


.. GENERATED FROM PYTHON SOURCE LINES 45-47

We compute the predictions on the test set
and print the coefficient of determination (R2 score).

.. GENERATED FROM PYTHON SOURCE LINES 47-52

.. code-block:: default

    from sklearn.metrics import r2_score

    pred = pipe.predict(X_test_dict)
    print(r2_score(y_test, pred))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    0.877658073886333


.. GENERATED FROM PYTHON SOURCE LINES 53-59

Conversion to ONNX format
+++++++++++++++++++++++++

We use the
`sklearn-onnx `_
module to convert the model into ONNX format.

.. GENERATED FROM PYTHON SOURCE LINES 59-69

.. code-block:: default

    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType, Int64TensorType, DictionaryType, SequenceType

    # The model input is a single dictionary mapping int64 keys to float values.
    initial_type = [('float_input', DictionaryType(Int64TensorType([1]), FloatTensorType([])))]
    onx = convert_sklearn(pipe, initial_types=initial_type)
    with open("pipeline_vectorize.onnx", "wb") as f:
        f.write(onx.SerializeToString())


.. GENERATED FROM PYTHON SOURCE LINES 70-72

We load the model with ONNX Runtime and look at
its input and output.

.. GENERATED FROM PYTHON SOURCE LINES 72-82

.. code-block:: default

    import onnxruntime as rt
    from onnxruntime.capi.onnxruntime_pybind11_state import InvalidArgument

    sess = rt.InferenceSession("pipeline_vectorize.onnx",
                               providers=rt.get_available_providers())

    import numpy
    inp, out = sess.get_inputs()[0], sess.get_outputs()[0]
    print("input name='{}' and shape={} and type={}".format(inp.name, inp.shape, inp.type))
    print("output name='{}' and shape={} and type={}".format(out.name, out.shape, out.type))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    input name='float_input' and shape=[] and type=map(int64,tensor(float))
    output name='variable' and shape=[None, 1] and type=tensor(float)


.. GENERATED FROM PYTHON SOURCE LINES 83-85

We compute the predictions.
We first try to do that in a single call:

.. GENERATED FROM PYTHON SOURCE LINES 85-91

.. code-block:: default

    try:
        pred_onx = sess.run([out.name], {inp.name: X_test_dict})[0]
    except (RuntimeError, InvalidArgument) as e:
        print(e)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: ((seq(map(int64,tensor(float))))) , expected: ((map(int64,tensor(float))))


.. GENERATED FROM PYTHON SOURCE LINES 92-94

The call fails because, when the first step is a DictVectorizer,
ONNX Runtime expects one observation at a time.

.. GENERATED FROM PYTHON SOURCE LINES 94-96

.. code-block:: default

    # Run the session once per observation and keep the scalar prediction.
    pred_onx = [sess.run([out.name], {inp.name: row})[0][0, 0] for row in X_test_dict]


.. GENERATED FROM PYTHON SOURCE LINES 97-98

We compare them to the predictions of the original model.

.. GENERATED FROM PYTHON SOURCE LINES 98-100

.. code-block:: default

    print(r2_score(pred, pred_onx))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    0.9999999999999737


.. GENERATED FROM PYTHON SOURCE LINES 101-103

They are very similar. *ONNX Runtime* uses floats instead of doubles,
which explains the small discrepancies.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 1.113 seconds)


.. _sphx_glr_download_auto_examples_plot_convert_pipeline_vectorizer.py:

.. only :: html

  .. container:: sphx-glr-footer
     :class: sphx-glr-footer-example

     .. container:: sphx-glr-download sphx-glr-download-python

        :download:`Download Python source code: plot_convert_pipeline_vectorizer.py `

     .. container:: sphx-glr-download sphx-glr-download-jupyter

        :download:`Download Jupyter notebook: plot_convert_pipeline_vectorizer.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_