Custom transformer for sklearn Pipeline that alters both X and y

python numpy machine-learning scikit-learn data-analysis

Modifying the sample axis, e.g. removing samples, does not (yet?) comply with the scikit-learn transformer API. So if you need to do this, do it outside any calls to scikit-learn, as preprocessing.

As it is now, the transformer API is used to transform the features of a given sample into something new. This can implicitly contain information from other samples, but samples are never deleted.
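To illustrate the point about the current API, here is a minimal sketch of a custom transformer. The class name `MeanCenterer` is made up for this example. It learns column means from all samples in `fit`, yet `transform` returns exactly one output row per input row and never returns `y`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical example transformer: subtracts per-column means."""

    def fit(self, X, y=None):
        # Statistics are computed across all samples ...
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # ... but the output has the same number of rows as the input.
        return np.asarray(X, dtype=float) - self.mean_


X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_centered = MeanCenterer().fit_transform(X)
```

Because `transform` only receives and returns X, there is no place in this interface to hand back a shortened y.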

Another option is to attempt to impute the missing values. But again, if you need to delete samples, treat it as preprocessing before using scikit-learn.
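As a rough sketch of the preprocessing approach, rows containing NaN can be removed from X and y together with a boolean mask before anything is passed to scikit-learn. The arrays below are made-up data for illustration.

```python
import numpy as np

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([0, 1, 0, 1])

# True for rows that contain no NaN in any feature
keep = ~np.isnan(X).any(axis=1)

# Apply the same mask to both arrays so they stay aligned
X_clean = X[keep]
y_clean = y[keep]
```

The cleaned `X_clean` and `y_clean` can then be passed to a Pipeline's `fit` as usual.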


You can solve this easily by using the sklearn.preprocessing.FunctionTransformer class (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html)

You just need to put your alterations to X in a function

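import pandas as pd

# thresh (maximum number of NaNs allowed per row) must be defined outside this function
def drop_nans(X, y=None):
    total = X.shape[1]
    # dropna's thresh is the minimum number of non-NaN values a row needs to be kept
    new_thresh = total - thresh
    df = pd.DataFrame(X)
    df.dropna(thresh=new_thresh, inplace=True)
    return df.values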

then you get your transformer by calling

transformer = FunctionTransformer(drop_nans, validate=False)

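which you can use in the pipeline. The threshold can be set outside the drop_nans function. Note that this only changes X: the Pipeline passes y on unchanged, so if rows are dropped in a supervised pipeline, X and y no longer have the same length.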
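A small sketch of the transformer applied on its own follows. As a variation on the code above, the threshold is passed through FunctionTransformer's `kw_args` parameter instead of a global variable; the default value and the sample array are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def drop_nans(X, y=None, thresh=0):
    # thresh: maximum number of NaNs tolerated in a row (illustrative default)
    min_non_nan = X.shape[1] - thresh
    return pd.DataFrame(X).dropna(thresh=min_non_nan).values


transformer = FunctionTransformer(drop_nans, validate=False, kw_args={"thresh": 1})

X = np.array([
    [1.0, 2.0, 3.0],        # no NaN: kept
    [np.nan, 5.0, 6.0],     # one NaN: kept, within the threshold
    [np.nan, np.nan, 9.0],  # two NaNs: dropped
])
X_out = transformer.fit_transform(X)
```

Passing the threshold via `kw_args` keeps the function free of global state, which can make the transformer easier to reuse with different settings.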


@eickenberg's answer is the proper and clean one. Nevertheless, I like to keep everything in one Pipeline, so if you are interested, I created a library (not yet deployed on PyPI) that allows applying transformations to y:

https://gitlab.com/thibaultB/transformers/

Usage is the following:

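import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# SplitXY, SklearnPandasWrapper and EstimatorWithoutYWrapper come from the library linked above

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
df.columns = ["a", "b", "target"]

spliter = SplitXY("target")  # Create a new step and give it the name of the target column

pipe = Pipeline([
    ("imputer", SklearnPandasWrapper(KNNImputer())),
    ("spliter", spliter),
    ("scaler", StandardScaler()),
    # EstimatorWithoutYWrapper wraps RandomForestRegressor so that it gets y
    # from spliter just before calling fit or transform
    ("rf", EstimatorWithoutYWrapper(RandomForestRegressor(random_state=45), spliter)),
])

pipe.fit(df)
res = pipe.predict(df)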

Using this code, you can alter the number of rows as long as every transformer that modifies the number of rows is placed before the SplitXY transformer. Transformers placed before SplitXY must preserve column names, which is why I also added SklearnPandasWrapper: it wraps a sklearn transformer (which usually returns a numpy array) so that the column names are kept.
