How to ensure that a python function generates its output based only on its input?



The answer to this is no. What you are looking for is a function that tests for functional purity. But, as the code below demonstrates, there is no way to guarantee that a function has no side effects.

class Foo(object):
    def __init__(self, x):
        self.x = x
    def __add__(self, y):
        print("HAHAHA evil side effects here...")
        # proceed to read a file and do stuff
        return self

# this looks pure...
def f(x): return x + 1

# but really...
>>> f(Foo(1))
HAHAHA evil side effects here...

Because objects can redefine their behavior so comprehensively (field access, calling, operator overloading, etc.), you can always pass an input that makes a seemingly pure function impure. Therefore the only functions guaranteed to be pure are those that literally do nothing with their arguments, a class of functions that is generally not very useful.

Of course, if you can specify other restrictions, this becomes easier.
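As one example of such a restriction, a function could refuse any argument whose exact type is not a known built-in, which rules out objects like Foo above. This is only a minimal sketch in Python 3; the names PLAIN_TYPES, is_plain and plain_inputs_only are hypothetical, and the check says nothing about side effects inside the function body itself.

```python
import functools

# Assumption: only these exact built-in types count as "plain" inputs.
PLAIN_TYPES = (int, float, str, bytes, bool, type(None))

def is_plain(value):
    """Return True for an exact plain type, or a tuple of plain values."""
    if type(value) is tuple:
        return all(is_plain(item) for item in value)
    # type(...) rather than isinstance, so that subclasses are rejected too
    return type(value) in PLAIN_TYPES

def plain_inputs_only(func):
    """Hypothetical decorator: reject arguments that could override operators."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for value in list(args) + list(kwargs.values()):
            if not is_plain(value):
                raise TypeError("unsupported argument type: %s" % type(value).__name__)
        return func(*args, **kwargs)
    return wrapper

@plain_inputs_only
def f(x):
    return x + 1
```

With this in place, f(1) still returns 2, while an instance of a user-defined class (or even of a subclass of int) is rejected with a TypeError before its __add__ can run.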


The restrictions you require can be circumvented even if you remove all modules and all functions. Code can still get access to files if it can use the attributes of an arbitrary simple object, e.g. the number zero.

(0).__class__.__base__.__subclasses__()[40]('/etc/pas'+'swd')

The index 40 depends on the interpreter build, although it is typical for Python 2.7. The index of the subclass <type 'file'> can easily be found without hard-coding it:

[x for x in (1).__class__.__base__.__subclasses__()if'fi'+'le'in'%s'%x][0]( '/etc/pas'+'swd')
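Python 3 has no file type, so the two lines above do not work there as written, but the same attribute walk from a literal number still reaches every class that is a direct subclass of object. A minimal sketch, where Marker and find_class are hypothetical names used only for illustration:

```python
class Marker(object):
    """Hypothetical stand-in for any class loaded in the interpreter."""

def find_class(name):
    # Walk from the literal 0 to its class (int) and on to the base
    # class (object), then list the direct subclasses of object.
    root = (0).__class__.__base__
    return [cls for cls in root.__subclasses__() if cls.__name__ == name]

print(Marker in find_class("Marker"))  # prints True
```

No import and no built-in function name is needed to reach the class, which is why removing modules and functions from the namespace is not enough.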

Any combination of whitelist and blacklist is insecure, too restrictive, or both. The PyPy sandbox is robust because its principle allows no compromise:

... This subprocess can run arbitrary untrusted Python code, but all its input/output is serialized to a stdin/stdout pipe instead of being directly performed. The outer process reads the pipe and decides which commands are allowed or not (sandboxing), or even reinterprets them differently...

A solution based on the seccomp kernel feature can also be secure enough. (blog)


I want to be sure that in future the function will generate the same output as today.

It is easy to write a function whose results are hard to reproduce, and this cannot easily be prevented:

class A(object):
    "This can be any very simple class"
    def __init__(self, x):
        self.x = x
    def __repr__(self):
        return repr(self.x)

def strange_function():
    # You probably get a different result every time.
    return list(set(A(i) for i in range(20)))

>>> strange_function()
[1, 18, 12, 5, 16, 15, 8, 2, 14, 0, 6, 19, 13, 11, 10, 9, 17, 3, 7, 4]
>>> strange_function()
[0, 9, 14, 3, 17, 5, 6, 11, 8, 1, 15, 7, 12, 13, 2, 10, 16, 4, 19, 18]

... Even if you remove everything that depends on time, the random number generator, hash-based ordering, etc., it is still easy to write a function that sometimes exceeds the available memory or a timeout limit and sometimes returns a result.
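The hash-order source of variation can be made visible by running the same snippet in fresh interpreters. The sketch below assumes Python 3, where string hashing is controlled by the PYTHONHASHSEED environment variable; SNIPPET, run_with_seed and outputs_for_seeds are hypothetical names, and the snippet uses strings rather than the class A above, whose ordering comes from object addresses instead.

```python
import os
import subprocess
import sys

# Snippet whose printed order depends on the iteration order of a set of strings.
SNIPPET = "print(list(set(str(i) for i in range(20))))"

def run_with_seed(seed):
    """Run SNIPPET in a fresh interpreter with a fixed PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    output = subprocess.check_output([sys.executable, "-c", SNIPPET], env=env)
    return output.decode().strip()

def outputs_for_seeds(seeds):
    """Map each seed to the output produced under that seed."""
    return {seed: run_with_seed(seed) for seed in seeds}
```

Running with the same seed twice gives the same text, while different seeds will usually give different orderings. Pinning the seed therefore removes only this one source of variation, not memory or timeout effects.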


EDIT:
Roman, you wrote recently that you are sure you can trust the user. Then a realistic solution exists: verify the input to and the output from a function by recording them to a file and verifying them on a virtual machine running a remote IPython notebook. (There is a nice short tutorial video. Remote computing is supported out of the box, and the backend service can be restarted from the web document menu in the browser in one second, without losing the data (input/output) in the notebook. The notebook is an HTML document created dynamically, step by step, as your activity triggers the JavaScript that calls the remote backend.)

You need not be interested in internal calls, only in the global input and output, until you find a difference. The virtual machine should be able to verify the results independently and reproducibly. Configure the firewall so that the machine accepts connections from you but cannot initiate outgoing connections. Configure the filesystem so that the current user cannot save any data, so that nothing is present except software components. Disable database services. Verify the recorded input/output pairs in a random order, or start two IPython notebook services on different ports and select a random backend for every command line in the notebook, or restart the backend process frequently before anything important. If you find a difference, debug your code and fix it.
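The recording and verification step described above could look roughly like the following. This is a minimal sketch in Python 3, not a complete setup: record_call and verify_call are hypothetical helpers, and it assumes that the arguments and the result can be represented as JSON (tuples, for example, come back as lists).

```python
import json

def record_call(func, args, path):
    """Call func(*args) and store the input and the output as JSON in path."""
    result = func(*args)
    with open(path, "w") as handle:
        json.dump({"args": list(args), "result": result}, handle)
    return result

def verify_call(func, path):
    """Re-run func on the recorded input; return True if the output matches."""
    with open(path) as handle:
        record = json.load(handle)
    return func(*record["args"]) == record["result"]
```

The idea would be to run record_call on your own machine, copy the file to the isolated virtual machine, and run verify_call there; a False result is the "difference" that tells you to start debugging.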

Once you no longer need interactivity, you can automate this without the notebook, using only IPython remote computing.


What you want is called sandboxing or restricted Python.

Both are mostly dead.

The closest thing to a working solution today is http://pypy.readthedocs.org/en/latest/sandbox.html; note, however, that the newest build is actually 3 years old.