How to return Struct from Hive UDF? (hadoop)



Here is a very simple example of such a UDF. It receives a User-Agent string, parses it using an external library, and returns a structure with four text fields:

STRUCT<type: string, family: string, os: string, device: string>

You need to extend the GenericUDF class and override its two most important methods: initialize and evaluate.

initialize() describes the structure itself and defines data types inside.

evaluate() fills up the structure with actual values.

You don't need any special classes to return a value: a struct<> in Hive is just an array of Java objects (Object[]).

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;

import eu.bitwalker.useragentutils.UserAgent;

public class UAStructUDF extends GenericUDF {

    private Object[] result;

    @Override
    public String getDisplayString(String[] arg0) {
        return "My display string";
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arg0) throws UDFArgumentException {
        // Define the field names for the struct<> and their types
        ArrayList<String> structFieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> structFieldObjectInspectors = new ArrayList<ObjectInspector>();

        // fill struct field names
        // type
        structFieldNames.add("type");
        structFieldObjectInspectors.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        // family
        structFieldNames.add("family");
        structFieldObjectInspectors.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        // OS name
        structFieldNames.add("os");
        structFieldObjectInspectors.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        // device
        structFieldNames.add("device");
        structFieldObjectInspectors.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);

        StructObjectInspector si = ObjectInspectorFactory.getStandardStructObjectInspector(structFieldNames,
                structFieldObjectInspectors);
        return si;
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
        if (args == null || args.length < 1) {
            throw new HiveException("args is empty");
        }
        if (args[0].get() == null) {
            throw new HiveException("args contains null instead of object");
        }
        Object argObj = args[0].get();

        // get argument
        String argument = null;
        if (argObj instanceof Text) {
            argument = ((Text) argObj).toString();
        } else if (argObj instanceof String) {
            argument = (String) argObj;
        } else {
            throw new HiveException("Argument is neither a Text nor String, it is a "
                    + argObj.getClass().getCanonicalName());
        }
        // parse UA string and return struct, which is just an array of objects: Object[]
        return parseUAString(argument);
    }

    private Object parseUAString(String argument) {
        result = new Object[4];
        UserAgent ua = new UserAgent(argument);
        result[0] = new Text(ua.getBrowser().getBrowserType().getName());
        result[1] = new Text(ua.getBrowser().getGroup().getName());
        result[2] = new Text(ua.getOperatingSystem().getName());
        result[3] = new Text(ua.getOperatingSystem().getDeviceType().getName());
        return result;
    }
}
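Once the class above is compiled into a jar, the UDF can be registered and called from HiveQL along these lines (the jar path, function name, and the table/column names are illustrative assumptions, not part of the original answer):

```sql
-- jar path, function name, table and column names are assumptions
ADD JAR /path/to/ua-struct-udf.jar;
CREATE TEMPORARY FUNCTION parse_ua AS 'UAStructUDF';

-- the struct fields returned by the UDF are accessed with dot notation
SELECT t.ua.type, t.ua.family, t.ua.os, t.ua.device
FROM (
    SELECT parse_ua(user_agent) AS ua
    FROM access_logs
) t;
```

Because initialize() declared the field names and their ObjectInspectors, Hive knows how to map each position of the returned Object[] onto the named struct fields in the query.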


There is a concept of a SerDe (serializer and deserializer) in Hive that can be used with the kind of data format you are working with. It serializes (complex) objects and then deserializes them according to the need. For instance, if you have a JSON file that contains objects and values, you need a way to store that content in Hive. For that you will use a JsonSerde, which is actually a jar file containing parser code written in Java for working with JSON data.

So now you have a jar (the SerDe), and the other requirement is a schema to store that data. For example, for XML files you need an XSD; similarly, for JSON you define the relations between objects, arrays, and structures. You can check this link: http://thornydev.blogspot.in/2013/07/querying-json-records-via-hive.html Please let me know if this helps and solves your purpose :)
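As a rough sketch of how this fits together, a table backed by a JSON SerDe might be declared like this (the SerDe class name is the one commonly used with the approach in the linked post; the jar path, table, and field names are illustrative assumptions):

```sql
-- jar path, table and field names are assumptions
ADD JAR /path/to/json-serde.jar;

CREATE TABLE json_events (
    id INT,
    device STRUCT<os: STRING, family: STRING>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

-- each line of the backing file is one JSON object, e.g.:
-- {"id": 1, "device": {"os": "Linux", "family": "Firefox"}}
```

The SerDe jar does the parsing; the CREATE TABLE statement supplies the schema that tells Hive how the JSON objects, arrays, and structs map onto columns.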