What does code generation mean in avro - hadoop What does code generation mean in avro - hadoop hadoop hadoop

What does code generation mean in avro - hadoop


It's not a silly question -- it's actually a very important aspect of Avro.

With code-generation usually means that before compiling your Java application, you have an Avro schema available. You, as a developer, will use an Avro compiler to generate a class for each record in the schema and you use these classes in your application.

In the referenced link, the author does this: java -jar avro-tools-1.7.5.jar compile schema student.avsc, and then uses the student_marks class directly.

In this case, each instance of the class student_marks inherits from SpecificRecord, with custom methods for accessing the data inside (such as getStudentId() to fetch the student_id field).

Without code-generation usually means that your application doesn't have any specific necessary schema (for example, it can treat different kinds of data).

In this case, there's no student class generated, but you can still read Avro records in an Avro container. You won't have instances of student, but instances of GenericRecord. There won't be any helpful methods like getStudentId(), but you can use methods get("student_marks") or get(0).

Often, using specific records with code generation is easier to read, easier to serialize and deserialize, but generic records offer more flexibility when the exact schema of the records you want to process isn't known at compile time.

A helpful way to think of it is the difference between storing some data in a helpful handwritten POJO structure versus an Object[]. The former is much easier to develop with, but the latter is necessary if the types and quantity of data are dynamic or unknown.