
The Complete Apache Avro Schema Guide — Structure, Data Types, Java Processing, and NiFi Issues

A comprehensive overview of Avro Schema: purpose and structure, Primitive/Complex/Logical data types, null handling, Java serialization/deserialization, and NiFi integration pitfalls.

Data Dynamics | April 12, 2026 | 12 min read

Apache Avro is a row-based data serialization framework born in the Hadoop ecosystem. Because the schema is stored alongside the data, producers and consumers can evolve independently, and binary encoding provides advantages in both size and speed compared to JSON/CSV. This post systematically covers Avro Schema.

1. Purpose of Avro Schema

Data serialization/deserialization: Fast and compact data exchange in binary format
Schema evolution: Forward/backward compatibility when adding or removing fields
Hadoop ecosystem standard format: Native support in Hive, Spark, NiFi, Kafka, Flink, and more
Schema Registry integration: Centralized schema management when combined with Confluent Schema Registry and similar tools

2. Structure of Avro Schema

Avro Schema is defined in JSON format.

{
  "type": "record",
  "name": "User",
  "namespace": "io.datadynamics.example",
  "doc": "A record representing user information",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}

Key	Description
`type`	Top level is typically `"record"`
`name`	Schema (record) name
`namespace`	Namespace to prevent name collisions, similar to Java packages
`doc`	Schema description (optional)
`fields`	Array of fields. Each field has `name`, `type`, and optionally `default`, `doc`, `order`

3. Primitive Data Types

Avro Type	Encoded Size	Java Mapping	Description
`null`	0 bytes	`null`	No value
`boolean`	1 byte	`boolean`	true / false
`int`	1-5 bytes (variable-length zig-zag)	`int`	32-bit signed integer
`long`	1-10 bytes (variable-length zig-zag)	`long`	64-bit signed integer
`float`	4 bytes	`float`	32-bit IEEE 754 floating point
`double`	8 bytes	`double`	64-bit IEEE 754 floating point
`bytes`	variable (length prefix + data)	`ByteBuffer`	Arbitrary byte sequence
`string`	variable (length prefix + data)	`CharSequence` (`Utf8` by default) / `String`	UTF-8 string
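
As a rough illustration of how integer values are sized on the wire: the Avro specification writes int and long with variable-length zig-zag encoding, so small magnitudes take fewer bytes than the Java in-memory type. The sketch below reimplements that encoding with the JDK only. The class and method names are illustrative and are not part of the Avro API.

```java
import java.io.ByteArrayOutputStream;

class ZigZagVarint {

    // Zig-zag maps signed values to unsigned ones (0, -1, 1, -2 become 0, 1, 2, 3),
    // then the result is written 7 bits at a time, least significant group first.
    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // high bit set: more bytes follow
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long[] samples = {0L, -1L, 64L, Integer.MAX_VALUE, Long.MIN_VALUE};
        for (long sample : samples) {
            System.out.println(sample + " -> " + encodeLong(sample).length + " byte(s)");
        }
    }
}
```

Small identifiers and counters therefore cost one or two bytes each, while values near the limits of long need the full ten.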

4. Logical Data Types — Dates and Times

Avro extends the meaning of Primitive types by adding logicalType annotations on top of them. Date/time-related Logical Types are particularly diverse.

4.1 Date

{"name": "birth_date", "type": {"type": "int", "logicalType": "date"}}

Base type: int
Meaning: Number of days since 1970-01-01
Example: 19825 → 2024-04-12
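
A minimal sketch of the day-count conversion using only java.time, with no Avro dependency. The class and method names are made up for illustration.

```java
import java.time.LocalDate;

class AvroDateDemo {

    // An Avro date is the number of days since 1970-01-01, stored as an int.
    static int toAvroDate(LocalDate date) {
        return Math.toIntExact(date.toEpochDay());
    }

    static LocalDate fromAvroDate(int daysSinceEpoch) {
        return LocalDate.ofEpochDay(daysSinceEpoch);
    }

    public static void main(String[] args) {
        int days = toAvroDate(LocalDate.of(2024, 4, 12));
        System.out.println(days);               // 19825
        System.out.println(fromAvroDate(days)); // 2024-04-12
    }
}
```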

4.2 Time

logicalType	Base Type	Precision	Example Value
`time-millis`	`int`	milliseconds	`43200000` → 12:00:00.000
`time-micros`	`long`	microseconds	`43200000000` → 12:00:00.000000

{"name": "login_time", "type": {"type": "int", "logicalType": "time-millis"}}
{"name": "precise_time", "type": {"type": "long", "logicalType": "time-micros"}}
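
The two time types differ only in the unit counted since midnight. A small sketch with java.time shows both values for noon; the helper names are illustrative.

```java
import java.time.LocalTime;

class AvroTimeDemo {

    // time-millis: milliseconds after midnight, fits in an int.
    static int toTimeMillis(LocalTime time) {
        return Math.toIntExact(time.toNanoOfDay() / 1_000_000L);
    }

    // time-micros: microseconds after midnight, needs a long.
    static long toTimeMicros(LocalTime time) {
        return time.toNanoOfDay() / 1_000L;
    }

    static LocalTime fromTimeMillis(int millisOfDay) {
        return LocalTime.ofNanoOfDay(millisOfDay * 1_000_000L);
    }

    public static void main(String[] args) {
        LocalTime noon = LocalTime.of(12, 0);
        System.out.println(toTimeMillis(noon));       // 43200000
        System.out.println(toTimeMicros(noon));       // 43200000000
        System.out.println(fromTimeMillis(43200000)); // 12:00
    }
}
```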

4.3 Timestamp

logicalType	Base Type	Precision	Timezone
`timestamp-millis`	`long`	milliseconds	UTC
`timestamp-micros`	`long`	microseconds	UTC
`local-timestamp-millis`	`long`	milliseconds	Local (no timezone info)
`local-timestamp-micros`	`long`	microseconds	Local (no timezone info)

{"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
{"name": "event_ts", "type": {"type": "long", "logicalType": "timestamp-micros"}}
{"name": "local_created", "type": {"type": "long", "logicalType": "local-timestamp-millis"}}
{"name": "local_event", "type": {"type": "long", "logicalType": "local-timestamp-micros"}}

The difference between timestamp-millis and timestamp-micros is precision. millis uses 1713945600000 (13 digits), micros uses 1713945600000000 (16 digits). Choose based on the precision of your data source. For example, an RDBMS TIMESTAMP(6) has microsecond precision, so timestamp-micros is appropriate.

local-timestamp-* was added in Avro 1.10. Use it when you want to store the wall-clock time "as-is" without timezone conversion.
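
To make the four variants concrete, here is a sketch that derives the raw long values with java.time. It assumes the local-timestamp value is computed by treating the wall-clock time as if it were UTC, which matches the "no timezone conversion" description above. The class and method names are illustrative.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

class AvroTimestampDemo {

    // timestamp-millis: milliseconds since 1970-01-01T00:00:00Z.
    static long toTimestampMillis(Instant instant) {
        return instant.toEpochMilli();
    }

    // timestamp-micros: microseconds since 1970-01-01T00:00:00Z.
    static long toTimestampMicros(Instant instant) {
        return ChronoUnit.MICROS.between(Instant.EPOCH, instant);
    }

    // local-timestamp-millis: the wall-clock reading counted from
    // 1970-01-01T00:00:00 with no zone applied.
    static long toLocalTimestampMillis(LocalDateTime wallClock) {
        return wallClock.toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    public static void main(String[] args) {
        Instant instant = Instant.parse("2024-04-24T08:00:00Z");
        System.out.println(toTimestampMillis(instant)); // 1713945600000
        System.out.println(toTimestampMicros(instant)); // 1713945600000000
        System.out.println(toLocalTimestampMillis(LocalDateTime.of(2024, 4, 24, 8, 0)));
    }
}
```

A local timestamp of 08:00 yields the same number on every server, whereas a UTC timestamp built from that wall-clock reading depends on the zone used to interpret it.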

4.4 Other Logical Types

logicalType	Base Type	Description
`decimal`	`bytes` or `fixed`	Fixed-point decimal. `precision` and `scale` are required
`uuid`	`string`	RFC 4122 UUID
`duration`	`fixed(12)`	months(4) + days(4) + millis(4)

{"name": "price", "type": {"type": "bytes", "logicalType": "decimal", "precision": 10, "scale": 2}}
{"name": "uuid", "type": {"type": "string", "logicalType": "uuid"}}
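
For the decimal type, the Avro specification stores the unscaled integer as big-endian two's-complement bytes, and the scale lives only in the schema. The sketch below shows that mapping with BigDecimal and BigInteger. It is not the Avro library's own conversion class, and the names are illustrative.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

class AvroDecimalBytesDemo {

    // Encode with the schema's scale. UNNECESSARY makes setScale throw an
    // ArithmeticException instead of silently rounding a value with more digits.
    static byte[] encode(BigDecimal value, int scale) {
        return value.setScale(scale, RoundingMode.UNNECESSARY)
                    .unscaledValue()
                    .toByteArray();
    }

    // Decoding needs the scale from the schema; the bytes do not carry it.
    static BigDecimal decode(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new BigDecimal("123.45"), 2); // unscaled value 12345
        System.out.println(encoded.length);                   // 2
        System.out.println(decode(encoded, 2));               // 123.45
    }
}
```

Because the scale is not in the payload, a reader that assumes a different scale decodes a different number from the same bytes.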

5. Complex Data Types

5.1 Array

{
  "name": "tags",
  "type": {
    "type": "array",
    "items": "string"
  }
}

items can be any type. Nested records are also possible:

{
  "name": "addresses",
  "type": {
    "type": "array",
    "items": {
      "type": "record",
      "name": "Address",
      "fields": [
        {"name": "city", "type": "string"},
        {"name": "zipcode", "type": "string"}
      ]
    }
  }
}

5.2 Map

Keys are always string, and the value type is specified with values.

{
  "name": "metadata",
  "type": {
    "type": "map",
    "values": "string"
  }
}

5.3 Enum

{
  "name": "status",
  "type": {
    "type": "enum",
    "name": "Status",
    "symbols": ["ACTIVE", "INACTIVE", "SUSPENDED"]
  }
}

5.4 Fixed

A fixed-length byte array.

{
  "name": "md5",
  "type": {
    "type": "fixed",
    "name": "MD5",
    "size": 16
  }
}

5.5 Union

Allows one of several types. Essential for creating nullable fields.

{"name": "middle_name", "type": ["null", "string"], "default": null}

6. The Difference Between Nullable and Non-Nullable Fields

This is the most common source of mistakes in Avro schemas.

Non-nullable field

{"name": "name", "type": "string"}

A value is mandatory.
If the value is null, serialization throws an exception
When adding this field in schema evolution, previous data cannot be read (no default exists)

Nullable field

{"name": "name", "type": ["null", "string"], "default": null}

If there's no value, it's stored as null
In the union notation ["null", "string"], the first type must match the type of the default
When adding this field in schema evolution, previous data can still be read (filled with null as default)

Order matters

// Correct: default is null, so "null" comes first
{"name": "email", "type": ["null", "string"], "default": null}
 
// Wrong: "string" is first but default is null
{"name": "email", "type": ["string", "null"], "default": null}  // Error!

Practical recommendation: Always use the ["null", "actual_type"] + "default": null pattern for optional fields. This guarantees schema evolution compatibility.

Schema Evolution Compatibility Comparison

Change	Non-nullable	Nullable (default: null)
Add new field → read previous data	Fails if no default	Succeeds, filled with null
Remove field → read new data	Ignored	Ignored
Writer/Reader schema mismatch	Fails strictly	Handles flexibly

7. Avro Serialization / Deserialization in Java

7.1 Maven dependency

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.11.3</version>
</dependency>

7.2 Schema definition and GenericRecord approach

This approach handles things dynamically without code generation.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import java.io.File;
 
public class AvroGenericExample {
 
    public static void main(String[] args) throws Exception {
        // 1. Define schema
        String schemaJson = """
            {
              "type": "record",
              "name": "User",
              "namespace": "io.datadynamics.example",
              "fields": [
                {"name": "id", "type": "long"},
                {"name": "name", "type": "string"},
                {"name": "email", "type": ["null", "string"], "default": null},
                {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
              ]
            }
            """;
        Schema schema = new Schema.Parser().parse(schemaJson);
 
        // 2. Serialize — Write Avro file
        File file = new File("users.avro");
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(writer)) {
            fileWriter.create(schema, file);
 
            GenericRecord user = new GenericData.Record(schema);
            user.put("id", 1L);
            user.put("name", "John Doe");
            user.put("email", "john@example.com");
            user.put("created_at", System.currentTimeMillis());
            fileWriter.append(user);
 
            GenericRecord user2 = new GenericData.Record(schema);
            user2.put("id", 2L);
            user2.put("name", "Jane Smith");
            user2.put("email", null);  // nullable field
            user2.put("created_at", System.currentTimeMillis());
            fileWriter.append(user2);
        }
 
        // 3. Deserialize — Read Avro file
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        try (DataFileReader<GenericRecord> fileReader = new DataFileReader<>(file, reader)) {
            while (fileReader.hasNext()) {
                GenericRecord record = fileReader.next();
                System.out.println("ID: " + record.get("id")
                    + ", Name: " + record.get("name")
                    + ", Email: " + record.get("email")
                    + ", Created: " + record.get("created_at"));
            }
        }
    }
}

7.3 Code generation (SpecificRecord) approach

Use avro-maven-plugin to auto-generate Java classes from .avsc files.

<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.11.3</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals><goal>schema</goal></goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>

Using the generated classes provides type safety:

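import java.time.Instant;

// Using the auto-generated User class.
// With Avro 1.11.x, classes generated for timestamp-millis fields use
// java.time.Instant, so setCreatedAt takes an Instant rather than a long.
User user = User.newBuilder()
    .setId(1L)
    .setName("John Doe")
    .setEmail("john@example.com")
    .setCreatedAt(Instant.now())
    .build();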

7.4 Serialization to byte array (for Kafka, etc.)

Use this when converting to a byte array instead of a file.

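import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import java.io.ByteArrayOutputStream;

// `schema` and `record` are the Schema and GenericRecord built in section 7.2.
// Unlike an Avro file, the raw bytes carry no schema, so the reader must
// obtain the writer schema by other means (for example, a Schema Registry).

// Serialize
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
writer.write(record, encoder);
encoder.flush();  // push buffered bytes into `out` before reading it
byte[] bytes = out.toByteArray();

// Deserialize
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
GenericRecord result = reader.read(null, decoder);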

8. Advantages of Avro Files

Advantage	Description
Embedded schema	Schema is included in the file header, eliminating the need for separate schema management
Compact binary	Often 50-80% smaller than JSON because field names are not repeated in each record; the actual saving depends on the data
Schema evolution	Forward/Backward/Full compatibility support
Fast serialization	Performance comparable to Protobuf
Compression support	Built-in block compression with Snappy, Deflate, Zstandard
Splittable	Can be split by block for parallel processing on HDFS (MapReduce, Spark)
Multi-language support	Libraries available for Java, Python, C, C++, Go, and more
Hadoop ecosystem integration	Native support in Hive, Pig, Spark, Kafka, NiFi, and more

9. Disadvantages and Caveats of Avro Files

9.1 Not human-readable

As a binary format, contents cannot be inspected with tools like cat or less. Debugging requires JSON conversion with avro-tools.

java -jar avro-tools-1.11.3.jar tojson users.avro

9.2 Schema evolution pitfalls

Field renaming breaks compatibility. Avro matches fields by name, so a rename is treated as a delete plus an add
Be careful with Union type changes: ["null", "string"] → ["null", "int"] is not compatible
Adding a Non-nullable field without a default makes previous data unreadable
aliases declared in the reader schema can map an old field name to the new one, but every consumer must be updated to a schema that declares the alias

9.3 Dangers of Enum changes

Deleting an existing symbol from an Enum makes previously written data containing that symbol fail to deserialize, unless the reader's enum declares a default symbol (supported since Avro 1.9). As a rule, only add Enum symbols and never delete them.

9.4 Precision issues with decimal type

bytes-based decimal has variable length, requiring additional processing for sorting
fixed-based decimal requires calculating the exact size for the precision
Precision/scale mismatches when integrating with Hive/Spark can cause data truncation or exceptions
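
The size calculation for a fixed-based decimal can be sketched as follows: find the smallest number of bytes whose signed two's-complement range holds the largest unscaled value with the given precision, that is, 10^precision - 1. The helper name is hypothetical. This is a JDK-only calculation, not an Avro API call.

```java
import java.math.BigInteger;

class DecimalFixedSize {

    // Smallest byte count n such that 10^precision - 1 fits in n signed bytes.
    static int minFixedSize(int precision) {
        if (precision < 1) {
            throw new IllegalArgumentException("precision must be positive");
        }
        BigInteger maxUnscaled = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
        // bitLength() excludes the sign bit, so one extra bit is required:
        // ceil((bitLength + 1) / 8) equals bitLength / 8 + 1 in integer arithmetic.
        return maxUnscaled.bitLength() / 8 + 1;
    }

    public static void main(String[] args) {
        for (int precision : new int[] {9, 10, 18, 38}) {
            System.out.println("precision " + precision + " -> "
                + minFixedSize(precision) + " bytes");
        }
    }
}
```

For the `precision: 10` field used earlier in this post, the result is 5 bytes. Declaring a smaller fixed size would leave some valid values unrepresentable.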

9.5 Large collection performance

If an Array or Map contains tens of thousands or more elements, the serialization/deserialization cost for a single record spikes. In such cases, normalizing into separate records is better.

9.6 Timestamp precision mismatch

If the source system has nanosecond precision but the Avro schema uses timestamp-millis, data loss occurs. Conversely, storing millisecond data as timestamp-micros results in unnecessarily large values. Always verify the source's precision first.
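
The loss is easy to reproduce with java.time alone. The sketch below pushes a nanosecond-precision Instant through an epoch-millisecond long, the representation behind timestamp-millis, and compares the result with the source. The sample timestamp and names are illustrative.

```java
import java.time.Instant;

class TimestampTruncationDemo {

    // Simulates writing to and reading back from a timestamp-millis field.
    static Instant roundTripThroughMillis(Instant source) {
        return Instant.ofEpochMilli(source.toEpochMilli());
    }

    public static void main(String[] args) {
        Instant source = Instant.parse("2026-04-12T15:30:00.123456789Z");
        Instant restored = roundTripThroughMillis(source);

        System.out.println(source);                  // 2026-04-12T15:30:00.123456789Z
        System.out.println(restored);                // 2026-04-12T15:30:00.123Z
        System.out.println(source.equals(restored)); // false
    }
}
```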

9.7 Writer/Reader schema mismatch

Avro allows the writer schema and reader schema to differ, but incompatible changes cause an AvroTypeException at runtime. Using a Schema Registry to validate compatibility in advance is recommended.

10. Issues When Using Avro Schema with NiFi

10.1 AvroSchemaRegistry and schema text

When registering schemas in NiFi's AvroSchemaRegistry, JSON string escaping issues frequently occur. Pay special attention to line breaks getting corrupted or quotes being converted during copy/paste in the UI.

10.2 Null handling in ConvertRecord

When converting CSV to Avro with the ConvertRecord processor, CSV empty strings ("") are handled differently from Avro null.

When an empty string arrives for a field defined as ["null", "string"], it's stored as "" rather than null
Conversely, if the CSV column is completely missing and the schema is Non-nullable, the conversion fails

10.3 Timestamp conversion issues

When using ConvertRecord or UpdateRecord in NiFi:

Conversion between source data date formats (e.g., "2026-04-12 15:30:00") and Avro's timestamp-millis (epoch milliseconds) may not happen automatically
You may need to handle epoch conversion manually with RecordPath or a separate UpdateAttribute
Timezone settings follow the JVM default, so results may vary across servers
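
The timezone dependence can be reproduced outside NiFi. This plain-Java sketch parses the sample string above with an explicit zone instead of the JVM default and shows that the resulting epoch milliseconds differ by the zone offset. The pattern, zones, and names are illustrative assumptions, and the code does not use any NiFi API.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class ExplicitZoneTimestampDemo {

    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // The text has no zone, so the caller must state which zone it was written in.
    static long toEpochMillis(String text, ZoneId zone) {
        return LocalDateTime.parse(text, FORMAT).atZone(zone).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        String text = "2026-04-12 15:30:00";
        long asUtc = toEpochMillis(text, ZoneOffset.UTC);
        long asSeoul = toEpochMillis(text, ZoneId.of("Asia/Seoul"));

        System.out.println(asUtc);           // 1776007800000
        System.out.println(asSeoul);         // 1775975400000
        System.out.println(asUtc - asSeoul); // 32400000 (9 hours)
    }
}
```

Stating the zone explicitly wherever the string is parsed removes the dependence on each server's default.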

10.4 FlowFile compatibility during schema evolution

When you change an Avro schema mid-pipeline in NiFi:

FlowFiles already queued are serialized with the previous schema
When a processor with the changed schema reads these FlowFiles, AvroTypeException or missing fields occur
Drain the queue before changing the schema, or use the "Embedded Avro Schema" option in the Reader

10.5 Memory issues with large Avro records

NiFi's Record-based processors generally read a FlowFile record by record, but each record must still fit in memory, and some processors buffer many records at once. If Avro records contain large bytes fields or Arrays with tens of thousands of elements, NiFi node heap memory can be exhausted.

10.6 Schema Access Strategy configuration errors

When the Schema Access Strategy setting in NiFi's Record Reader/Writer doesn't match, schemas cannot be found.

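Use Embedded Avro Schema: Uses the schema from the Avro file header
Use 'Schema Name' Property: Looks up the schema by name in the configured Schema Registry (for example, AvroSchemaRegistry)
Use 'Schema Text' Property: Uses the schema JSON entered directly in the Reader/Writer's Schema Text property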

Mixing these or selecting the wrong one results in "Schema not found" or "Unable to obtain schema" errors.

10.7 decimal logicalType support issues

In some NiFi versions, when processing the decimal logicalType, precision/scale may be dropped, or values may be corrupted during bytes ↔ BigDecimal conversion. For financial data where precision matters, always validate values before and after conversion.

Summary

Avro is one of the most widely used serialization formats in the big data ecosystem. Its three strengths (embedded schema, schema evolution, and compact binary encoding) make it well-suited for large-scale data pipelines. However, its binary nature makes debugging difficult, and without a precise understanding of schema evolution rules, you'll encounter data compatibility issues in production. Pay particular attention to null handling, timestamp conversion, and Schema Access Strategy settings when integrating with NiFi.

— Data Dynamics Engineering Team