The Complete Apache Avro Schema Guide — Structure, Data Types, Java Processing, and NiFi Issues
A comprehensive overview of Avro Schema: purpose and structure, Primitive/Complex/Logical data types, null handling, Java serialization/deserialization, and NiFi integration pitfalls.
Apache Avro is a row-based data serialization framework born in the Hadoop ecosystem. Because the schema is stored alongside the data, producers and consumers can evolve independently, and binary encoding provides advantages in both size and speed compared to JSON/CSV. This post systematically covers Avro Schema.
1. Purpose of Avro Schema
- Data serialization/deserialization: Fast and compact data exchange in binary format
- Schema evolution: Forward/backward compatibility when adding or removing fields
- Hadoop ecosystem standard format: Native support in Hive, Spark, NiFi, Kafka, Flink, and more
- Schema Registry integration: Centralized schema management when combined with Confluent Schema Registry and similar tools
2. Structure of Avro Schema
Avro Schema is defined in JSON format.
{
  "type": "record",
  "name": "User",
  "namespace": "io.datadynamics.example",
  "doc": "A record representing user information",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
| Key | Description |
|---|---|
| type | Top level is typically "record" |
| name | Schema (record) name |
| namespace | Namespace to prevent name collisions, similar to Java packages |
| doc | Schema description (optional) |
| fields | Array of fields. Each field has name, type, and optionally default, doc, order |
3. Primitive Data Types
| Avro Type | Encoded Size | Java Mapping | Description |
|---|---|---|---|
| null | 0 bytes | null | No value |
| boolean | 1 byte | boolean | true / false |
| int | 1-5 bytes (variable-length zig-zag) | int | 32-bit signed integer |
| long | 1-10 bytes (variable-length zig-zag) | long | 64-bit signed integer |
| float | 4 bytes | float | 32-bit IEEE 754 floating point |
| double | 8 bytes | double | 64-bit IEEE 754 floating point |
| bytes | variable (length prefix + data) | ByteBuffer | Arbitrary byte sequence |
| string | variable (length prefix + UTF-8 data) | CharSequence / String | UTF-8 string |
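On the wire, the Avro specification writes int and long values with variable-length zig-zag encoding, so the encoded size depends on the magnitude of the value rather than on the width of the Java type. The following is a minimal hand-written sketch of that encoding using only the JDK; the class and method names are hypothetical, and real code should rely on Avro's own BinaryEncoder.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper: illustrates Avro's zig-zag + base-128 varint encoding for long values.
final class ZigZagSketch {

    static byte[] encodeLong(long value) {
        // Zig-zag maps small negative and positive numbers to small unsigned numbers.
        long n = (value << 1) ^ (value >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((n & ~0x7FL) != 0) {                 // more than 7 significant bits remain
            out.write((int) ((n & 0x7F) | 0x80));   // low 7 bits plus a continuation flag
            n >>>= 7;
        }
        out.write((int) n);                         // final byte, no continuation flag
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long[] samples = {0L, -1L, 64L, 1713945600000L, Long.MIN_VALUE};
        for (long v : samples) {
            System.out.println(v + " -> " + encodeLong(v).length + " byte(s)");
        }
    }
}
```

In this sketch, small IDs and counters cost a single byte, while an epoch-milliseconds timestamp costs six.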
4. Logical Data Types — Dates and Times
Avro extends the meaning of Primitive types by adding logicalType annotations on top of them. Date/time-related Logical Types are particularly diverse.
4.1 Date
{"name": "birth_date", "type": {"type": "int", "logicalType": "date"}}
- Base type: int
- Meaning: Number of days since 1970-01-01
- Example: 19825 → 2024-04-12
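A quick way to sanity-check a date value is to convert it with java.time. This is a minimal sketch using only the JDK; the class and method names are illustrative and not part of Avro.

```java
import java.time.LocalDate;

// Hypothetical helper: converts between Avro "date" ints and java.time.LocalDate.
final class AvroDateSketch {

    // Avro's date logical type stores the number of days since 1970-01-01.
    static LocalDate fromAvroDate(int daysSinceEpoch) {
        return LocalDate.ofEpochDay(daysSinceEpoch);
    }

    static int toAvroDate(LocalDate date) {
        // toIntExact fails loudly instead of silently overflowing the int base type.
        return Math.toIntExact(date.toEpochDay());
    }

    public static void main(String[] args) {
        System.out.println(fromAvroDate(0));                       // 1970-01-01
        System.out.println(toAvroDate(LocalDate.of(2024, 1, 1)));  // 19723
    }
}
```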
4.2 Time
| logicalType | Base Type | Precision | Example Value |
|---|---|---|---|
| time-millis | int | milliseconds | 43200000 → 12:00:00.000 |
| time-micros | long | microseconds | 43200000000 → 12:00:00.000000 |
{"name": "login_time", "type": {"type": "int", "logicalType": "time-millis"}}
{"name": "precise_time", "type": {"type": "long", "logicalType": "time-micros"}}
4.3 Timestamp
| logicalType | Base Type | Precision | Timezone |
|---|---|---|---|
| timestamp-millis | long | milliseconds | UTC |
| timestamp-micros | long | microseconds | UTC |
| local-timestamp-millis | long | milliseconds | Local (no timezone info) |
| local-timestamp-micros | long | microseconds | Local (no timezone info) |
{"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
{"name": "event_ts", "type": {"type": "long", "logicalType": "timestamp-micros"}}
{"name": "local_created", "type": {"type": "long", "logicalType": "local-timestamp-millis"}}
{"name": "local_event", "type": {"type": "long", "logicalType": "local-timestamp-micros"}}
The difference between timestamp-millis and timestamp-micros is precision. A present-day millis value such as 1713945600000 has 13 digits, while the same instant in micros, 1713945600000000, has 16. Choose based on the precision of your data source. For example, an RDBMS TIMESTAMP(6) has microsecond precision, so timestamp-micros is appropriate.
local-timestamp-* is available in Avro 1.10 and later. Use it when you want to store the time "as-is" without timezone conversion.
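The time and timestamp encodings described above can be checked with java.time. The sketch below uses only the JDK, and its class and method names are hypothetical. It decodes a time-millis value, shows that a millis and a micros value can name the same instant, and treats a local-timestamp value as a wall-clock reading with no zone attached.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneOffset;

// Hypothetical helper: decodes Avro time/timestamp values with java.time.
final class AvroTimeSketch {

    // time-millis: milliseconds after midnight (int base type).
    static LocalTime fromTimeMillis(int millisOfDay) {
        return LocalTime.ofNanoOfDay(millisOfDay * 1_000_000L);
    }

    // timestamp-millis: milliseconds since 1970-01-01T00:00:00Z.
    static Instant fromTimestampMillis(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    // timestamp-micros: microseconds since 1970-01-01T00:00:00Z.
    static Instant fromTimestampMicros(long epochMicros) {
        return Instant.ofEpochSecond(
                Math.floorDiv(epochMicros, 1_000_000L),
                Math.floorMod(epochMicros, 1_000_000L) * 1_000L);
    }

    // local-timestamp-millis: same arithmetic, but the result is a zone-less wall-clock value.
    // ZoneOffset.UTC is used only to do the calendar arithmetic without any shift.
    static LocalDateTime fromLocalTimestampMillis(long millis) {
        int nanos = (int) Math.floorMod(millis, 1000L) * 1_000_000;
        return LocalDateTime.ofEpochSecond(Math.floorDiv(millis, 1000L), nanos, ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        System.out.println(fromTimeMillis(43_200_000));                  // 12:00
        System.out.println(fromTimestampMillis(1713945600000L));         // 2024-04-24T08:00:00Z
        System.out.println(fromTimestampMicros(1713945600000000L));      // 2024-04-24T08:00:00Z
        System.out.println(fromLocalTimestampMillis(1713945600000L));    // 2024-04-24T08:00
    }
}
```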
4.4 Other Logical Types
| logicalType | Base Type | Description |
|---|---|---|
| decimal | bytes or fixed | Fixed-point decimal. precision is required; scale is optional and defaults to 0 |
| uuid | string | RFC 4122 UUID |
| duration | fixed(12) | months(4) + days(4) + millis(4), each a little-endian unsigned 32-bit integer |
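The duration layout in the table is less obvious than the other types, so a small sketch may help. It packs months, days, and milliseconds into the 12-byte fixed value as three little-endian unsigned 32-bit integers, following the Avro specification. It uses only the JDK, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: packs and unpacks the 12-byte Avro "duration" layout.
final class AvroDurationSketch {

    static byte[] encode(long months, long days, long millis) {
        ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(toUnsigned32(months));
        buf.putInt(toUnsigned32(days));
        buf.putInt(toUnsigned32(millis));
        return buf.array();
    }

    // Returns {months, days, millis} as longs so unsigned values above Integer.MAX_VALUE survive.
    static long[] decode(byte[] fixed12) {
        ByteBuffer buf = ByteBuffer.wrap(fixed12).order(ByteOrder.LITTLE_ENDIAN);
        return new long[] {
            Integer.toUnsignedLong(buf.getInt()),
            Integer.toUnsignedLong(buf.getInt()),
            Integer.toUnsignedLong(buf.getInt())
        };
    }

    private static int toUnsigned32(long value) {
        if (value < 0 || value > 0xFFFF_FFFFL) {
            throw new IllegalArgumentException("out of unsigned 32-bit range: " + value);
        }
        return (int) value;
    }

    public static void main(String[] args) {
        byte[] oneMonthFifteenDaysOneHour = encode(1, 15, 3_600_000);
        long[] parts = decode(oneMonthFifteenDaysOneHour);
        System.out.println(parts[0] + " months, " + parts[1] + " days, " + parts[2] + " ms");
    }
}
```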
{"name": "price", "type": {"type": "bytes", "logicalType": "decimal", "precision": 10, "scale": 2}}
{"name": "uuid", "type": {"type": "string", "logicalType": "uuid"}}
5. Complex Data Types
5.1 Array
{
  "name": "tags",
  "type": {
    "type": "array",
    "items": "string"
  }
}
items can be any type. Nested records are also possible:
{
  "name": "addresses",
  "type": {
    "type": "array",
    "items": {
      "type": "record",
      "name": "Address",
      "fields": [
        {"name": "city", "type": "string"},
        {"name": "zipcode", "type": "string"}
      ]
    }
  }
}
5.2 Map
Keys are always string, and the value type is specified with values.
{
  "name": "metadata",
  "type": {
    "type": "map",
    "values": "string"
  }
}
5.3 Enum
{
  "name": "status",
  "type": {
    "type": "enum",
    "name": "Status",
    "symbols": ["ACTIVE", "INACTIVE", "SUSPENDED"]
  }
}
5.4 Fixed
A fixed-length byte array.
{
  "name": "md5",
  "type": {
    "type": "fixed",
    "name": "MD5",
    "size": 16
  }
}
5.5 Union
Allows one of several types. Essential for creating nullable fields.
{"name": "middle_name", "type": ["null", "string"], "default": null}
6. The Difference Between Nullable and Non-Nullable Fields
This is one of the most common sources of mistakes in Avro schemas.
Non-nullable field
{"name": "name", "type": "string"}
- A value is mandatory.
- If the value is null, serialization throws an exception.
- If this field is added during schema evolution without a default, previously written data cannot be read with the new schema.
Nullable field
{"name": "name", "type": ["null", "string"], "default": null}
- If there's no value, it's stored as null.
- In the union notation ["null", "string"], the first type must match the type of the default.
- If this field is added during schema evolution, previously written data can still be read (the field is filled with the null default).
Order matters
// Correct: default is null, so "null" comes first
{"name": "email", "type": ["null", "string"], "default": null}
// Wrong: "string" is first but default is null
{"name": "email", "type": ["string", "null"], "default": null} // Error!
Practical recommendation: Always use the ["null", "actual_type"] + "default": null pattern for optional fields. Readers using the new schema can then still read data written before the field existed.
Schema Evolution Compatibility Comparison
| Change | Non-nullable | Nullable (default: null) |
|---|---|---|
| Add new field → read previous data | Fails if no default | Succeeds, filled with null |
| Remove field → read new data | Ignored | Ignored |
| Writer/Reader schema mismatch | Fails strictly | Handles flexibly |
7. Avro Serialization / Deserialization in Java
7.1 Maven dependency
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.11.3</version>
</dependency>
7.2 Schema definition and GenericRecord approach
This approach handles things dynamically without code generation.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import java.io.File;
public class AvroGenericExample {
    public static void main(String[] args) throws Exception {
        // 1. Define schema
        String schemaJson = """
            {
              "type": "record",
              "name": "User",
              "namespace": "io.datadynamics.example",
              "fields": [
                {"name": "id", "type": "long"},
                {"name": "name", "type": "string"},
                {"name": "email", "type": ["null", "string"], "default": null},
                {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
              ]
            }
            """;
        Schema schema = new Schema.Parser().parse(schemaJson);
        // 2. Serialize: write Avro file
        File file = new File("users.avro");
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(writer)) {
            fileWriter.create(schema, file);
            GenericRecord user = new GenericData.Record(schema);
            user.put("id", 1L);
            user.put("name", "John Doe");
            user.put("email", "john@example.com");
            user.put("created_at", System.currentTimeMillis());
            fileWriter.append(user);
            GenericRecord user2 = new GenericData.Record(schema);
            user2.put("id", 2L);
            user2.put("name", "Jane Smith");
            user2.put("email", null); // nullable field
            user2.put("created_at", System.currentTimeMillis());
            fileWriter.append(user2);
        }
        // 3. Deserialize: read Avro file
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        try (DataFileReader<GenericRecord> fileReader = new DataFileReader<>(file, reader)) {
            while (fileReader.hasNext()) {
                GenericRecord record = fileReader.next();
                System.out.println("ID: " + record.get("id")
                    + ", Name: " + record.get("name")
                    + ", Email: " + record.get("email")
                    + ", Created: " + record.get("created_at"));
            }
        }
    }
}
7.3 Code generation (SpecificRecord) approach
Use avro-maven-plugin to auto-generate Java classes from .avsc files.
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.11.3</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals><goal>schema</goal></goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
Using the generated classes provides type safety:
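import java.time.Instant;
// Using the auto-generated User class.
// With the plugin's default settings in Avro 1.11, a timestamp-millis field is generated as java.time.Instant.
User user = User.newBuilder()
    .setId(1L)
    .setName("John Doe")
    .setEmail("john@example.com")
    .setCreatedAt(Instant.now())
    .build();
7.4 Serialization to byte array (for Kafka, etc.)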
Use this when converting to a byte array instead of a file.
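import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import java.io.ByteArrayOutputStream;
// `schema` and `record` are the Schema and GenericRecord built as in section 7.2.
// Serialize
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
writer.write(record, encoder);
encoder.flush(); // the encoder buffers, so flush before reading the bytes
byte[] bytes = out.toByteArray();
// Deserialize: the raw bytes carry no schema, so the reader must be given the writer's schema
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
GenericRecord result = reader.read(null, decoder);
8. Advantages of Avro Files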
| Advantage | Description |
|---|---|
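| Embedded schema | The writer schema is included in the file header, so a file can be read without looking up the schema elsewhere |
| Compact binary | Often 50-80% smaller than the equivalent JSON, depending on the data, because field names are not repeated in every record |
| Schema evolution | Forward/Backward/Full compatibility support |
| Fast serialization | Generally in the same performance class as Protobuf, though results depend on the workload |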
| Compression support | Built-in block compression with Snappy, Deflate, Zstandard |
| Splittable | Can be split by block for parallel processing on HDFS (MapReduce, Spark) |
| Multi-language support | Libraries available for Java, Python, C, C++, Go, and more |
| Hadoop ecosystem integration | Native support in Hive, Pig, Spark, Kafka, NiFi, and more |
9. Disadvantages and Caveats of Avro Files
9.1 Not human-readable
As a binary format, contents cannot be inspected with tools like cat or less. Debugging requires JSON conversion with avro-tools.
java -jar avro-tools-1.11.3.jar tojson users.avro
9.2 Schema evolution pitfalls
- Field renaming breaks compatibility. Avro matches fields by name, so a rename is effectively a delete plus an add.
- Be careful with Union type changes: ["null", "string"] → ["null", "int"] is not compatible.
- Adding a Non-nullable field without a default makes previous data unreadable.
- aliases can work around field name changes, but they are declared in the reader schema, so every consumer must adopt the new schema.
9.3 Dangers of Enum changes
Deleting an existing symbol from an Enum makes it impossible to deserialize previous data containing that value. Only add Enum symbols — never delete them.
9.4 Precision issues with decimal type
- bytes-based decimal has variable length, requiring additional processing for sorting
- fixed-based decimal requires calculating the exact size for the precision
- Precision/scale mismatches when integrating with Hive/Spark can cause data truncation or exceptions
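These points can be made concrete with plain JDK types. Per the Avro specification, a decimal is stored as the two's-complement big-endian bytes of the unscaled value. The sketch below uses hypothetical class and method names and is not Avro's own conversion code. It shows the variable length of the bytes form, one way to compute the smallest fixed size for a given precision, and a scale mismatch surfacing as an exception instead of silent truncation.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

// Hypothetical helper: mirrors how an Avro decimal maps to bytes, using only the JDK.
final class AvroDecimalSketch {

    // Unscaled value as two's-complement big-endian bytes.
    // RoundingMode.UNNECESSARY makes a scale mismatch throw ArithmeticException.
    static byte[] toBytes(BigDecimal value, int scale) {
        return value.setScale(scale, RoundingMode.UNNECESSARY).unscaledValue().toByteArray();
    }

    static BigDecimal fromBytes(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    // Smallest fixed size (in bytes) that holds any signed value with the given precision.
    static int minFixedSize(int precision) {
        BigInteger max = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
        return max.bitLength() / 8 + 1; // magnitude bits plus one sign bit, rounded up to bytes
    }

    public static void main(String[] args) {
        System.out.println(toBytes(new BigDecimal("0.05"), 2).length);        // 1
        System.out.println(toBytes(new BigDecimal("12345678.90"), 2).length); // 4
        System.out.println(minFixedSize(10));                                 // 5
        try {
            toBytes(new BigDecimal("1.234"), 2); // three decimal places do not fit scale 2
        } catch (ArithmeticException e) {
            System.out.println("scale mismatch rejected");
        }
    }
}
```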
9.5 Large collection performance
If an Array or Map contains tens of thousands or more elements, the serialization/deserialization cost for a single record spikes. In such cases, normalizing into separate records is better.
9.6 Timestamp precision mismatch
If the source system has nanosecond precision but the Avro schema uses timestamp-millis, data loss occurs. Conversely, storing millisecond data as timestamp-micros results in unnecessarily large values. Always verify the source's precision first.
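A simple check before choosing the logical type is to round-trip a sample source value and see whether anything is lost. The sketch below uses only the JDK; the class and method names are hypothetical.

```java
import java.time.Instant;

// Hypothetical helper: shows what each timestamp logical type keeps of a nanosecond-precision value.
final class TimestampPrecisionSketch {

    static long toTimestampMillis(Instant t) {
        return t.toEpochMilli(); // drops everything below one millisecond
    }

    static long toTimestampMicros(Instant t) {
        // getNano() is always non-negative, so this also floors correctly before 1970.
        return Math.addExact(Math.multiplyExact(t.getEpochSecond(), 1_000_000L), t.getNano() / 1_000);
    }

    // True if storing the value as timestamp-millis would discard sub-millisecond digits.
    static boolean losesPrecisionAsMillis(Instant t) {
        return !Instant.ofEpochMilli(t.toEpochMilli()).equals(t);
    }

    public static void main(String[] args) {
        Instant source = Instant.parse("2024-04-24T08:00:00.123456789Z");
        System.out.println(toTimestampMillis(source));      // 1713945600123
        System.out.println(toTimestampMicros(source));      // 1713945600123456
        System.out.println(losesPrecisionAsMillis(source)); // true
    }
}
```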
9.7 Writer/Reader schema mismatch
Avro allows the writer schema and reader schema to differ, but incompatible changes cause an AvroTypeException at runtime. Using a Schema Registry to validate compatibility in advance is recommended.
10. Issues When Using Avro Schema with NiFi
10.1 AvroSchemaRegistry and schema text
When registering schemas in NiFi's AvroSchemaRegistry, JSON string escaping issues frequently occur. Pay special attention to line breaks getting corrupted or quotes being converted during copy/paste in the UI.
10.2 Null handling in ConvertRecord
When converting CSV to Avro with the ConvertRecord processor, CSV empty strings ("") are handled differently from Avro null.
- When an empty string arrives for a field defined as ["null", "string"], it's stored as "" rather than null
- Conversely, if the CSV column is completely missing and the schema is Non-nullable, the conversion fails
10.3 Timestamp conversion issues
When using ConvertRecord or UpdateRecord in NiFi:
- Conversion between source data date formats (e.g., "2026-04-12 15:30:00") and Avro's timestamp-millis (epoch milliseconds) may not happen automatically
- You may need to handle epoch conversion manually with RecordPath or a separate UpdateAttribute
- Timezone settings follow the JVM default, so results may vary across servers
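The timezone point applies to any JVM-based conversion of a zone-less string. The sketch below is plain Java rather than NiFi configuration, and its class and method names are hypothetical. It shows how the same source string produces different timestamp-millis values depending on the zone used to interpret it, which is why an explicit zone is safer than the JVM default.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: converts a zone-less timestamp string to epoch milliseconds.
final class EpochConversionSketch {

    private static final DateTimeFormatter SOURCE_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Interprets the text in the given zone; the result is what a timestamp-millis field would hold.
    static long toEpochMillis(String text, ZoneId zone) {
        return LocalDateTime.parse(text, SOURCE_FORMAT).atZone(zone).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        String source = "2026-04-12 15:30:00";
        long utc = toEpochMillis(source, ZoneId.of("UTC"));
        long seoul = toEpochMillis(source, ZoneId.of("Asia/Seoul"));
        long jvmDefault = toEpochMillis(source, ZoneId.systemDefault());
        System.out.println("UTC:         " + utc);
        System.out.println("Asia/Seoul:  " + seoul);      // 32,400,000 ms less than the UTC result
        System.out.println("JVM default: " + jvmDefault); // depends on the server
    }
}
```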
10.4 FlowFile compatibility during schema evolution
When you change an Avro schema mid-pipeline in NiFi:
- FlowFiles already queued are serialized with the previous schema
- When a processor with the changed schema reads these FlowFiles, AvroTypeException or missing fields occur
- Drain the queue before changing the schema, or use the "Embedded Avro Schema" option in the Reader
10.5 Memory issues with large Avro records
NiFi's Record-based processors must hold at least the record currently being processed in memory, and some hold considerably more of the FlowFile. If Avro records contain large bytes fields or Arrays with tens of thousands of elements, NiFi node heap memory can be exhausted.
10.6 Schema Access Strategy configuration errors
When the Schema Access Strategy configured on a NiFi Record Reader/Writer does not match how the schema is actually supplied, the schema cannot be found.
- Use Embedded Avro Schema: Uses the schema from the Avro file header
- Use 'Schema Name' Property: Looks up the schema by name in the configured registry, such as AvroSchemaRegistry
- Use 'Schema Text' Property: Uses the schema JSON entered directly in the Reader/Writer
Mixing these or selecting the wrong one results in "Schema not found" or "Unable to obtain schema" errors.
10.7 decimal logicalType support issues
In some NiFi versions, when processing the decimal logicalType, precision/scale may be dropped, or values may be corrupted during bytes ↔ BigDecimal conversion. For financial data where precision matters, always validate values before and after conversion.
Summary
Avro Schema is one of the most widely used serialization formats in the big data ecosystem. Thanks to its three strengths — embedded schema, schema evolution, and binary compression — it's well-suited for large-scale data pipelines. However, its binary nature makes debugging difficult, and without a precise understanding of schema evolution rules, you'll encounter data compatibility issues in production. Pay particular attention to null handling, timestamp conversion, and Schema Access Strategy settings when integrating with NiFi.
— Data Dynamics Engineering Team