Reading JSON files#
Arrow supports reading columnar data from line-delimited JSON files. In this context, a JSON file consists of multiple JSON objects, one per line, representing individual data rows. For example, this file represents two rows of data with four columns "a", "b", "c", "d":
{"a":1,"b":2.0,"c":"foo","d":false}
{"a":4,"b":-5.5,"c":null,"d":true}
The features currently offered are the following:
multi-threaded or single-threaded reading
automatic decompression of input files (based on the filename extension, such as my_data.json.gz)
sophisticated type inference (see below)
Note
Currently only the line-delimited JSON format is supported.
Usage#
JSON reading functionality is available through the pyarrow.json module. In many cases, you will simply call the read_json() function with the file path you want to read from:
>>> from pyarrow import json
>>> fn = 'my_data.json'
>>> table = json.read_json(fn)
>>> table
pyarrow.Table
a: int64
b: double
c: string
d: bool
>>> table.to_pandas()
   a    b     c      d
0  1  2.0   foo  False
1  4 -5.5  None   True
Automatic Type Inference#
Arrow data types are inferred from the JSON types and values of each column:
JSON null values convert to the null type, but can fall back to any other type.
JSON booleans convert to bool_.
JSON numbers convert to int64, falling back to float64 if a non-integer is encountered.
JSON strings of the kind "YYYY-MM-DD" and "YYYY-MM-DD hh:mm:ss" convert to timestamp[s], falling back to utf8 if a conversion error occurs.
JSON arrays convert to a list type, and inference proceeds recursively on the JSON arrays' values.
Nested JSON objects convert to a struct type, and inference proceeds recursively on the JSON objects' values.
Thus, reading this JSON file:
{"a":[1,2],"b":{"c":true,"d":"1991-02-03"}}
{"a":[3,4,5],"b":{"c":false,"d":"2019-04-01"}}
returns the following data:
>>> table = json.read_json("my_data.json")
>>> table
pyarrow.Table
a: list<item: int64>
  child 0, item: int64
b: struct<c: bool, d: timestamp[s]>
  child 0, c: bool
  child 1, d: timestamp[s]
>>> table.to_pandas()
           a                                       b
0     [1, 2]   {'c': True, 'd': 1991-02-03 00:00:00}
1  [3, 4, 5]  {'c': False, 'd': 2019-04-01 00:00:00}
Customized parsing#
To change the default parsing settings when reading JSON files with an unusual structure, create a ParseOptions instance and pass it to read_json(). For example, you can pass an explicit schema in order to bypass automatic type inference.
Similarly, you can choose performance settings by passing a ReadOptions instance to read_json().
Incremental reading#
For memory-constrained environments, it is also possible to read a JSON file one batch at a time, using open_json().
In this case, type inference is done on the first block and types are frozen afterwards. To make sure the right data types are inferred, either set ReadOptions.block_size to a large enough value, or use ParseOptions.explicit_schema to set the desired data types explicitly.