Polars - Reading JSON Files
Polars, a high-performance DataFrame library, excels at handling JSON data. In this article, we’ll explore how to efficiently read JSON strings and files using Polars.
Converting JSON String Columns to Dictionaries
Often, DataFrames contain columns with JSON strings. Suppose you want to filter this DataFrame based on specific keys or values within the JSON strings. The most robust approach is to convert the JSON strings to dictionaries.
However, Polars doesn’t work with standard dictionaries. Instead, it uses the concept of “structs,” where each dictionary key maps to a struct’s “field name,” and the corresponding dictionary value becomes the “struct value.” Additionally, there are two constraints for creating a struct type:
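1. All structs must have the same field names.
2. Field names must be listed in the same order.
JSON strings whose keys differ from row to row do not meet these constraints directly. For that case, Polars provides the str.json_path_match expression, which extracts a value from each string using JSONPath syntax and returns null when the path has no match. This lets you check whether a key exists and retrieve its value without building a struct first.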
Here’s how you can do it:
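import polars as pl

# Each string is a JSON object; note that the keys differ from row to row.
json_list = [
    """{"name": "Maria", "position": "developer", "office": "Seattle"}""",
    """{"name": "Josh", "position": "analyst", "termination_date": "2020-01-01"}""",
    """{"name": "Jorge", "position": "architect", "office": "", "manager_st_dt": "2020-01-01"}""",
]

# Build the DataFrame and add a 1-based row index column named "id".
# with_row_index replaces with_row_count, which is deprecated in recent Polars releases.
df = pl.DataFrame({"tags": json_list}).with_row_index("id", offset=1)

# Extract individual values by JSONPath; each result becomes a new string column.
df = df.with_columns([
    pl.col('tags').str.json_path_match(r"$.name").alias('name'),
    pl.col('tags').str.json_path_match(r"$.office").alias('location'),
    pl.col('tags').str.json_path_match(r"$.manager_st_dt").alias('manager start date'),
])

# json_path_match returns null if the key is not found,
# so this keeps only the rows that contain "manager_st_dt".
df = df.filter(pl.col('tags').str.json_path_match(r"$.manager_st_dt").is_not_null())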
In the example above, we create a DataFrame with a column named "tags" containing JSON strings. We use the json_path_match function to extract specific values into three new columns: "name", "location", and "manager start date". The final filter keeps only the rows whose JSON contains the manager_st_dt key, which here is the row for Jorge.
Reading Large JSON Files as DataFrames
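When dealing with large JSON files, Polars infers the schema from only the first rows of the data. The number of rows is controlled by infer_schema_length, which defaults to 100 in current releases. If a row later in the file contains a field or type that was not seen in that sample, you might face errors. To prevent this, set infer_schema_length to None so that Polars scans the entire data before fixing the schema. This can be slow, but it ensures the inferred schema covers every row.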
df = pl.read_json("large_file.json", infer_schema_length=None)
Publish Date: 2024-05-08, Update Date: 2024-05-08