
Why You Should Not Trust JSON Decoding

2025-03-21


So there I was, sitting at my desk, coffee in hand, thinking I knew everything I needed to know about decoding JSON. It was just another Thursday evening: the pipeline ran reliably as always, processing hundreds of millions of metrics per hour. Until JSON decided to humble me.

Let me set the scene. I’ve been a data engineer for almost six years now. I’ve seen things. Data arriving that should never have been possible to create in the first place. Pipeline-breaking changes that land at 4:59 PM on a Friday. But nothing—NOTHING—prepared me for the betrayal that came from my old friend json.loads().

The Incident

It was a simple task. Parse some JSON produced by a cloud transformation, extract a few fields, transform them, and push the result downstream. Easy peasy, right? The code looked something like this:

import json

def transform_json_metrics_string(
    metrics_name: str,
    metrics_string: str,
    schema: dict,
) -> tuple[dict, set[str]]:

    # Parse raw json string into untyped python dict
    try:
        metrics_raw = json.loads(metrics_string)
    except ValueError:
        return {}, {f"{metrics_name}: contained invalid json string"}
    except TypeError:
        return {}, {f"{metrics_name}: contained invalid type instead of json string"}

    # Type cast according to agreed schema
    metrics, warnings = type_cast_metrics_dict(metrics_name, metrics_raw, schema)

    return metrics, warnings

I’d written this pattern a thousand times before. Get the JSON, decode it, access the fields by key. What could go wrong?

Well, let me tell you what went wrong.

Exceeded max_retries of 3
The above exception was caused by the following exception:
AttributeError: 'str' object has no attribute 'items'

Stack Trace:
  File "/app/data_management_pipelines/assets/idd/ingest.py", line 139, in load_file_to_clickhouse
    ch_dataframe = pd.DataFrame.from_records(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 2461, in from_records
    values += data
  File "/app/data_management_pipelines/utils/transformation.py", line 382, in transform_rows
    transformed_row = transform_row(row, schema_dict, warnings)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/data_management_pipelines/utils/transformation.py", line 409, in transform_row
    metrics, new_warnings = transform_json_metrics_string(row.metricstype, row.log, schema)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/data_management_pipelines/utils/transformation.py", line 361, in transform_json_metrics_string
    metrics, warnings = type_cast_metrics_dict(metrics_type, metrics_raw, schema)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/data_management_pipelines/utils/transformation.py", line 234, in type_cast_metrics_dict
    for name, value in metrics_raw.items():
                       ^^^^^^^^^^^^^^^^^

Wait, what? Where’s my dictionary? Where are my key-value pairs? WHERE IS MY STRUCTURED DATA?!

The Terrible Truth

After some quick debugging (and a terrible realization), I spotted my mistake, which should have been obvious: JSON isn’t just dictionaries. I had simply assumed that objects were the only thing that would ever be sent my way. Valid JSON can of course also be an array, a string, a number, a boolean, or even null.

Two rows out of more than 100 million contained valid JSON, but it happened to be an escaped JSON string, i.e. it decodes to a Python string instead of a dict.

>>> import json
>>> s = r'"{\"reason\": \"Expecting value: line 1 column 1 (char 0)\"}"'
>>> json.loads(s)
'{"reason": "Expecting value: line 1 column 1 (char 0)"}'
>>> type(json.loads(s))
<class 'str'>

My confidence in all of my pipelines took a nosedive. All those times I had trustingly called json.loads() assuming it would give me a dictionary or throw an exception…

As a data engineer, you learn very quickly not to trust the inputs to your pipelines. Always validate the inputs! I thought I had validated enough. It was valid JSON, right? Yes, but that is not enough for my use case!

Understanding JSON

JSON (JavaScript Object Notation) was created by Douglas Crockford in the early 2000s. The key thing to understand: JSON wasn’t designed primarily as a data interchange format for complex systems. It was designed as a subset of JavaScript that could represent JavaScript data structures. In JavaScript, primitive values like strings, numbers, and booleans are perfectly valid data types in their own right. When Crockford was formalizing JSON, he wasn’t thinking about data engineers who would someday need to reliably extract nested fields from complex structures. He was thinking about representing JavaScript values in a language-independent way. The official JSON specification makes this clear:

A JSON text is a serialized value. […] A JSON value MUST be an object, array, number, or string, or one of the following three literal names: false, null, true.

Did you catch that? A value. Not necessarily an object or an array. A value can be a string, number, boolean, null, object, or array. This makes perfect sense in the JavaScript world, where JSON.parse() is essentially just reversing JSON.stringify(). Python behaves accordingly.

>>> json.loads('123')
123
>>> json.loads('[1,2,3]')
[1, 2, 3]
>>> json.loads('123.12312')
123.12312
>>> type(json.loads('null'))
<class 'NoneType'>

“But that is not how I remember it!” some readers might say. Well, the specification actually contains another interesting paragraph:

A JSON text is a serialized value. Note that certain previous specifications of JSON constrained a JSON text to be an object or an array. Implementations that generate only objects or arrays where a JSON text is called for will be interoperable in the sense that all implementations will accept these as conforming JSON texts.

Aha! Perhaps I was not as careless as I thought! I simply acted on outdated information!

The Expectation Gap

Enough rambling; here is what I want to say. In data engineering, we typically think of data interchange formats as structured containers – maps, objects, dictionaries, whatever you want to call them. We expect key-value pairs. We expect nesting. We expect to be able to access things by key.

But JSON doesn’t care about our expectations. JSON is just saying, “I promised to represent any JavaScript value, and guess what? "{\"reason\": \"Expecting value: line 1 column 1 (char 0)\"}" is a completely valid JavaScript value!”
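
How does a value like that end up in a pipeline in the first place? Usually by accident: something serializes a payload that is already a JSON string, and the object gets wrapped one level deeper. Here is a minimal sketch of that double encoding (the error payload is just illustrative, not the actual upstream code):

import json

error = {"reason": "Expecting value: line 1 column 1 (char 0)"}

# First dump: a normal JSON object
once = json.dumps(error)
# Second dump: the object is now wrapped inside a JSON *string*
twice = json.dumps(once)

print(twice)
# "{\"reason\": \"Expecting value: line 1 column 1 (char 0)\"}"

# Decoding the double-encoded text yields a str, not a dict
decoded = json.loads(twice)
print(type(decoded))              # <class 'str'>

# A second decode recovers the original dict
print(type(json.loads(decoded)))  # <class 'dict'>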

It’s Not Just Python

To be clear, this is not limited to Python!

In JavaScript: JSON.parse('"just a string"') // Returns the string "just a string"

In Ruby: JSON.parse('"surprise!"') # Returns "surprise!"

In Go: json.Unmarshal([]byte(`"gotcha!"`), &result) // result becomes "gotcha!"

The Hard Lessons

So what did I learn from all of this?

  1. Always, ALWAYS validate your JSON structure before accessing it.
  2. Never assume json.loads() will return a dictionary.
  3. Questions like “Is this valid JSON?” and “Is this JSON an object I can access with keys?” are completely different questions.

My code now looks like this:

import json

def transform_json_metrics_string(
    metrics_name: str,
    metrics_string: str,
    schema: dict,
) -> tuple[dict, set[str]]:

    # Parse raw json string into an untyped python value
    try:
        metrics_raw = json.loads(metrics_string)
    except ValueError:
        return {}, {f"{metrics_name}: contained invalid json string"}
    except TypeError:
        return {}, {f"{metrics_name}: contained invalid type instead of json string"}

    # json.loads can return any JSON value; accept only objects
    if not isinstance(metrics_raw, dict):
        return {}, {f"{metrics_name}: json did not contain an object (dict)."}

    # Type cast according to agreed schema
    metrics, warnings = type_cast_metrics_dict(metrics_name, metrics_raw, schema)
    return metrics, warnings

It’s not pretty. It’s not elegant. But it works, and it doesn’t blow up when I get a string instead of an object.
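
As a quick sanity check, feeding the exact double-encoded string from earlier through the new guard produces a warning instead of an AttributeError three calls deep. Note that type_cast_metrics_dict is stubbed out here so the snippet runs stand-alone; the real one applies the agreed schema, and "my_metrics" is just a placeholder name:

# Hypothetical stand-in for the real schema-casting helper
def type_cast_metrics_dict(name, raw, schema):
    return raw, set()

escaped = r'"{\"reason\": \"Expecting value: line 1 column 1 (char 0)\"}"'

metrics, warnings = transform_json_metrics_string("my_metrics", escaped, {})
print(metrics)   # {}
print(warnings)  # {'my_metrics: json did not contain an object (dict).'}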
