Split a large json file into multiple smaller files


Answering the question of whether Python or Node would be better for the task would be a matter of opinion, and we are not allowed to voice our opinions on Stack Overflow. You have to decide for yourself which one you have more experience in and which one you want to work with: Python or Node.

If you go with Node, there are modules that do streaming JSON parsing and can help you with that task.

If you go with Python, there are streaming JSON parsers available as well.
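
One such library for Python is ijson, which can iterate over a top-level JSON array one element at a time without loading the whole file into memory. The sketch below is illustrative only: it assumes the input file is a single large JSON array, and the file names and chunk size are placeholders to tune.

    # Sketch: stream a huge top-level JSON array with ijson (pip install ijson)
    # and write it back out as smaller newline-delimited files.
    import json
    from decimal import Decimal

    import ijson

    ITEMS_PER_FILE = 100_000  # tune this to hit your target output file size


    def _jsonable(value):
        # ijson yields Decimal for non-integer numbers; convert for json.dumps.
        if isinstance(value, Decimal):
            return float(value)
        raise TypeError(f"Cannot serialize {type(value)!r}")


    def _write_chunk(items, dst_prefix, file_no):
        # One JSON object per line, so the output can also be split per line later.
        with open(f"{dst_prefix}_{file_no:05d}.json", "w", encoding="utf-8") as dst:
            for obj in items:
                dst.write(json.dumps(obj, default=_jsonable) + "\n")


    def split_json_array(src_path, dst_prefix):
        chunk, file_no = [], 0
        with open(src_path, "rb") as src:
            # The prefix 'item' selects each element of the top-level array;
            # the file is consumed incrementally, never loaded all at once.
            for item in ijson.items(src, "item"):
                chunk.append(item)
                if len(chunk) >= ITEMS_PER_FILE:
                    _write_chunk(chunk, dst_prefix, file_no)
                    chunk, file_no = [], file_no + 1
        if chunk:
            _write_chunk(chunk, dst_prefix, file_no)


    # Usage (hypothetical paths):
    # split_json_array("big.json", "out/part")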


Snowflake treats JSON in a special way, and once we understand how, it is easy to draw up the design.

  1. JSON/Parquet/Avro/XML are considered semi-structured data.
  2. They are stored using the VARIANT data type in Snowflake.
  3. While loading the JSON data from the stage location, set the flag strip_outer_array = true:

    copy into <table> from @~/<file>.json file_format = (type = 'JSON' strip_outer_array = true);

  4. Each row cannot exceed 16 MB compressed when loaded into Snowflake.

  5. Snowflake data loading works best when files are split into the 10-100 MB range.

Use a utility that splits the file on line boundaries and keeps each output file to no more than 100 MB; that gives you the power of parallelism as well as accuracy for your data.
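
If the data is already newline-delimited JSON (as produced by the sketch above), a simple size-aware splitter is enough. The following Python sketch is illustrative only; the paths and the 100 MB limit are assumptions to tune.

    # Sketch: split a newline-delimited JSON file into pieces of at most ~100 MB,
    # cutting only on line boundaries so every piece stays valid for COPY INTO.
    MAX_BYTES = 100 * 1024 * 1024  # ~100 MB per output file


    def split_by_size(src_path, dst_prefix, max_bytes=MAX_BYTES):
        file_no, written = 0, 0
        dst = open(f"{dst_prefix}_{file_no:05d}.json", "wb")
        with open(src_path, "rb") as src:
            for line in src:  # one JSON document per line
                if written and written + len(line) > max_bytes:
                    dst.close()  # current piece is full; start the next one
                    file_no, written = file_no + 1, 0
                    dst = open(f"{dst_prefix}_{file_no:05d}.json", "wb")
                dst.write(line)
                written += len(line)
        dst.close()


    # Usage (hypothetical paths):
    # split_by_size("big_ndjson.json", "chunks/part")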

Given your data set size, you will end up with around 31K small files (of 100 MB each).

  • That would mean 31K parallel processes running, which is not possible.
  • So choose an X-Large warehouse (16 v-cores & 32 threads).
  • 31K / 32 ≈ 1,000 rounds.
  • Loading will not take long, depending on your network bandwidth; even at 3 seconds per round, the data would load in about 50 minutes.

Look at the warehouse configuration & throughput details and refer to the semi-structured data loading best practices.