Parquet without Hadoop?

hadoop hdfs parquet

Investigating the same question, I found that this is apparently not possible at the moment. I found this git issue, which proposes decoupling Parquet from the Hadoop API. Apparently it has not been done yet.

In the Apache Jira I found an issue that asks for a way to read a Parquet file outside Hadoop. It is unresolved at the time of writing.

EDIT:

Issues are no longer tracked on GitHub (the first link above is dead). A newer issue I found is located on Apache's Jira with the following headline:

make it easy to read and write parquet files in java without depending on hadoop


Since Parquet is just a file format, it is obviously possible to decouple it from the Hadoop ecosystem. Nowadays the simplest approach I could find is through Apache Arrow, whose Python bindings (PyArrow) read and write Parquet files directly.

Here is a small excerpt from the official PyArrow docs:

Writing

In [2]: import numpy as np

In [3]: import pandas as pd

In [4]: import pyarrow as pa

In [5]: df = pd.DataFrame({'one': [-1, np.nan, 2.5],
   ...:                    'two': ['foo', 'bar', 'baz'],
   ...:                    'three': [True, False, True]},
   ...:                    index=list('abc'))
   ...:

In [6]: table = pa.Table.from_pandas(df)

In [7]: import pyarrow.parquet as pq

In [8]: pq.write_table(table, 'example.parquet')

Reading

In [11]: pq.read_table('example.parquet', columns=['one', 'three'])
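To illustrate the point that Parquet is only a file format: a Parquet file starts and ends with the 4-byte magic string PAR1, so a rough plausibility check needs nothing beyond the Python standard library. The sketch below is only that, a sanity check. The helper name looks_like_parquet is made up for this example, and it neither validates the footer metadata nor recognises files written with an encrypted footer.

```python
import os

# Unencrypted Parquet files begin and end with these four bytes.
PARQUET_MAGIC = b"PAR1"

# Smallest conceivable file: leading magic + 4-byte footer length + trailing magic.
MIN_SIZE = 12


def looks_like_parquet(path):
    """Return True if the file at `path` starts and ends with the Parquet magic.

    Hypothetical helper for illustration; it does not parse or validate the
    footer, so a True result only means "plausibly Parquet".
    """
    if os.path.getsize(path) < MIN_SIZE:
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Calling looks_like_parquet('example.parquet') on the file written above should return True; actually reading the data still requires a Parquet library such as PyArrow.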

EDIT:

With Pandas directly

It is also possible to use pandas directly to read and write DataFrames, provided a Parquet engine such as pyarrow or fastparquet is installed. This makes it as simple as my_df.to_parquet("myfile.parquet") and my_df = pd.read_parquet("myfile.parquet").


What type of data do you have in Parquet? You don't need HDFS to read Parquet files; it is definitely not a prerequisite. We use Parquet files at Incorta for our staging tables. We do not ship with a dependency on HDFS, although you can store the files on HDFS if you want. Obviously, we at Incorta can read directly from the Parquet files, but you can also use Apache Drill to connect: use file:/// as the connection instead of hdfs:///. See below for an example.

To read or write Parquet data, you need to include the Parquet format in the storage plugin format definitions. The dfs plugin definition includes the Parquet format.

{
  "type" : "file",
  "enabled" : true,
  "connection" : "file:///",
  "workspaces" : {
    "json_files" : {
      "location" : "/incorta/tenants/demo//drill/json/",
      "writable" : false,
      "defaultInputFormat" : "json"
    }
  },
