
How to convert a 500GB SQL table into Apache Parquet?


Apache Spark can be used to do this:

1. Load your table from MySQL via JDBC.
2. Save it as a Parquet file.

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.jdbc(
    "YOUR_MYSQL_JDBC_CONN_STRING",
    "YOUR_TABLE",
    properties={"user": "YOUR_USER", "password": "YOUR_PASSWORD"},
)

df.write.parquet("YOUR_HDFS_FILE")
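
For a table in the 500GB range, reading through a single JDBC connection can be a bottleneck. Below is a minimal sketch of a partitioned read, assuming the table has a numeric column (here a hypothetical auto-increment key "id") whose approximate min/max you know; the column name, bounds, and partition count are placeholders you would adjust for your data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical connection details -- replace with your own.
url = "jdbc:mysql://your-host:3306/your_db"
props = {"user": "YOUR_USER", "password": "YOUR_PASSWORD"}

# Split the read on a numeric column so Spark pulls the table
# in parallel chunks instead of one huge single-threaded fetch.
df = spark.read.jdbc(
    url,
    "YOUR_TABLE",
    column="id",           # numeric column to partition on (assumption)
    lowerBound=1,          # approximate minimum value of that column
    upperBound=500000000,  # approximate maximum value (assumption)
    numPartitions=200,     # number of parallel JDBC reads
    properties=props,
)

df.write.parquet("YOUR_HDFS_PATH")

Each partition is written as its own Parquet file under the output path, which also keeps individual files at a manageable size.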


The odbc2parquet command line tool might also be helpful, for example when no Spark cluster is available and the table can be streamed over a single ODBC connection.

# -vvv                Verbose log output, good to know it is still doing something during large downloads
# query               Subcommand for accessing data and storing it
# --batch-size        Batch size in rows
# --batches-per-file  Omit to store the entire query result in a single file
# out.par             Path to the output Parquet file
odbc2parquet -vvv query \
    --connection-string "${ODBC_CONNECTION_STRING}" \
    --batch-size 100000 \
    --batches-per-file 100 \
    out.par \
    "SELECT * FROM YourTable"