How to convert a 500GB SQL table into Apache Parquet?
Apache Spark can be used to do this:
1. Load your table from MySQL via JDBC.
2. Save it as a Parquet file.
Example:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.jdbc(
    "YOUR_MYSQL_JDBC_CONN_STRING",
    "YOUR_TABLE",
    properties={"user": "YOUR_USER", "password": "YOUR_PASSWORD"},
)
df.write.parquet("YOUR_HDFS_FILE")
```
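For a table of this size, a plain `spark.read.jdbc(url, table, properties)` call reads everything over a single JDBC connection, which is usually the bottleneck. Spark's JDBC reader also accepts `column`, `lowerBound`, `upperBound`, and `numPartitions` to split the read across parallel connections. A minimal sketch, where the partition column name `id`, the bounds, the partition count, and the host/database names are assumptions you would replace with values from your own table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.jdbc(
    "jdbc:mysql://YOUR_HOST:3306/YOUR_DB",   # assumed MySQL JDBC URL
    "YOUR_TABLE",
    column="id",                # assumed numeric, indexed column (e.g. primary key)
    lowerBound=1,               # MIN(id) -- query it beforehand
    upperBound=2_000_000_000,   # MAX(id) -- query it beforehand
    numPartitions=200,          # number of parallel JDBC reads / output parts
    properties={"user": "YOUR_USER", "password": "YOUR_PASSWORD"},
)
df.write.parquet("YOUR_HDFS_DIR")
```

Each partition issues its own range query against the partition column, so picking an indexed, roughly evenly distributed column keeps the per-partition reads balanced.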
The odbc2parquet command line tool might also be helpful in some situations.
```bash
odbc2parquet \
  -vvv \                                      # Log output, good to know it is still doing something during large downloads
  query \                                     # Subcommand for accessing data and storing it
  --connection-string ${ODBC_CONNECTION_STRING} \
  --batch-size 100000 \                       # Batch size in rows
  --batches-per-file 100 \                    # Omit to store entire query in a single file
  out.par \                                   # Path to output parquet file
  "SELECT * FROM YourTable"
```
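Whichever route you take, it is worth sanity-checking the output before relying on it. A minimal sketch using pyarrow (assuming it is installed; `out.par` is the output path from the example above):

```python
import pyarrow.parquet as pq

# Inspect the Parquet file's metadata without loading the data into memory.
pf = pq.ParquetFile("out.par")
print(pf.metadata.num_rows)   # total rows written -- compare against COUNT(*) on the source table
print(pf.schema_arrow)        # column names and Arrow types inferred from the SQL types
```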