Spark pulling data into RDD or dataframe or dataset


Regarding 1

Spark operates on distributed data structures such as RDD and Dataset (and DataFrame, which since 2.0 is simply an alias for Dataset[Row]). Here are the facts you should know about these data structures to answer your question:

  1. All transformation operations (map, filter, etc.) are lazy. This means no reading is performed until you require a concrete result of your operations (an action such as reduce, fold, or saving the result to a file).
  2. When processing a file on HDFS, Spark operates with file partitions. A partition is the minimal logical batch of data that can be processed. Normally one partition corresponds to one HDFS block, and the total number of partitions can never be less than the number of blocks in the file. The common (and default) HDFS block size is 128 MB.
  3. All actual computation (including reading from HDFS) on RDDs and Datasets is performed inside the executors, never on the driver. The driver builds the DAG and the logical execution plan and assigns tasks to the executors for processing.
  4. Each executor runs its assigned task against a particular partition of data. So normally, if you allocate only one core to an executor, it processes no more than 128 MB (the default HDFS block size) of data at a time.

So basically, when you invoke sc.textFile, no actual reading happens. The facts above explain why an OOM does not occur even while processing 20 TB of data.
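A minimal spark-shell style sketch of this (the HDFS path below is just a placeholder):

```scala
// sc is the SparkContext that spark-shell creates for you
val lines = sc.textFile("hdfs:///data/huge.txt")   // lazy: nothing is read yet

val errors = lines.filter(_.contains("ERROR"))     // still lazy: only extends the DAG

// roughly one partition per HDFS block (128 MB by default)
println(errors.getNumPartitions)

// only this action makes the executors read and process their partitions
println(errors.count())
```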

There are some special cases, such as join operations. But even then, the executors flush their intermediate (shuffle) results to local disk for further processing.
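For instance, a plain RDD join (again with placeholder paths) triggers a shuffle, and the shuffle data is spilled to the executors' local disks rather than kept entirely in memory:

```scala
// key both datasets by their first CSV column (placeholder paths)
val left  = sc.textFile("hdfs:///data/left.csv").map(l => (l.split(",")(0), l))
val right = sc.textFile("hdfs:///data/right.csv").map(l => (l.split(",")(0), l))

// the join shuffles data between executors; intermediate shuffle files
// are written to local disk (under spark.local.dir), not held fully in memory
val joined = left.join(right)   // still lazy until an action
println(joined.count())
```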

Regarding 2

In the case of JDBC, you decide how many partitions your table will have and choose an appropriate partition column that splits the data into partitions evenly. It is up to you how much data is loaded into memory at the same time.
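Here is a sketch of how that looks with the DataFrameReader.jdbc API, assuming spark is the SparkSession that spark-shell 2.x gives you (the URL, table name, credentials and bounds are placeholders):

```scala
import java.util.Properties

val props = new Properties()
props.setProperty("user", "dbuser")       // placeholder credentials
props.setProperty("password", "secret")

// Spark issues numPartitions parallel queries, each covering a slice of the
// partition column's [lowerBound, upperBound] range, so each executor only
// pulls its own slice of the table into memory.
val df = spark.read.jdbc(
  "jdbc:postgresql://dbhost:5432/mydb",   // JDBC URL (placeholder)
  "big_table",                            // table name (placeholder)
  "id",                                   // numeric column used as the partition key
  1L,                                     // lower bound of the key range
  10000000L,                              // upper bound of the key range
  100,                                    // number of partitions / parallel queries
  props
)
println(df.rdd.getNumPartitions)          // 100
```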

Regarding 3

The block size of a local file is controlled by the fs.local.block.size property (32 MB by default, if I recall correctly). So it is basically the same as case 1 (an HDFS file), except that you will read all the data from one machine and one physical disk drive, which is extremely inefficient for a 20 TB file.
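If you do go this route, a small sketch (placeholder path) shows that you can still force more partitions than the local block size alone would give you:

```scala
// file:// paths must be readable at the same location on every worker node
// (or you run in local mode); the second argument is minPartitions
val local = sc.textFile("file:///data/huge.txt", 160)
println(local.getNumPartitions)
```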