
U-SQL Output in Azure Data Lake


This is our top ask (and has been asked on Stack Overflow before, too :). We are currently working on it and hope to have it available by summer.

Until then you have to write a script generator. I tend to use U-SQL to generate the script, but you could do it with PowerShell, T4, etc.

Here is an example:

Let's assume you want to write one file per value of the name column in the following table/rowset @x:

name | value1 | value2
-----+--------+-------
A    | 10     | 20
A    | 11     | 21
B    | 10     | 30
B    | 100    | 200

You would write a script to generate the script like the following:

@x = SELECT * FROM (VALUES ("A", 10, 20), ("A", 11, 21), ("B", 10, 30), ("B", 100, 200)) AS T(name, value1, value2);

// Generate the script to do partitioned output based on the name column:
@stmts =
    SELECT "OUTPUT (SELECT value1, value2 FROM @x WHERE name == \"" + name + "\") TO \"/output/" + name + ".csv\" USING Outputters.Csv();" AS output
    FROM (SELECT DISTINCT name FROM @x) AS x;

OUTPUT @stmts TO "/output/genscript.usql" USING Outputters.Text(delimiter : ' ', quoting : false);

Then you take genscript.usql, prepend the calculation of @x, and submit it to get the data partitioned into the two files.
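For illustration, the resulting genscript.usql (with the calculation of @x prepended) would look roughly like this; the two OUTPUT statements correspond to the distinct name values A and B from the rowset above:

```usql
// Prepended calculation of @x (copied from the generator script):
@x = SELECT * FROM (VALUES ("A", 10, 20), ("A", 11, 21), ("B", 10, 30), ("B", 100, 200)) AS T(name, value1, value2);

// Generated statements, one per distinct name value:
OUTPUT (SELECT value1, value2 FROM @x WHERE name == "A") TO "/output/A.csv" USING Outputters.Csv();
OUTPUT (SELECT value1, value2 FROM @x WHERE name == "B") TO "/output/B.csv" USING Outputters.Csv();
```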


There's a new feature in public preview:

SET @@FeaturePreviews = "DataPartitionedOutput:on";

You can add it at the beginning of the script, and the output data can be partitioned by the key you choose:

SET @@FeaturePreviews = "DataPartitionedOutput:on";

@result =
    SELECT date.Hour AS hour,
           COUNT( * ) AS total,
           ANY_VALUE(message) AS sample_message,
           ANY_VALUE(stack_trace) AS sample_trace
    FROM @data
    WHERE date.Year == 2018
          AND date.Month == 7
    GROUP BY date.Hour
    HAVING COUNT( * ) > 80;

OUTPUT @result
TO "/output/parquetfiles-{hour}.csv"
ORDER BY hour DESC,
         total DESC
USING Outputters.Csv(outputHeader : true);
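Applied to the @x rowset from the earlier answer, the same preview feature removes the need for a script generator entirely. A sketch (assuming the feature flag is still in preview on your account):

```usql
SET @@FeaturePreviews = "DataPartitionedOutput:on";

@x = SELECT * FROM (VALUES ("A", 10, 20), ("A", 11, 21), ("B", 10, 30), ("B", 100, 200)) AS T(name, value1, value2);

// The {name} token in the path partitions the output by the name column,
// producing one file per distinct value, e.g. /output/A.csv and /output/B.csv.
OUTPUT @x
TO "/output/{name}.csv"
USING Outputters.Csv(outputHeader : true);
```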

Another example can be found in the article
Process more files than ever and use Parquet with Azure Data Lake Analytics,
in the section "Putting it all together in a simple end-to-end example".


Great question! I'll be interested to see what Mr Rys responds with.

Apologies, but this is only half an answer.

My first thought is to partition an ADL table using your key value. But then I'm not sure how you'd deal with the separate outputs if a potential WHERE clause isn't deterministic. Maybe CROSS JOIN in every result and... pass!

It would be nice to have a WHILE loop with some dynamic code!

Just as an FYI, check out this post on the MS forums that talks about dynamic input datasets:

https://social.msdn.microsoft.com/Forums/en-US/aa475035-2d57-49b8-bdff-9cccc9c8b48f/usql-loading-a-dynamic-set-of-files?forum=AzureDataLake