How can I quickly create large (>1GB) text+binary files with "natural" content? (C#)

For text, you could use the Stack Overflow community dump; there's about 300 MB of data there. It only takes about six minutes to load into a database with the app I wrote, and probably about the same to dump all the posts back out to text files. That would easily give you anywhere between 200K and 1 million text files, depending on your approach (with the added bonus of having source code and XML mixed in).
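A rough sketch of the dump-to-files step, assuming you've loaded the posts into SQL Server; the connection string and the `Posts` table with `Id`/`Body` columns are placeholders, not the actual dump schema:

```csharp
using System.Data.SqlClient;
using System.IO;

class DumpPosts
{
    static void Main()
    {
        Directory.CreateDirectory(@"C:\dump"); // output directory is a placeholder

        // Hypothetical connection string and schema -- adjust to however you
        // actually loaded the dump.
        using (var conn = new SqlConnection(
            "Data Source=.;Initial Catalog=SODump;Integrated Security=True"))
        {
            conn.Open();
            var cmd = new SqlCommand("SELECT Id, Body FROM Posts", conn);
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // One file per post; the bodies keep their HTML/code mix.
                    File.WriteAllText(
                        Path.Combine(@"C:\dump", reader.GetInt32(0) + ".txt"),
                        reader.GetString(1));
                }
            }
        }
    }
}
```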

You could also use something like the Wikipedia dump; it seems to ship in MySQL format, which would make it super easy to work with.

If you are looking for a big file that you can split up for binary purposes, you could use either a VM's VMDK disk image or a locally ripped DVD; a sketch of the splitting step follows.
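The splitting itself is straightforward. Here's a minimal sketch that carves an arbitrarily large file into fixed-size chunks; the paths and the chunk size are placeholders:

```csharp
using System;
using System.IO;

class FileSplitter
{
    // Splits a large source file (e.g. a VMDK or DVD image) into
    // fixed-size binary chunks.
    static void Split(string sourcePath, string outputDir, long chunkSize)
    {
        Directory.CreateDirectory(outputDir);
        var buffer = new byte[1 << 20]; // 1 MB copy buffer

        using (var input = File.OpenRead(sourcePath))
        {
            int part = 0;
            while (input.Position < input.Length)
            {
                string partPath = Path.Combine(outputDir, "part" + part++ + ".bin");
                using (var output = File.Create(partPath))
                {
                    long written = 0;
                    while (written < chunkSize)
                    {
                        int toRead = (int)Math.Min(buffer.Length, chunkSize - written);
                        int read = input.Read(buffer, 0, toRead);
                        if (read == 0) break; // end of source file
                        output.Write(buffer, 0, read);
                        written += read;
                    }
                }
            }
        }
    }

    static void Main()
    {
        // Example: 100 MB chunks from a hypothetical disk image.
        Split(@"C:\images\disk.vmdk", @"C:\chunks", 100L * 1024 * 1024);
    }
}
```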

Edit

Mark mentions the Project Gutenberg download; this is also a really good source for text (and audio), available for download via BitTorrent.


You could always code yourself a little web crawler...
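Something along these lines would do; the seed URL, output directory, and naive regex link extraction are all placeholders, and a real crawler would want politeness delays and robots.txt handling:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class TinyCrawler
{
    static void Main()
    {
        Directory.CreateDirectory(@"C:\crawl"); // output directory is a placeholder

        var queue = new Queue<string>();
        queue.Enqueue("http://example.com/"); // hypothetical seed URL
        var seen = new HashSet<string>();
        int count = 0;

        using (var client = new WebClient())
        {
            while (queue.Count > 0 && count < 1000)
            {
                string url = queue.Dequeue();
                if (!seen.Add(url)) continue; // already visited

                string html;
                try { html = client.DownloadString(url); }
                catch (WebException) { continue; } // skip dead links

                File.WriteAllText(Path.Combine(@"C:\crawl", count++ + ".html"), html);

                // Naive href extraction; good enough for bulk test data.
                foreach (Match m in Regex.Matches(html, "href=\"(http[^\"]+)\""))
                    queue.Enqueue(m.Groups[1].Value);
            }
        }
    }
}
```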

UPDATE: Calm down guys, this would be a good answer if he hadn't said that he already had a solution that "takes too long".

A quick check of typical download speeds would appear to indicate that downloading 8 GB of anything would take a relatively long time.


I think you might be looking for something like a Markov chain to generate this data. It's stochastic (randomised), yet structured, in that it operates based on a finite state machine.

Indeed, Markov chains have been used for generating semi-realistic-looking text in human languages. In general, they are not trivial things to analyse properly, but the fact that they exhibit certain statistical properties should be good enough for you. (Again, see the Properties of Markov chains section of the page.)

To implement, though, the concept is actually quite simple. Your best bet will probably be to create a framework for a generic Markov process and then analyse either natural language or source code (whichever you want your random data to emulate) in order to "train" your Markov process. In the end, this should give you very high quality data in terms of your requirements, and it's well worth the effort if you need these enormous lengths of test data.
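To make that concrete, here is a minimal sketch of an order-2, word-level Markov generator; the corpus and output paths are placeholders, and a production version would want smarter tokenisation and streaming output:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class MarkovTextGenerator
{
    // Maps a two-word state to the list of words observed after it.
    readonly Dictionary<string, List<string>> chain =
        new Dictionary<string, List<string>>();
    readonly Random rng = new Random();

    public void Train(string corpus)
    {
        string[] words = corpus.Split(new[] { ' ', '\r', '\n', '\t' },
                                      StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i + 2 < words.Length; i++)
        {
            string state = words[i] + " " + words[i + 1];
            List<string> followers;
            if (!chain.TryGetValue(state, out followers))
                chain[state] = followers = new List<string>();
            followers.Add(words[i + 2]);
        }
    }

    public string Generate(int wordCount)
    {
        // Pick a random starting state.
        var states = new List<string>(chain.Keys);
        string current = states[rng.Next(states.Count)];
        var sb = new StringBuilder(current);

        for (int i = 0; i < wordCount; i++)
        {
            List<string> followers;
            if (!chain.TryGetValue(current, out followers)) break; // dead end
            string next = followers[rng.Next(followers.Count)];
            sb.Append(' ').Append(next);
            // Slide the two-word window forward.
            current = current.Substring(current.IndexOf(' ') + 1) + " " + next;
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Corpus path is a placeholder; train on whatever you want to emulate.
        var gen = new MarkovTextGenerator();
        gen.Train(File.ReadAllText(@"C:\corpus.txt"));
        File.WriteAllText(@"C:\output.txt", gen.Generate(1000000)); // several MB
    }
}
```

Train it on natural language and you get plausible-looking prose; train it on source code and you get something that superficially resembles a program, which is exactly the kind of "natural" structure plain random bytes lack.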