Documentation is not something people often spend time reading, and when they do, it's usually to find the one thing they're after and get out as quickly as possible, much like how I do my Christmas shopping. Sometimes it's worth spending the time to read it properly though, as there can be useful bits of information hidden in summary descriptions, links and so on.
One such case is the Azure Data Lake Store client. If you find yourself reading or writing a lot of files, and you're doing it in multiple tasks (or threads, but you should be using Tasks if possible), then reading the docs can really help you out. Take, for instance, this snippet from the description at the top of the documentation page:
"If an application wants to perform multi-threaded operations using this SDK it is highly recommended to set ServicePointManager.DefaultConnectionLimit to the number of threads application wants the sdk to use before creating any instance of AdlsClient. By default ServicePointManager.DefaultConnectionLimit is set to 2."
Okay, so how bad can things be if you don't read this? To answer that, I created an ADLS instance and uploaded a number of small parquet files. I then wrote an application that reads each file (using the excellent Parquet .NET) and returns the number of records in it. Each file is processed in its own Task, and every Task uses the same AdlsClient instance.
The process being followed here is simple: get a list of files, call "ProcessPath" on each and then, once all the files have been processed, output the results.
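As a rough illustration of that shape, here is a minimal sketch of the fan-out pattern using Task.WhenAll. It is not the application's actual code: the file paths are made up, and the ProcessPath shown here is a hypothetical stand-in that does placeholder work, where the real one would open the file through the shared AdlsClient and count the rows with Parquet .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class FanOutSketch
{
    // Hypothetical stand-in for ProcessPath. The real version would read the
    // file via the shared AdlsClient and count records with Parquet .NET.
    public static Task<long> ProcessPath(string path)
    {
        // Placeholder work only: this is NOT a real record count.
        return Task.Run(() => (long)path.Length);
    }

    // Starts one task per path, waits for all of them, and returns the
    // result of each keyed by its path. Paths are assumed to be unique.
    public static async Task<Dictionary<string, long>> ProcessAll(
        IEnumerable<string> paths,
        Func<string, Task<long>> processPath)
    {
        // ToList() forces every task to start before we begin waiting.
        var work = paths
            .Select(async p => new KeyValuePair<string, long>(p, await processPath(p)))
            .ToList();

        KeyValuePair<string, long>[] results = await Task.WhenAll(work);
        return results.ToDictionary(r => r.Key, r => r.Value);
    }

    public static async Task Main()
    {
        // Made-up paths; the real list would come from the data lake.
        var paths = new[] { "/data/a.parquet", "/data/b.parquet" };

        Dictionary<string, long> counts = await ProcessAll(paths, ProcessPath);

        foreach (var entry in counts)
        {
            Console.WriteLine($"{entry.Key}: {entry.Value}");
        }
    }
}
```

Passing the per-file work in as a delegate keeps the sketch independent of any particular SDK version, so the same structure applies whatever the real ProcessPath does.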
The output of this initial version is as follows:
It's not too bad, but with multiple tasks I would have expected it to be better. The documentation snippet above suggests we need to change the ServicePointManager.DefaultConnectionLimit value, but what to? After some digging around I came across a suggestion from Microsoft Support which, for ASP.NET, is to limit the number of requests that can execute at the same time to 12 per CPU (or 12 per core). So let's give that a go and see what happens.
The code change for this is pretty simple and we can use System.Environment to get the number of processors available.
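A minimal sketch of that change might look like the following. The 12-per-core multiplier is simply the ASP.NET guidance mentioned above, used here as a starting point rather than a tuned value, and the class name is made up for the example. The important detail from the documentation is the ordering: the limit has to be set before any AdlsClient instance is created.

```csharp
using System;
using System.Net;

public static class ConnectionLimitSketch
{
    public static void Main()
    {
        // 12 connections per logical processor, following the ASP.NET
        // guidance; this is an assumption to tune, not a recommended value.
        ServicePointManager.DefaultConnectionLimit = 12 * Environment.ProcessorCount;

        Console.WriteLine(
            $"DefaultConnectionLimit: {ServicePointManager.DefaultConnectionLimit}");

        // Create the AdlsClient only after this point, so that the new
        // limit is in place before the client opens any connections.
    }
}
```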
So does it make much of a difference?
Well, yes, quite a lot of difference actually. I ran both variations of the code a few more times to check that it wasn't down to intermittent networking issues, other processes on my laptop interfering and so on, but no, it really does make that much of a difference.
So next time you're working with multiple tasks sharing resources, maybe spend a bit of time reading the documentation to see if there's anything which can make a difference to your application.