ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files #10266
amol- wants to merge 8 commits into apache:master from amol-:ARROW-12650
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
docs/source/python/memory.rst
Outdated
This is rather misleading. The data is loaded back to memory when it is being read. It's just that it's read lazily, so the costs are not paid up front (and the cost is not paid for data that is not accessed).
What memory mapping can avoid is an intermediate copy when reading the data. So it is more performant in that sense.
I see what you mean. What I was trying to say is that Arrow doesn't have to allocate memory itself, as it can point directly to the memory-mapped buffer, whose allocation is managed by the system. Also, the memory-mapped buffer can be paged out more easily by the system without write-back cost, as it's not flagged as dirty memory, thus allowing Arrow to deal with files bigger than memory even in the absence of a swap file. I'll try to rephrase this in a less misleading way.
I rephrased it to make it clearer that in absolute terms you won't be consuming less memory, but the system will be able to page it out more easily.
Would memory mapping be more efficient than a system with swap enabled? You mention that there are potential write-back savings, but why would the page be flagged as dirty in a swap scenario? In either case it seems we are talking about read-only access to a larger-than-physical-memory file.
There are benefits because, if you are only reading the data (say, to compute means or whatever on it), and you are using memory mapping and have to read more data than fits in your memory, the kernel can swap out the pages no longer in use without the cost of writing them to swap: they are already available in the file that was memory mapped, and thus can be paged back in directly from the memory mapping.
On the other hand, if you were relying on swap and had read the file normally, when the data you have to read doesn't fit into memory, the kernel has to incur the cost of writing it to swap; otherwise it would be unable to page it out, as there would be no copy (as far as the memory manager is concerned) that would allow paging it back in.
So memory mapping avoids the cost of writing to the swap file when you are exhausting memory.
docs/source/python/memory.rst
Outdated
I don't know which users we target here, but "page in" and "page out" are not commonly understood, I think. (Of course, we can't start explaining in detail how memory works here, but I think this section will typically be read by people who might not fully understand what memory mapping is or how it works, but just think they can use it to avoid memory allocation.)
I rephrased it this way mostly because there were some concerns in previous comments about the use of the "avoid memory allocation" wording, as the memory is getting allocated anyway; it can just be swapped out at any time without any additional write-back cost, and thus you can avoid OOMs even if you exhausted memory or swap space.
|
@jorisvandenbossche @pitrou I think I did my best to address the remaining comments, could you take another look? :D
pitrou left a comment
This looks much better to me, thank you.
For example to write an array of 10M integers, we could write it in 1000 chunks
of 10000 entries:

.. ipython:: python
I'm still lukewarm about using ipython blocks here.
I'm not fond of the ipython directive either, but we have a dedicated Jira issue ( https://issues.apache.org/jira/browse/ARROW-13159 ); for now I adhered to what seemed to be the practice in the rest of that file.