TABLE OF CONTENTS SharePoint vision | Content repository | Content dilemmas | Storage deep dive | Business goa...
So far in this excerpt, I’ve touched briefly on how SharePoint stores its data, but if we’re going to make any headway in optimizing our SharePoint storage, we need to understand that storage mechanism in much greater detail. Here’s a deep dive on how SharePoint storage works.
How SQL Server stores data
SQL Server consists of a service, which opens database files on disk. You can think of the database file as a kind of proprietary second-level storage system, meaning it is a way for SQL Server to organize data in a manner that facilitates SQL Server’s job and performance goals. The database itself sits on a disk drive, and access to the file is made through Windows’ own file systems. Figure 1.6 outlines the high-level structure.
Figure 1.6: High-level SQL Server storage.
As I already described, SQL Server stores data in 8KB chunks called pages. That is strictly a SQL Server data-management paradigm; the actual data is still written to the disk in the form of disk blocks, which are usually smaller than 8KB. For example, if the drive was formatted to use 1KB disk blocks, SQL Server would be writing eight of those blocks to the file system each time it saved a page to the database. Figure 1.7 illustrates this deeper-level look at the storage architecture.
Figure 1.7: Storing data in pages and disk blocks.
This form of storage can have ever-deeper impacts on SQL Server’s performance. As I already explained, SQL Server generally requires that a single row of database data live within a single 8KB page; smaller rows can share a page. Because SQL Server reads data in 8KB chunks, storing more rows per page means that SQL Server can read more data in a single operation. Conversely, a data row occupying 4.1KB can result in a lot of wasted disk throughput because SQL Server can only fit a single such page on an 8KB row but must read and write that entire 8KB, even though only slightly more than half of the 8KB actually consists of data.
- Smaller disk blocks mean less wasted space but require more work for Windows to read and write when large, multi-block files are involved.
- Larger disk blocks mean the potential for more wasted space but allow Windows to read and write larger files in fewer discrete operations.
For a volume containing SQL Server databases, it’s almost ideal to use an 8KB cluster size, as this aligns SQL Server’s own storage with the file system’s smallest unit of work. Some experts recommend a larger cluster size of 64KB, meaning every file system-level disk read will pick up eight SQL Server pages.
The point is that the Windows file system is really good at reading large blocks of disk space into memory, and at writing changes to large blocks of disk space. You have a lot of flexibility to optimize the file system’s behavior in this regard, depending on the size of files you’re working with. Really, the high-level lesson is that the file system is very good at storing files. It does not necessarily need another layer added atop it if all you’re doing is storing large items like file attachments.
If you’re going to be storing relational data, such as an actual database, a different means of organizing that data can vastly improve performance -- which is why SQL Server has its own database structure rather than just storing data in tiny little files all over the disk. But that doesn’t mean an intermediate storage layer like SQL Server is always going to offer the best performance.
SharePoint: All about the BLOBs
When SQL Server needs to store a large mass of data -- such as a file attachment -- it does so in the form of a BLOB. BLOBs consist of a link, or pointer, within the actual row data. That link or pointer then connects to one or more 8KB pages that store the actual BLOB data. So, in SharePoint, suppose that a single document entry takes up a single 8KB page for the entry itself. If a document entry includes a 50MB PowerPoint file attachment, the total entry will consist of more than 6000 pages in the database.
SQL Server will never need to read a portion of those 6000 pages. When you retrieve the PowerPoint file, you’re going to retrieve all of it. That means those 6000 pages will only ever be read sequentially, all at once. The problem is that they might not be stored sequentially. SQL Server writes pages to the database beginning in the first available spot, so those 6000 pages may require SQL Server to jump around a bit, finding free spaces for all of them.
The result is a fragmented database, meaning SQL Server will need to jump around within the database file to re-assemble that PowerPoint attachment. Further, the actual disk blocks storing those pages might not be contiguous (and that’s often the case when a database grows beyond its initial size), so the operating system (OS) may have to jump around quite a bit to piece together those 8KB pages. All that disk I/O, at both the database and file system level, can slow SQL Server a bit. Although SQL Server is designed to perform this kind of operation, when it has to do so for hundreds or thousands of users at once, its performance definitely sees an impact.
The fact is that most of SharePoint’s storage will be given over to BLOBs. If you import 5TB of files, you’re going to be using 5TB of BLOB-style storage (potentially more, actually, because having to store the data in 8KB chunks will usually result in some “wasted space” at the end of each BLOB sequence). You’ll also be adding the SharePoint data entries to keep track of those files, but as a percentage of the total data, those entries are negligible. In most SharePoint installations, upwards of 90% of your database size will be given over to BLOBs, so figuring out how to optimize that storage can have a significant impact.
Why BLOBs are bad for databases
Keep in mind that SQL Server’s main design point is to read and write single pages of data, or at the worst, a few pages of data, at once. Most SharePoint operations -- updating a document’s permissions, or changing its name -- require that very few database pages be modified, and SQL Server does a great job of it. When SQL Server has to start behaving like a file system, however, it’s not working in its best “comfort zone,” and so you can start to see performance differences.
In fact, when it comes to dealing with large, sequential data structures -- like file attachments -- SQL Server is essentially adding an unnecessary second layer to the equation. Because those file attachment BLOBs aren’t the kind of structured, relational data that SQL Server is designed for, SQL Server really is taking on some of the attributes of a file system -- but it’s also sitting on top of a file system.
One trick, then, is to offload the BLOBs to the file system, which excels at moving huge lumps of data from place to place. The file system is already involved in SQL Server’s BLOB manipulation, so taking SQL Server “out of the stack” can help improve performance.
Why we put BLOBs into SharePoint
Let’s consider every file in a SharePoint document library to consist of two main parts: The file metadata and the file attachment itself. The metadata consists of things like keywords, permissions, update dates and times, and so forth; the file itself is the actual file data stored in a Word file, a PowerPoint file, or whatever.
The metadata provides most of SharePoint’s features, allowing it to coordinate workflows, maintain version information, send alerts, and so forth. But SharePoint also needs access to the actual file because its search indexing engine wants to open those files and scan them for keywords. By doing so, it builds a search index, which is employed to help users quickly locate files by using keywords. So although it is technically possible to separate the metadata from the file itself, it isn’t desirable to do so unless that separation can be done in a way that still provides SharePoint’s indexing engine access to the file.
In fact, Microsoft’s official BLOB-offloading technologies seek to do just that. They essentially wedge themselves into SQL Server’s brain so that when SQL Server needs to store or retrieve a BLOB, the BLOB data goes elsewhere rather than into the database itself. Because this “wedge” occurs within SQL Server, applications -- like SharePoint -- don’t need to realize that it’s happening. They “think” the BLOBs live in the database, and SQL Server makes the BLOBs available as if they were in the database, so SharePoint can continue working with all the files as if they were in the database -- even though they’re not.
But that’s not necessarily the only approach to reducing the size of the SharePoint database and improving SharePoint’s database performance. In fact, one reason BLOB offloading isn’t the “perfect solution” is because it still requires that content be migrated into SharePoint to begin with -- and that migration project might be one that you want to avoid, if possible.
|This chapter is an excerpt from the book, Intelligently Reducing SharePoint Costs Through Storage Optimization, authored by Don Jones, and published by Realtime Publishers, November 2010, ISBN 978-1-935581-25-3, Copyright 2010 by Realtime Publishers. Download the complete book for free at Realtime Nexus Digital Library.|