Chapter 8: Storage

CT 320: Network and System Administration

Colorado State University

Computer Science Department

Original slides from Dr. James Walden at Northern Kentucky University.

Topics

Disk interfaces
Disk components
Performance
Reliability
RAID
Adding a disk
Logical volumes
Filesystems

Disk Interfaces

SCSI
- Standard interface for servers
IDE, EIDE
- Historical interface for PCs
SATA
- Serial ATA standard on PCs
Fibre Channel
- High bandwidth, SCSI or ATA
USB
- Fast enough for slow devices on PCs

SCSI

Small Computer Systems Interface
- Pronounced “scuzzy”
- Fast, reliable, expensive
A bus, not a simple PC-to-device interface
- Each device has a target ID from 0–7 (narrow bus) or 0–15 (wide bus).
- Devices can communicate directly without CPU
Many versions
- Original: SCSI-1 (1986) — 5 MB/s
- Current: SCSI-3 (2003) — 640 MB/s
Serial Attached SCSI (SAS)
- Up to 128 devices
- Up to 750 MB/s full duplex

IDE

Integrated Drive Electronics / AT attachment
- Slower, less reliable, cheap
- Only allows 2 devices per interface
- ATAPI standard added removable devices
Many versions
- Original: IDE / ATA (1984) — 16.7 MB/s
- Current: Ultra-ATA/167 (2010) — 167 MB/s
Serial ATA
- One device per port (up to 15 with a port multiplier)
- Original: SATA Revision 1.0 — 150 MB/s
- Current: SATA Revision 3.0 — 600 MB/s

SATA vs. SCSI

SCSI offers better performance/scale
- Faster bus
- Faster hard drives (up to 15,000 rpm)
- Lower CPU usage
- Better handling of multiple requests
SATA often best for workstations
Convergence
- SATA2 and SAS converging on a single standard

Hard Drive Components

Actuator
- Moves arm across disk to read/write data.
- Arm has multiple read/write heads (often 2 per platter).
Platters
- Rigid substrate material
- Thin magnetic material coating stores data
- Coating type determines density: Gbits/in²
Spindle Motor
- Spins platters from 3600–15,000 rpm
- Speed determines disk latency
Cache
- 2–16 MB of cache memory
- Reliability: write-back vs. write-through

Disk Information: hdparm

    # hdparm -i /dev/hde
    /dev/hde:
     Model=WDC WD1200JB-00CRA1, FwRev=17.07W17, SerialNo=WDWMA8C4533667
     Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
     RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
     BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=off
     CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648
     IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
     PIO modes: pio0 pio1 pio2 pio3 pio4
     DMA modes: mdma0 mdma1 mdma2
     UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
     AdvancedPM=no WriteCache=enabled
     Drive conforms to: device does not report version:
     * signifies the current active mode

Disk Performance

Seek Time
- Time to move head to desired track (3–8 ms)
Rotational Delay
- Time until head is over desired block (4.2 ms average, 8.3 ms worst case at 7200 rpm)
Latency
- Seek Time + Rotational Delay
Throughput
- Data transfer rate (up to 600 MB/s)
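
The rotational delay follows from spindle speed: one revolution at 7200 rpm takes 60000/7200 ≈ 8.3 ms, and on average the head waits half a revolution. A minimal sketch of that arithmetic, assuming awk is available; the function names are illustrative, not part of any tool:

```shell
# Average rotational delay in ms: half of one revolution.
# $1 = spindle speed in rpm
avg_rotational_delay_ms() {
    awk -v rpm="$1" 'BEGIN { printf "%.2f\n", (60000 / rpm) / 2 }'
}

# Latency as defined on this slide: seek time + rotational delay.
# $1 = seek time in ms, $2 = rotational delay in ms
disk_latency_ms() {
    awk -v seek="$1" -v rot="$2" 'BEGIN { printf "%.2f\n", seek + rot }'
}

avg_rotational_delay_ms 7200    # prints 4.17
avg_rotational_delay_ms 15000   # prints 2.00
disk_latency_ms 5 4.17          # prints 9.17
```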

Latency vs. Throughput

Which is more important?
- Depends on the type of application
Sequential access (Throughput)
- Multimedia on a single workstation
Random access (Latency)
- Content on most web servers
How to improve performance
- Faster disks
- Disk Caching
- More spindles (disks)
- More disk controllers
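
For random access, a rough upper bound on requests per second is one request per latency period, multiplied by the number of spindles working in parallel. The sketch below is a back-of-the-envelope estimate under that assumption (it ignores caching and queueing); `random_iops` is a hypothetical helper name:

```shell
# Rough random-access requests per second.
# $1 = latency per request in ms, $2 = number of spindles
random_iops() {
    awk -v lat="$1" -v n="$2" 'BEGIN { printf "%d\n", int(1000 / lat * n) }'
}

random_iops 10 1   # prints 100
random_iops 10 4   # prints 400
```

This is why adding spindles helps a latency-bound workload even when each disk is no faster.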

Disk Performance: hdparm

    # hdparm -tT /dev/hde
    /dev/hde:
     Timing cached reads:   876 MB in 2.00 seconds = 437.41 MB/sec
     Timing buffered disk reads:   88 MB in 3.08 seconds = 28.60 MB/sec

Reliability

MTBF
- Mean Time Between Failures (>100,000 hours)
Real failure curves
- Early phase: high failure rate from defects
- Constant failure rate phase: MTBF valid
- Wearout phase: high failure rate from wear
Failures more likely on traumatic events
- Power on/off
Systems often wear out before MTBF
- However, disk drives still crash!
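
To put an MTBF figure in perspective, it helps to convert hours of continuous operation into years. A small sketch, assuming 24 × 365 hours per year; `mtbf_years` is an illustrative name:

```shell
# Convert an MTBF in hours to years of continuous operation.
mtbf_years() {
    awk -v h="$1" 'BEGIN { printf "%.1f\n", h / (24 * 365) }'
}

mtbf_years 100000   # prints 11.4
```

An MTBF of 100,000 hours is longer than most systems stay in service, yet it is a statistical average over many drives, not a promise for any single disk.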

RAID

Redundant Array of Inexpensive Disks
Redundant Array of Independent Disks
Can be implemented in hardware or software.
Hardware RAID controllers:
- Support caching
- Higher capacity
- Higher reliability
- Better throughput

RAID Levels

RAID 0: Striped evenly for performance
- MTBF = (average MTBF)/# disks
JBOD (Just a Bunch Of Disks): Concatenated for capacity
- Or does it mean no RAIDing at all?
- Only data on bad disk is lost, no performance effect
RAID 1: Mirrored for reliability
- Every write goes to each disk of set
- Seek time halved as reads split between disks
RAID 0 + 1: Striped + mirrored
RAID 5: Striped with parity
- Block striping, not disk striping
- Can lose one disk of set without losing data.
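
Two of the ideas above can be checked with shell arithmetic. The RAID 0 formula divides the per-disk MTBF by the number of disks, and RAID 5 parity is the XOR of the data blocks in a stripe, so any one missing block can be rebuilt by XOR-ing the survivors. This is a toy sketch that treats each block as a single integer; the function names are assumptions for illustration:

```shell
# RAID 0: MTBF = (average MTBF) / number of disks.
# $1 = per-disk MTBF in hours, $2 = number of disks
raid0_mtbf_hours() {
    echo $(( $1 / $2 ))
}

# XOR together all arguments (each one stands in for a block).
xor_blocks() {
    result=0
    for block in "$@"; do
        result=$(( result ^ block ))
    done
    echo "$result"
}

raid0_mtbf_hours 100000 4                     # prints 25000

d1=170; d2=85; d3=15                          # three data blocks
parity=$(xor_blocks "$d1" "$d2" "$d3")        # parity block: 240
rebuilt=$(xor_blocks "$d1" "$d3" "$parity")   # disk holding d2 is lost
echo "parity=$parity rebuilt=$rebuilt"        # prints parity=240 rebuilt=85
```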

Adding a Disk

Install new hardware
- Verify disk recognized by BIOS.
Boot
- Verify device exists in /dev
Partition: fdisk /dev/sdb
Create filesystem: mkfs -v -t ext4 /dev/sdb1
Add to /etc/fstab: /dev/sdb1 /proj ext4 defaults 0 2
Mount all disks: mount -a
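
The filesystem type in the /etc/fstab line has to match what mkfs created, so it can help to generate the line rather than type it. A minimal sketch; `fstab_entry` is a hypothetical helper that only prints the line and does not touch /etc/fstab:

```shell
# Print an fstab line for a data disk.
# $1 = device, $2 = mount point, $3 = filesystem type,
# $4 = fsck pass (defaults to 2, the usual value for non-root filesystems)
fstab_entry() {
    printf '%s %s %s defaults 0 %s\n' "$1" "$2" "$3" "${4:-2}"
}

fstab_entry /dev/sdb1 /proj ext4   # prints /dev/sdb1 /proj ext4 defaults 0 2
```

After reviewing the output, append it to /etc/fstab as root and run mount -a.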

When don’t you need a filesystem?

Swap space

    mkswap -v /dev/sdb1

Server applications (use their own)
- Oracle (OCFS)
- VMware Server (VMFS)

Logical Volumes

What are logical volumes?
- Appear to user as a physical volume.
- But can span multiple partitions and/or disks.
Why logical volumes?
- Aggregate disks for performance/reliability.
- Grow and shrink logical volumes on the fly.
- Move logical volumes between physical devices.
- Replace volumes without interrupting service.

LVM

Logical Volume Manager

LVM Components

Logical Volume Group (LVG)
- Set of physical volumes (partitions or disks).
- May be divided into logical volumes (LVs).
LVs made up of fixed-size logical extents (LEs)
- Each LE is 4 MB by default.
- Physical extents are the same size.
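
Because LVs are allocated in whole extents, the size of an LV translates directly into an extent count. A sketch of the arithmetic, rounding up to a whole extent; `extents_needed` is an illustrative name:

```shell
# Number of extents needed for an LV.
# $1 = LV size in GB, $2 = extent size in MB
extents_needed() {
    echo $(( ($1 * 1024 + $2 - 1) / $2 ))
}

extents_needed 100 4   # prints 25600
```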

Mapping Modes

Linear Mapping
- LVs assigned to contiguous areas of PV space.
Striped Mapping
- LEs interleaved across PVs to improve performance.
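
The two modes differ only in how a logical extent number is turned into a (physical volume, physical extent) pair. The sketch below is a simplified model, assuming equally sized PVs and a stripe unit of one extent (real LVM stripes in smaller chunks); the function names are hypothetical:

```shell
# Linear: fill each PV completely before moving to the next.
# $1 = logical extent index, $2 = extents per PV
linear_map() {
    echo "pv$(( $1 / $2 )) pe$(( $1 % $2 ))"
}

# Striped: hand out extents round-robin across the PVs.
# $1 = logical extent index, $2 = number of PVs
striped_map() {
    echo "pv$(( $1 % $2 )) pe$(( $1 / $2 ))"
}

linear_map 5 4    # prints pv1 pe1
striped_map 5 2   # prints pv1 pe2
```

With striping, consecutive extents land on different disks, so a large sequential read keeps every spindle busy.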

Setting up an LVG and LV

Initialize physical volumes

    # pvcreate /dev/hda1
    # pvcreate /dev/hdb1

Initialize a volume group

    vgcreate nku_proj /dev/hda1 /dev/hdb1

Use vgextend to add more PVs later.

Create logical volumes

    lvcreate -n nku1 --size 100G nku_proj

Create filesystem

    mkfs -v -t ext3 /dev/nku_proj/nku1

Extending an LV

Set absolute size

    lvextend -L120G /dev/nku_proj/nku1

Or set relative size

    lvextend -L+20G /dev/nku_proj/nku1

Expand the filesystem without unmounting

    resize2fs /dev/nku_proj/nku1

Check size

    df -h

Swap

Can use swapfile instead of swap partition

    dd if=/dev/zero of=/swapfile bs=1024k count=512
    chmod 600 /swapfile
    mkswap /swapfile

Enable swap

    swapon /swapfile
    swapon /dev/sda2

Disable swap

    swapoff /swapfile
    swapoff /dev/sda2

Check swap resource usage

    cat /proc/swaps
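
The Size and Used columns of /proc/swaps can be summed to get overall swap utilization. A small sketch, assuming the usual layout (one header line, then Filename, Type, Size, Used, Priority); `swap_used_percent` is a hypothetical helper:

```shell
# Percentage of total swap in use.
# $1 = file in /proc/swaps format (defaults to /proc/swaps)
swap_used_percent() {
    awk 'NR > 1 { size += $3; used += $4 }
         END {
             if (size > 0) printf "%.1f\n", used * 100 / size
             else print "0.0"
         }' "${1:-/proc/swaps}"
}

# Usage on a live system:
#   swap_used_percent
```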

Filesystems

ext2
- Old Linux non-fragmenting fast filesystem
- Can be converted to ext3 by adding a journal:

    tune2fs -j /dev/sda1

ext3
- Journaling “eliminates” need for fsck
ext4
- Current common Linux filesystem
- Big files (16TB)
- Extents (range of contiguous physical blocks)
- 34-bit seconds + nanosecond timestamps
  - Avoids the 2038 rollover of 32-bit timestamps (a bigger problem than Y2K)
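
The effect of the wider seconds field can be estimated with integer arithmetic. The sketch assumes that the extra bits extend a signed 32-bit counter, so the largest positive value is 2^34 - 2^31 seconds after 1970, and it uses an average year of 31,556,952 seconds; `rollover_year` is an illustrative name:

```shell
# Approximate calendar year in which a seconds-since-1970 counter overflows.
# $1 = largest representable number of seconds after the epoch
rollover_year() {
    echo $(( 1970 + $1 / 31556952 ))
}

rollover_year $(( 1 << 31 ))                 # signed 32-bit: prints 2038
rollover_year $(( (1 << 34) - (1 << 31) ))   # 34-bit (assumed layout): prints 2446
```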

Other Filesystems

tmpfs, ramfs: all in memory
vfat, ntfs: Windows
exFAT: flash drive (spreads out the work)
hfs: Mac OS
procfs: /proc
cramfs, squashfs: Read-only compressed file systems
ISO9660: CD-ROM & DVD-ROM disks
UDF: CD-RW & DVD-RW

Mounting

To use a filesystem

    # mount /dev/sda1 /mnt
    # df -h /mnt

Automatic mounting
- Add entry to /etc/fstab
Unmount
- umount /dev/sda1
- Cannot unmount a volume in use.

fstab

    # /etc/fstab: static file system information.
    #
    # <file system> <mount point> <type> <options> <dump> <pass>
    proc      /proc          proc    defaults 0 0
    /dev/hdc1 /              ext3    defaults 0 1
    /dev/hdc5 /win           vfat    user,rw  0 0
    /dev/hdc7 none           swap    sw       0 0
    /dev/hdc8 /var           ext3    defaults 0 2
    /dev/hdc9 /home          ext3    defaults 0 2
    /dev/hda  /media/cdrom0  iso9660 ro,user  0 0
    /dev/fd0  /media/floppy0 auto    rw,user  0 0
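
The last column (pass) controls boot-time checking: 0 means never check, 1 is for the root filesystem, and 2 is for everything else. The sketch below lists the mount points that would be checked, in pass order; `fsck_order` is a hypothetical helper that only reads the file:

```shell
# Print "<pass> <mount point>" for every fstab entry with a nonzero pass,
# sorted by pass number.
# $1 = fstab file (defaults to /etc/fstab)
fsck_order() {
    awk '!/^[[:space:]]*#/ && NF >= 6 && $6 > 0 { print $6, $2 }' \
        "${1:-/etc/fstab}" | sort -k1,1n -k2,2
}

# Usage on a live system:
#   fsck_order
```

For the example file above this would list / first (pass 1), then /home and /var (pass 2).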

/etc/fstab first field

The first field of an /etc/fstab entry can be:

/dev/sda1
- Fails if you add another disk and the order changes.
UUID=77f85028-b4c1-4439-be1c-5a3ba7f59dd1
- Robust, but cryptic. Obtain via blkid.
LABEL=JackHomeDir
- Robust and self-documenting. Obtain via blkid; set via e2label.
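
When auditing an fstab it can be useful to flag entries that still use raw device paths. A minimal sketch that classifies the first field by its prefix; `fstab_spec_kind` and its output labels are made up for illustration:

```shell
# Classify the first field of an fstab entry.
fstab_spec_kind() {
    case "$1" in
        UUID=*)  echo "uuid" ;;
        LABEL=*) echo "label" ;;
        /dev/*)  echo "device-path" ;;
        *)       echo "other" ;;
    esac
}

fstab_spec_kind /dev/sda1           # prints device-path
fstab_spec_kind LABEL=JackHomeDir   # prints label
fstab_spec_kind proc                # prints other
```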

fsck: check + repair fs

Filesystem corruption sources
- Power failure
- System crash
Types of corruption
- Unreferenced inodes.
- Bad superblocks.
- Unused data blocks not recorded in block maps.
- Data blocks listed as free that are used in files.
fsck can fix these and more
- Asks user to make more complex decisions.
- Stores unfixable files in lost+found
  - And where is this lost+found, precisely?

Lots of filesystem flavors

    # cd /sbin

    # ls mkfs*
    mkfs         mkfs.ext2       mkfs.ext4dev  mkfs.minix     mkfs.reiserfs
    mkfs.btrfs   mkfs.ext3       mkfs.fat      mkfs.msdos     mkfs.vfat
    mkfs.cramfs  mkfs.ext4       mkfs.hfsplus  mkfs.ntfs      mkfs.xfs

    # ls fsck*
    fsck         fsck.ext3       fsck.fat      fsck.minix     fsck.reiserfs
    fsck.btrfs   fsck.ext4       fsck.hfs      fsck.msdos     fsck.vfat
    fsck.cramfs  fsck.ext4dev    fsck.hfsplus  fsck.ntfs      fsck.xfs
    fsck.ext2

    # ls mount*
    mount.cifs   mount.glusterfs mount.nfs     mount.ntfs     mount.ntfs-fuse
    mount.fuse   mount.lowntfs-3 mount.nfs4    mount.ntfs-3g  mountstats

Fall 2015