Friday, 14 July 2017

The Big Data DBA

Many IT enterprises are starting pilot projects to implement big data solutions. As a DBA, are you ready to support these efforts, and integrate them into your current architecture, processes, and standards?


Where Are You Now?

Many DBAs consider themselves to be supporting a big data environment already. Table sizes in the hundred-gigabyte range are common and terabyte tables are no longer rare, especially in an enterprise data warehouse. Yet most of the current Big Data hype concerns something new: a combination of new data structures (or lack of structures), dissimilar architectures and the requirement for data integration.

What Was Big Data?

When a Big Data problem appears, we tend to think of it as a “scaling up” issue, and professionals have dealt with problems of that kind for centuries. Two examples are the supply and logistics for armies (think of Napoleon invading Russia with hundreds of thousands of troops) and the management of a vast political empire (England in the 16th century through the early 20th century).

My favorite example of Big Data is the Arsenal of Venice.

In the 13th century the Arsenal was one of the primary shipbuilding centers in the world. It became the largest in about 1320 after an expansion. Workmen specialized in only a few tasks, worked in assembly-line fashion, and were paid based on their work output (foreshadowing the way Henry Ford would design automotive assembly lines). Supervisors used an early form of performance appraisal to rate workers, and multiple forms of double-entry bookkeeping to track work-in-progress through the Arsenal.

Scaling up meant acquiring special hardware and staff. For the Arsenal, hardware meant special water locks so that ships in the process of being built could move from one location to the next using waterways. It also meant more shipwrights, more wagons to deliver materials such as wood and rope, and of course more paper.

The Arsenal used new processes, specialized workers, and special-purpose hardware. It was capable of producing and outfitting fully equipped merchant or naval vessels at the rate of one per day. This was in contrast to other shipbuilders in Europe, where the production of a similarly sized vessel could often take months. Doesn't this sound exactly like a Big Data implementation?

Big Data in the 21st Century

Big Data today is different in many ways. We are now faced with new and complex data types (large objects, or LOBs), self-describing data (XML) and multi-structured data (images, audio, video, click-streams). This is in addition to the expected high volumes and speeds. Big Data today is not only a scale-up issue; it is also a re-architecture issue and a data integration issue. Further, it often involves integration of dissimilar architectures involving new data types.

The primary goal for the DBA is to research how the first enterprise big data applications should be integrated into current best practices.  There are three best practice categories that demand immediate attention.

Data Recoverability

This is the DBA’s highest priority. While other things such as performance may demand your attention, database recovery is your most important responsibility.

Most big data pilot projects involve standalone implementations of special hardware and software for gathering and analyzing large volumes of data. The DBA's instinct is to leave this data out of the recovery strategy, for two reasons. First, the data usually is extracted from a source system, which should already be backed up. Second, a big data application used for ad hoc queries and analysis is usually given a low priority for disaster recovery.

However, many big data implementations are considered mission-critical. Even a pilot project may be deemed critical by the line of business that uses the system.

In these cases, the DBA may be required to implement a recovery scheme for the data store associated with the big data application. This scheme must take into account that most big data applications are not self-contained. Queries accessing the big data repository must also access other business data. For example, it is common to store the largest data tables in a special-purpose big data appliance (such as the IBM DB2 Analytics Accelerator, or IDAA), while keeping other tables on the main server. SQL queries then join tables in multiple locations.

In architectures like these the DBA must construct a database backup strategy that allows recovery of all related tables to a consistent point in time.
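One way to reason about a consistent recovery point is sketched below in Python. It assumes, purely for illustration, that each related table can be recovered to any moment inside a known window, from its oldest usable backup to the end of its available log. The `RecoveryWindow` type, the function name, and the table names are hypothetical and do not come from any product.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class RecoveryWindow:
    """Hypothetical record of how far back and forward one table can be recovered."""
    table: str
    earliest: datetime  # oldest usable backup
    latest: datetime    # last moment covered by backup plus log


def latest_consistent_point(windows: Iterable[RecoveryWindow]) -> Optional[datetime]:
    """Return the latest moment to which every table can be recovered, or None."""
    windows = list(windows)
    if not windows:
        return None
    # The shared window cannot start before the newest "earliest" point...
    lower = max(w.earliest for w in windows)
    # ...and cannot extend past the oldest "latest" point.
    upper = min(w.latest for w in windows)
    return upper if lower <= upper else None


# Illustrative data: one table on the main server, one on the appliance.
windows = [
    RecoveryWindow("ORDERS (main server)", datetime(2017, 7, 10), datetime(2017, 7, 14, 6, 0)),
    RecoveryWindow("CLICKSTREAM (appliance)", datetime(2017, 7, 12), datetime(2017, 7, 13, 23, 0)),
]
print(latest_consistent_point(windows))
```

If the function returns None, the windows do not overlap, which suggests the backup schedules of the two locations need to be aligned before a joint recovery is possible.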

The DBA should create, document, and test the following:
  • A regularly scheduled process for determining (and documenting) the recovery status of all production objects, including related big data tables;
  • Regular measurements of required recovery times for objects belonging to critical applications;
  • Development of alternative methods of backup and recovery for special situations (such as image copy of indexes, data replication to recovery site, and DASD mirroring);
  • Regular development, improvement and review of data recoverability metrics.
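The first two items in the list above could be automated along the following lines. This is a minimal Python sketch, not a definitive implementation: the `ObjectStatus` fields, the one-day backup age limit, and the sample objects are all assumptions made for the example, and in practice the figures would come from the DBMS catalog and from timed recovery tests.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class ObjectStatus:
    """Hypothetical recovery facts for one production object."""
    name: str
    last_backup: datetime
    measured_recovery: timedelta  # from the most recent recovery test
    required_recovery: timedelta  # target agreed with the business


def recovery_exceptions(objects: Iterable[ObjectStatus],
                        now: datetime,
                        max_backup_age: timedelta) -> List[str]:
    """List objects whose backup is stale or whose recovery takes too long."""
    findings = []
    for obj in objects:
        if now - obj.last_backup > max_backup_age:
            findings.append(f"{obj.name}: last backup is older than {max_backup_age}")
        if obj.measured_recovery > obj.required_recovery:
            findings.append(
                f"{obj.name}: measured recovery {obj.measured_recovery} "
                f"exceeds target {obj.required_recovery}"
            )
    return findings


# Illustrative data only.
now = datetime(2017, 7, 14, 8, 0)
objects = [
    ObjectStatus("CUSTOMER", datetime(2017, 7, 14, 2, 0),
                 timedelta(minutes=30), timedelta(hours=1)),
    ObjectStatus("CLICKSTREAM", datetime(2017, 7, 10, 2, 0),
                 timedelta(hours=5), timedelta(hours=4)),
]
for finding in recovery_exceptions(objects, now, max_backup_age=timedelta(days=1)):
    print(finding)
```

Run on a schedule, a report like this documents recovery status and highlights where measured recovery times drift past their targets.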

Process Automation

A big data implementation will make additional demands on the DBA’s time, including required education and training, support of new analytics users, and management of new hardware and software. How will the DBA find the time for these new tasks in addition to supporting current systems? The answer is process automation.

Automating processes frees the DBA for other work. Simple, repetitive processes are the easiest to automate. Some examples are:
  • Executing an EXPLAIN process for SQL access path analysis;
  • Generating performance reports such as System Management Facility (SMF) accounting and statistics reports;
  • Verifying that new tables have columns with names and attributes that follow standard conventions and are compatible with the enterprise data model and data dictionary;
  • Verifying that access to production data is properly controlled through the correct authority GRANTs;
  • Monitoring application thread activity for deadlocks and timeouts;
  • Reviewing console logs and DB2 address space logs for error messages or potential issues.
Each of these should be replaced by an automated reporting or data gathering process. With such processes in place, DBAs can schedule data gathering and report generation for later analysis, or guide requestors to the appropriate screens, reports or jobs. This frees time for proactive tasks such as projects, architecture, planning, systems tuning, and more.
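As an illustration of one item from the list, verifying column names and attributes, here is a short Python sketch. The naming rule, the `DATA_DICTIONARY` excerpt, and the sample table are hypothetical; a real check would read column definitions from the DBMS catalog and compare them with the site's own standards.

```python
import re
from typing import Dict, List

# Hypothetical standard: upper-case letters, digits and underscores,
# starting with a letter, at most 30 characters.
NAME_PATTERN = re.compile(r"[A-Z][A-Z0-9_]{0,29}")

# Hypothetical excerpt of the enterprise data dictionary:
# column name -> expected data type.
DATA_DICTIONARY = {
    "CUST_ID": "INTEGER",
    "ORDER_DT": "DATE",
}


def check_columns(table: str, columns: Dict[str, str]) -> List[str]:
    """Return a list of convention violations for one table."""
    problems = []
    for name, data_type in columns.items():
        if not NAME_PATTERN.fullmatch(name):
            problems.append(f"{table}.{name}: name does not follow the naming standard")
        expected = DATA_DICTIONARY.get(name)
        if expected is not None and expected != data_type:
            problems.append(
                f"{table}.{name}: type {data_type} differs from dictionary type {expected}"
            )
    return problems


# Illustrative new table with two deliberate violations.
new_table = {"CUST_ID": "CHAR(10)", "clickTime": "TIMESTAMP", "ORDER_DT": "DATE"}
for problem in check_columns("CLICK_EVENTS", new_table):
    print(problem)
```

A check of this kind can run whenever new tables are created, so the DBA reviews only the exceptions instead of every definition.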

The advantage of automation isn't merely speed; automating tasks helps move the DBA away from reactive tasks such as reporting and analysis toward more proactive functions. These might include detailed systems performance tuning, quality control, cost/benefit reviews of potential new applications and projects, and more. Management understands that a DBA spending time on trivial tasks represents a net loss of productivity.

Total Cost Management

Part of the DBA’s job is to give management the data they need to measure productivity and prioritize the team's work. Faced with distributing and prioritizing work across a database support team, management usually creates or extends a spreadsheet or project plan to include task categories and team assignments.

In seeking to increase productivity, management needs an important piece of information: the variability of the DBA's work. This usually translates to whether the tasks that need to be done require a fixed amount of effort and can be pre-scheduled, or consist of a variable amount of work that arrives unpredictably. In other words, are the DBAs being reactive, active, or proactive?
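One rough way to put numbers on this variability is sketched below in Python. It assumes the team records planned and unplanned hours per week; the function name, the two measures chosen, and the sample figures are illustrative assumptions, not an established method.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def workload_profile(planned_hours: Sequence[float],
                     unplanned_hours: Sequence[float]) -> Dict[str, float]:
    """Summarize how much of the work is unplanned and how unevenly it arrives."""
    if not unplanned_hours:
        raise ValueError("at least one week of unplanned hours is required")
    total_unplanned = sum(unplanned_hours)
    total = sum(planned_hours) + total_unplanned
    average = mean(unplanned_hours)
    return {
        # Fraction of all hours spent on work that could not be scheduled.
        "unplanned_share": total_unplanned / total if total else 0.0,
        # Coefficient of variation: week-to-week spread relative to the average.
        "unplanned_cv": pstdev(unplanned_hours) / average if average else 0.0,
    }


# Illustrative four-week sample for one DBA.
profile = workload_profile(planned_hours=[30, 30, 30, 30],
                           unplanned_hours=[5, 15, 2, 18])
print(profile)
```

A high unplanned share suggests a mostly reactive team, while a high coefficient of variation indicates that the reactive work arrives in bursts that are hard to schedule around.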

Many big data pilot projects have obvious initial costs, including hardware leasing and power costs, software licenses, and so forth. DBAs need to research their education and training needs as well. Finally, there can be other, more subtle costs. The time you spend supporting the initial big data implementation means less time on other projects, resulting in delayed deliverables, missed deadlines, or even project cancellations.

The DBA must keep management informed about priorities and time estimates. This will ensure that resources can be allocated to the correct projects.