IT World Canada

How to select a graph database


Graph databases are the fastest-growing category in all of data management. They have evolved into a mainstream technology that organizations in every industry have implemented successfully to support a wide variety of applications. Organizations are attracted to graph databases to meet large and complex data challenges that traditional databases, such as relational and NoSQL databases, cannot handle well.

Selecting a graph database is an important decision that can advance the business plan of your organization. Because graph database software is relatively new and changing rapidly, buyers often struggle to reconcile the conflicting claims made by different graph database vendors.

Gaurav Deshpande, VP of Marketing at TigerGraph, said, “Graph databases create huge value for our customers in specific applications including data analytics, machine learning, fraud detection, and single customer view.”

To assist IT management in the graph database software selection process, this article describes the major selection criteria.

Functionality

The functionality of a graph database software package refers to the range of operations that can be performed by the graph database and is a critical part of the overall evaluation.

Query execution speed

Query execution speed is the ability to query large data volumes and produce results in real time.

Query execution speed is often the most important selection criterion because graph databases are used for applications that need to process large data volumes quickly. When query execution is too slow, processing occurs after the fact, and results require follow-up rather than immediate resolution.
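One way to compare candidates on this criterion is to time the same representative query against each product. The following is a minimal sketch, not a benchmark methodology: `measure_latency` and the stand-in query are hypothetical names, and in practice `run_query` would wrap a call to the client library of the graph database under evaluation.

```python
import math
import statistics
import time
from typing import Callable


def measure_latency(run_query: Callable[[], object], repeats: int = 30) -> dict:
    """Time repeated executions of one query and summarize latency in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical: a call into the candidate database's client
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # 95th percentile by the nearest-rank method.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


# Stand-in workload so the sketch runs without a database.
summary = measure_latency(lambda: sum(range(10_000)))
print(summary)
```

Reporting the 95th percentile alongside the median matters for real-time applications, because occasional slow queries are what end users notice.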

Update and insert execution speed

Update and insert execution speed is the ability to apply changes to graph databases quickly.

Update and insert speed are important because database updates compete for server resources against the active query load. Organizations try to minimize these periods of competition because they slow response time. Read replicas, described below, can also be used to substantially reduce this competition.

Scalability

Scalability is the ability to distribute graph databases, through partitioning, across the multiple physical servers that make up a cluster.

Scalability is important because it is not unusual for large graph databases to exceed the storage size limit that a physical server can access. Inadequate scalability can severely limit the size of a graph database.
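To make partitioning concrete, the sketch below assigns each vertex to a server by hashing its ID and counts the edges that end up spanning two servers. It is an illustration under simple assumptions, not how any particular vendor partitions data; the function names and the sample edges are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(vertex_id: str, num_servers: int) -> int:
    """Map a vertex to a server using a stable hash of its ID."""
    return zlib.crc32(vertex_id.encode("utf-8")) % num_servers


def partition_graph(edges, num_servers):
    """Store each edge with its source vertex and count cross-server edges."""
    placement = defaultdict(list)
    cut_edges = 0
    for src, dst in edges:
        owner = partition_for(src, num_servers)
        placement[owner].append((src, dst))
        if partition_for(dst, num_servers) != owner:
            cut_edges += 1  # traversing this edge requires a network hop
    return dict(placement), cut_edges


# Made-up sample data.
sample_edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("dave", "alice")]
placement, cut_edges = partition_graph(sample_edges, num_servers=3)
print(placement, cut_edges)
```

The number of cross-server edges is worth asking vendors about, because every such edge turns a local traversal step into network traffic.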

Ease-of-use

Graph databases from different vendors vary considerably in the ease-of-use of various parts of their functionality intended for use by the following typical categories of end-users:

  1. Business analysts and software developers use the query development and execution environment.
  2. Software developers spend much of their time using the integrated development environment (IDE) to develop application software that loads, updates and integrates data in the graph databases.
  3. Database administrators (DBAs) use the database management environment to monitor and manage the operation of the graph databases.
  4. Data modellers use the database management environment to implement and revise the schemas of the graph databases.

Ease-of-use of the functionality associated with the graph database software is important because it determines:

  1. How productive and effective the staff interacting with the database will be.
  2. The extent to which the availability target for the graph database will be achieved.

Technology

Evaluating the information technology incorporated into graph database software packages under consideration requires considerable expertise and is a critical part of the overall evaluation.

Graph query language

Graph databases from different vendors come with different graph query languages, unlike relational databases, which come with largely standardized SQL. These graph query languages, many of which resemble SQL, vary in their capability, including the extent of their:

  1. Turing-completeness.
  2. Ability to express graph computations.
  3. Ability to process analytics natively.
  4. Support for ad hoc queries.
  5. Support for complex, parameterized procedures.

The absence of sufficient capability leads to additional software development effort before an application can be used routinely in production.

In-database analytics

In-database analytics is the technology to process the analytics within the database by the database engine.

The absence of this analytics technology requires analytics processing to occur externally. That processing adds infrastructure cost, complexity, and elapsed time to complete tasks.

Integrated development environment

Graph databases from different vendors vary in the scope and functionality of the integrated development environment (IDE) they provide with the graph database. Desirable functionality includes:

  1. A visual interface, rather than a command-line interface, for exploration, update, and query development.
  2. Visual data modeling.
  3. Extract, transform, and load (ETL) support.
  4. Support for monitoring and management of graph databases.

The absence of sufficient IDE functionality reduces software developer productivity and leads to:

  1. Additional software development effort before an application can be used routinely in production.
  2. Longer elapsed times to implement enhancements and resolve software defects.

Data loading performance

Graph databases from different vendors vary in the resource consumption and elapsed time required to perform the bulk data loading tasks that occur frequently in the operation of graph databases. The range of supported input data formats is a related selection criterion.

The absence of high-performance bulk data loading means that this process will slow the overall performance of the graph database.

Database engine design features

Graph databases from different vendors vary in the technology available as a result of their database engine design choices. Major design features include:

  1. Distributed or multi-node graph storage offers a much higher ultimate limit on database size compared to single-node graph storage.
  2. Massively parallel processing offers a much higher limit on the number of queries and updates that can be performed concurrently compared to limited or no parallel processing.
  3. Compressed data storage offers a higher ultimate limit on database size compared to uncompressed data storage.
  4. Schema-first design optimizes query performance compared to schema-free design.
  5. Schema-free design reduces effort and disruption associated with data structure changes compared to schema-first design.
  6. Read replicas offer the ability to separate the query load from the update load onto separate server clusters to significantly raise the point where the graph database update performance starts to slow.

Deep-link analytics

Deep-link analytics is the technology that supports query and insert processing when complex data structures are involved. Processing complex data structures requires the ability to traverse 5 to 10 or more hops (relationships) efficiently on databases of all sizes.

Deep-link analytics is important because without it query and insert execution speed deteriorates markedly when more hops are involved, or the query cannot be processed at all.
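A hop is one step along a relationship. The sketch below shows what a multi-hop query does conceptually, using a breadth-first search over an in-memory adjacency list. It is a plain-Python illustration with made-up data, not a representation of any vendor's engine; `k_hop_neighbors` is a hypothetical name.

```python
from collections import deque


def k_hop_neighbors(adjacency, start, max_hops):
    """Return {vertex: hop_distance} for every vertex within max_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[vertex] + 1
                queue.append(neighbor)
    return distance


# Made-up chain of related accounts: a -> b -> c -> d -> e -> f.
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"], "e": ["f"]}
print(k_hop_neighbors(chain, "a", 2))
print(k_hop_neighbors(chain, "a", 5))
```

On real data each vertex has many neighbors, so the number of vertices visited can grow very quickly with each additional hop, which is why deep queries stress a database engine.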

In-database machine learning

In-database machine learning is the technology to process machine learning algorithms within the database by the database engine.

The absence of this technology requires machine learning processing to occur externally. That processing adds infrastructure cost, complexity, and elapsed time to complete tasks.

Transaction and cluster consistency

Graph databases from different vendors vary in the technology that guarantees that database transactions are processed reliably. These guarantees are known as atomicity, consistency, isolation, and durability (ACID).

Maintaining ACID is more complex for a cluster of servers than for a single server. The absence of ACID technology for a cluster introduces a severe limitation on the size and complexity of applications.
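As a small illustration of the atomicity part of ACID only, the sketch below wraps changes to a toy in-memory graph in a transaction that is undone if any step fails. The `TinyGraph` class is hypothetical and runs in a single process; it says nothing about how a clustered product coordinates transactions across servers, which is the harder problem described above.

```python
from contextlib import contextmanager


class TinyGraph:
    """Toy in-memory graph used only to illustrate all-or-nothing updates."""

    def __init__(self):
        self.vertices = set()
        self.edges = set()

    @contextmanager
    def transaction(self):
        # Keep a snapshot so every change can be undone if any step fails.
        snapshot = (set(self.vertices), set(self.edges))
        try:
            yield self
        except Exception:
            self.vertices, self.edges = snapshot
            raise

    def add_vertex(self, vertex):
        self.vertices.add(vertex)

    def add_edge(self, src, dst):
        if src not in self.vertices or dst not in self.vertices:
            raise ValueError(f"both endpoints must exist: {src!r}, {dst!r}")
        self.edges.add((src, dst))


graph = TinyGraph()
try:
    with graph.transaction():
        graph.add_vertex("alice")
        graph.add_edge("alice", "bob")  # fails because "bob" was never added
except ValueError:
    pass

# The failed transaction leaves no partial update behind.
print(graph.vertices, graph.edges)
```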

Graph algorithm library

Graph databases from different vendors vary in the scope and technology of the algorithm library they provide with the graph database. Example capabilities include:

  1. Number or richness of graph algorithms.
  2. Extensibility of graph algorithms through code.
  3. Customizability of graph algorithms through parameters.

Software developers incorporate components from the library into their application software. This library greatly reduces software development and testing effort.

The absence of sufficient algorithms can lead to significant additional software development and testing effort before an application can be used routinely in production.
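To give a sense of what a library algorithm looks like and what customizing it through parameters means, here is a minimal sketch of PageRank, a well-known graph algorithm, written in plain Python. The function signature and the `damping` and `iterations` parameters are illustrative assumptions, not the interface of any vendor's library, and the sample graph is made up.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Compute PageRank scores for a directed graph given as {vertex: [targets]}."""
    vertices = set(adjacency)
    for targets in adjacency.values():
        vertices.update(targets)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices with no outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in vertices if not adjacency.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in vertices}
        for v, targets in adjacency.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Made-up graph: three vertices point at "hub", which points back at "a".
links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
scores = pagerank(links, damping=0.85)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Writing, testing, and tuning even a simple algorithm like this for production data volumes is the effort a well-stocked library saves.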

Standard APIs

Graph databases from different vendors vary in their support for industry standards and widely used interfaces such as REST APIs, JSON output, JDBC, Python, and Spark.

The absence of sufficient support for industry standards can lead to significant additional software development and testing effort before an application can be used routinely in production.

OLAP and OLTP workloads

Graph databases from different vendors vary in the technology available to support both OLAP and OLTP workloads.

The design of graph databases is oriented strongly toward handling large online analytical processing (OLAP) workloads efficiently.

However, as graph databases increase in prominence in organizations, it becomes inevitable that they will also be expected to handle online transaction processing (OLTP) workloads.

The absence of robust OLTP support means that the OLTP workload will consume more server resources, which will slow the overall performance of the graph database.

Alternatively, this OLTP issue can be addressed by simply implementing a policy that the graph databases will not be used for OLTP workloads.

Software Vendor

Evaluating the vendors of graph database software packages is quite difficult due to a rapidly changing marketplace but should form part of the overall evaluation.

Quite new technology

Graph database concepts have only recently emerged from academia. Much of the technology that underlies graph databases is quite new.

This situation creates the risk of software instability in graph database software packages. Instability undermines the high availability customers expect from their production-quality applications.

Recently founded vendors

Customers see many opportunities that applications based on graph databases can address, which in turn has attracted several vendors to offer graph database software packages.

These graph database vendors vary greatly in terms of organizational maturity. It is difficult to know which vendors will:

  1. Grow rapidly, with lots of venture capital help and superior marketing, and then implode due to an inability to manage the meteoric growth.
  2. Merge with another vendor and then merge product lines in ways that will adversely impact some of their customers.
  3. Be acquired by a larger entity that will cause product development to slow rather than accelerate as promised.
  4. Abandon their graph database products due to a lack of market acceptance and negative reviews.

CIOs should clearly spell out these vendor risks to management as part of the graph database software package selection and acquisition process.

Rapid product changes

The rapid growth in sales of graph database software packages has triggered an arms race among competing vendors to offer the most features, the latest technology, and the most attractive pricing.

This situation results in rapid product changes that mean the recommendations of a graph database software package selection process will be valid only for a surprisingly short period of time.

Developer community

The presence of an active and growing developer community for a vendor’s graph database software package can be an invaluable resource for application ideas and problem resolution.

Implementation

The skills, elapsed time, and cost to implement a graph database software package are not trivial and should form part of the overall evaluation.

Data migration

Graph databases from different vendors vary in the technology available to support data migration into a graph database.

Ease of data migration from other datastores, using different data formats, is important because the implementation of a graph database always involves significant data migration.

Training

Vendors vary in their approach to training your staff to perform the implementation of a graph database.

Given that graph databases are comparatively new, it is unlikely that an organization can hire staff with the experience it requires. This lack of experience makes a training plan essential for the success of a graph database implementation project.

Starter kits

Vendors vary in their approach to helping customers climb the learning curve associated with graph databases.

Starter kits can reduce the cost of implementation of a graph database and shorten the elapsed time to business value.

Cost of Ownership

Evaluating the cost of ownership of graph database software packages should be a lower-priority part of the overall evaluation. If cost considerations are dominating the selection process, the likely problem is an inadequate business case for the graph database application.

Total cost of ownership

Graph databases from different vendors vary considerably in the typical components of the total cost of ownership such as:

  1. Software license fees – low or zero open-source license fees are typically offset by the cost of more staff to support the software.
  2. Software maintenance fees.
  3. Support cost.
  4. Version upgrade implementation cost – expect to install a new version multiple times per year.
  5. Operating cost – discussed in the next section.
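The components above can be combined into a simple multi-year estimate. The sketch below is only an arithmetic illustration: the `AnnualCosts` fields mirror the list, and every figure is a made-up placeholder rather than real pricing for any product.

```python
from dataclasses import dataclass


@dataclass
class AnnualCosts:
    """Yearly cost components for one candidate product (placeholder values)."""
    license_fees: float
    maintenance_fees: float
    support: float
    upgrades_per_year: int
    cost_per_upgrade: float
    operating: float

    def total(self) -> float:
        return (
            self.license_fees
            + self.maintenance_fees
            + self.support
            + self.upgrades_per_year * self.cost_per_upgrade
            + self.operating
        )


def total_cost_of_ownership(costs: AnnualCosts, years: int) -> float:
    """Multiply the yearly total by the evaluation period, ignoring inflation."""
    return costs.total() * years


# Hypothetical figures for a licensed product and an open-source product.
licensed = AnnualCosts(100_000, 20_000, 30_000, 2, 10_000, 50_000)
open_source = AnnualCosts(0, 0, 120_000, 2, 15_000, 60_000)
for name, costs in [("licensed", licensed), ("open source", open_source)]:
    print(name, total_cost_of_ownership(costs, years=3))
```

Laying the components out this way makes it visible when a zero license fee is offset by higher support and upgrade costs.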

Operating cost

Graph databases from different vendors vary considerably in the consumption of computing resources required to process a given query or update. The differences are significant for all components of computing resources:

  1. Storage.
  2. Compute.
  3. Memory.
  4. Input/output.

The cost of computing resources is usually seen as relatively unimportant because of its declining share of the total application operating cost. However, the computing resources consumed by graph database-oriented applications are significant. The difference in consumption across graph database software packages is noticeable and affects the cost of operating the graph database environment.

Cost of a cloud offering

Because operating applications in the cloud is demonstrably cheaper than operating the same applications on-premises, some organizations lose operating cost awareness until the invoices from the cloud service provider start to arrive. The experience of many organizations is that the cloud cost advantage is quickly swallowed up by a much higher consumption of computing resources.

What strategies would you recommend for selecting a graph database software package? Let us know in the comments below.
