Skip to main content

MySQL for Developers

Posted by caroljmcdonald on September 16, 2008 at 7:22 PM PDT




http-equiv="content-type">
mysql




this is a write up of info I gathered from   href="http://jpipes.com/"> Jay Pipes,  href="http://oreilly.com/catalog/9780596101718/">High Performance MySQL
, Colin Charles and others
(see References below) for a Sun Tech Days MySQL presentation. style="font-weight: bold;">



MySQL powers many high-volume Web sites,  including industry
leaders such as Yahoo!, Alcatel-Lucent, Google,
Nokia, YouTube, and Zappos.com.


MySQL for Developers

If you are a developer using MySQL, you should learn enough to take
advantage of its strengths, because having an understanding of the
database can help you develop better-performing applications. This can
be especially important for Hibernate or JPA developers, since ORM
frameworks tend to obscure the database schema and SQL  for the
developer, which can lead to poorly-performing index and schema
strategies and inefficient SQL.


MySQL Storage Engine Architecture

src="http://weblogs.java.net/blog/caroljmcdonald/archive/pluggaglearchitecture.gif"
height="390" width="600">


A key difference between MySQL and other database platforms is the
pluggable storage engine architecture of MySQL, which allows you to
select a specialized storage engine for a particular
application need such as data warehousing, transaction processing, high
availability...

A storage engine is  responsible for storing and retrieving all
the data stored .  The storage engines have different
functionality, capabilities and performance characteristics, in many
applications choosing the right storage engine can greatly improve
performance. Here is a brief summary of some of the more common
engines:

  • MyISAM
    - offers high-speed query and
    insert capability, is nontransactional,
    provides table-level locking, full-text indexing, compression, GIS
    functions, and supports indexes. MyISAM is a good choice for read-only
    or read-mostly applications that don't require transactions and issue
    primarily either SELECT or INSERT queries, such as Web Catalogs or
    listing of jobs, auctions, real estate, etc., or data warehousing
    applications. You can use multiple storage engines in a single
    application, sometimes specific components of an application (such as
    logging) fall into this category. MyISAM is not a good general purpose
    storage engine for high concurrent UPDATEs or DELETEs since these
    obtain exclusive write locks, however new rows can be inserted with
    shared read locks.
  • InnoDB
    - supports ACID
    transactions, multi-versioning, row-level locking, foreign key
    constraints, crash recovery,  and  good query performance
    depending on indexes. InnoDB uses row-level locking with multiversion
    concurrency control (MVCC). MVCC can allow fewer row locks by keeping
    data snapshots. Depending on the isolation level, InnoDB does not
    require any locking for a SELECT. This makes high
    concurrency possible, with some trade-offs: InnoDB
    requires more disk space compared to MyISAM, and
    for the best performance, lots of memory is required for the InnoDB
    buffer
    pool. InnoDB is a good choice for any order processing application, any
    application where transactions are required.  
  • Memory
    - stores all data in RAM for
    extremely fast access. Useful when you need fast access to data that
    doesn't change or doesn't need to persist after a restart.  Good
    for "lookup" or "mapping" tables, for caching the results of
    periodically aggregated data, for intermediate results when analyzing
    data.

    MEMORY tables do not support variable-length columns, this means that
    VARCHAR columns get implicitly converted to CHAR columns, and it is
    impossible to create a MEMORY table with TEXT or BLOB columns.
  • Merge
    - allows to logically group together a series of identical
    MyISAM tables and reference them as one object. Good for very large DBs
    like data warehousing.
  • Archive - provides for
    storing and retrieving large amounts of
    seldom-referenced historical, archived, or security audit information.

More specialized engines:

  • Federated - Allows a
    remote server's tables to be used as if they
    were local, by linking together separate MySQL servers as one logical
    database. Can be good for distributed or data mart environments. 
  • Cluster/NDB
    -Highly-available clustered storage engine. 
    Should be used where very high speed, availability and
    redundancy are absolute requirements. Other tables (even in the same
    database)
    should use one or more of the other storage engines.
  • CSV-references
    comma-separated files as database tables.Useful
    for large bulk imports or exports
  • Blackhole- the /dev/null
    storage engine for temporarily disabling
    application input to the database .  Useful for benchmarking and
    some replication scenarios.

You can use multiple storage engines in a single application, a storage
engine for the same table on a slave can be different than that of the
master. This can be very useful for taking advantage of an engine's
abilities.


Schema, the basic foundation of
performance

Database normalization minimizes duplication of information, this makes
updates simpler and faster because the same information doesn't have to
be updated in multiple tables.  In a denormalized database,
information is duplicated, or stored in multiple places. With a
normalized database:

  • updates are usually faster.
  • there's less data to change.
  • tables are usually smaller, use less memory, which can give
    better performance.
  • better performance for distinct or group by queries

The disadvantages of a normalized schema are queries typically involve
more tables and require more joins which can reduce performance. 
Also normalizing may place columns in different tables that would
benefit from belonging to the same index, which can also reduce query
performance.  More normalized schemas are better for applications
involving many transactions, less normalized are better for reporting
types of application.  You should normalize your schema first,
then de-normalize later.  Applications often need to mix the
approaches, for example use a partially normalized schema, and
duplicate, or cache, selected columns from one table in another table.

Data Types

In general, try to use the smallest data type that you can.  Small
and simple data types usually give better performance because it 
means fewer disk accesses (less I/O), more data in memory, and less CPU
to process operations.

Numeric Data Types

MySQL has 9 numeric data types.



Whole Numbers:

TINYINT, SMALLINT, MEDIUMINT, INT,
BIGINT:   require 8, 16, 24, 32, and 64 bits of storage
space, respectively. They can store values from –2 style="font-style: italic;">(n-1)
to 2(n-1)–1, where N is the number of bits of
storage space they use. 



Use UNSIGNED when you don't need negative numbers, this doubles the
bits of storage space.  BIGINT is not
needed for AUTO_INCREMENT,  INT UNSIGNED stores 4.3 billion
values!



INT(1) does not mean 1 digit!  The number in parentheses is the
ZEROFILL argument, and specifies the number of characters some tools
reserve for display purposes. For storage and computational purposes,
INT(1) is identical to INT(20).



Integer data types work best for primary key data types. 



Real Numbers
:

FLOAT, DOUBLE: supports approximate
calculations with standard floating-point math.

DECIMAL: use DECIMAL when you need exact results, always use for
monetary/currency fields.



Other:

BIT: to store  0,1 values. BIT(1)
stores 1 bit, BIT(2) stores 2... max is 64.


Character Data Types

The CHAR and VARCHAR types are
declared with a length that
indicates the maximum number of characters to store. style="font-weight: bold;">

VARCHAR(n) stores variable-length character strings. VARCHAR
uses only
as much space as it needs, which helps performance because it saves
disk space. Use when the maximum column length is larger than the
average
length and when updates to the field are rare, so fragmentation is not
a problem.  VARCHAR
gives more efficient disk usage, however MySQL internal memory buffers
are fixed width, so  internal memory usage isn't more efficient,
therefore try to keep VARCHAR Max Length short.

CHAR(n) is fixed-length: MySQL
allocates enough space for the specified
number of characters. Useful to store very short strings, when all the
values are nearly the same length, and  for data that's changed
frequently.

Use NOT NULL

Always define columns as NOT NULL unless there is a very good reason
not to do so:

  • can save up to a byte per column per row of data
  • nullable columns make indexes, index statistics, and value
    comparisons more complicated.


Indexes

Indexes are data structures that help retrieve row data with specific
column values faster. Indexes can especially improve performance for
larger data bases.  Most
MySQL storage engines support  B-tree indexes. a B-tree is
(remember data strucutres 101 ?) a tree
data structure that sorts data values, tree nodes define the upper and
lower bounds of the values in the child nodes.  Leaf nodes have
pointers to the data instead of pointers  to child nodes. B-trees
are kept balanced by requiring that all leaf nodes are at the same
depth.  MyISAM leaf nodes refer to the indexed row by the position
of the row,  InnoDB leaf nodes refers to the index by its primary
key values.



InnoDB's clustered indexes store
the row data in the leaf nodes, it's called clustered because rows
with close primary key values are stored close to each other. 
This can
make retrieving indexed data fast, since the data is in the
index.  But
this can be slower for updates , secondary indexes, and for full table
scans.



Covering Indexes are indexes
that contain all the data values needed for a query, these queries can
improve performance because the row does not have to be read.


Know Every SQL Statement

You need to understand the SQL
queries your application makes and  evaluate their performance.
For this you can harness the href="http://dev.mysql.com/doc/refman/5.0/en/slow-query-log.html">MySQL
slow query log and use EXPLAIN. 

Consider adding
an index for queries
that are taking too long, or cause a lot of load. Know how your query
is executed by MySQL, consider disk I/O in
particular.  Basically you want to make your queries access less
data, Indexes are a good way to reduce data access.



Know Every SQL Statement for
Hibernate/JPA developers


If you are using Hibernate, JPA, or
another ORM framework you want to look at the generated SQL-statements.
If you are using Hibernate,
enable the Hibernate SQL log. If you are using Glassfish
with Toplink as the JPA provider, you can href="http://www.oracle.com/technology/products/ias/toplink/jpa/howto/configure-logging.html">
log the SQL that is being sent to the database by adding this
property in the definition of the persistence unit in the persistence.xml

file, as follows:
 value="FINE">.
Then run a use case of your application and
examine the SQL statements which are executed by JPA or your ORM
framework. You want to make sure you are only retrieving the data your
application needs, that is you want to optimize the number of SQL
statements executed (see lazy loading below), and you want to examine
the execution plan for queries.


Optimize data access

Basically you want to make
your queries access less
data:

  • is your application retrieving more data than it needs, are
    queries accessing too many rows or columns?
  • is MySQL analyzing more rows than it needs?

Understanding EXPLAIN

EXPLAIN provides the execution
plan chosen by the MySQL
optimiser for a
specific SELECT statement.  It is important to run EXPLAIN on all
SELECT statements that your code is executing against the
database.  This step ensures that missing indexes are picked up
early in the development process and gives developers insight into how
the MySQL optimizer has chosen to execute the query.



To use it just precede a SELECT statement
with the keyword EXPLAIN and  MySQL
will display information from the optimizer about the query execution
plan, which explains how it would process the SELECT
including information about how tables are joined and in which order.
The EXPLAIN EXTENDED SELECT variant, followed by SHOW WARNINGS gives
some additional information, including the final rewritten query as
used by the optimizer.

src="http://weblogs.java.net/blog/caroljmcdonald/archive/explain.gif"
height="620" width="937">

With the help of EXPLAIN, you can see
where you should add indexes to tables to get a faster SELECT that uses indexes to find rows. You can
also use EXPLAIN to check whether the
optimizer joins the tables in an optimal order.



EXPLAIN
returns a row of
information for each "table"
used in the SELECT statement, which shows
each part and the order of the execution plan.  The "table" can
mean a real schema table, a derived or temporary table, a subquery, a
union result. Here is an explanation of some of the output for each
row:

  • table:  the real
    table name or an alias.
  • select_type: shows
    whether the row is a simple or complex SELECT. class="literal"> If complex it can be:
    PRIMARY (Outermost SELECT), UNION, SUBQUERY, DERIVED, or UNCACHEABLE.
  • type: shows the style="font-weight: bold;">"access strategy" type to find rows.
    The different access types ordered from the best  to worst:

    • system, or const: very fast because  the table has at
      most
      one matching row (For example a primary key used in the WHERE)
    • eq_ref: index lookup returning one value
    • ref: index access
    • fulltext:  FULLTEXT index.
    • ref_or_null:  like ref, but with
      an
      extra search for rows that contain NULL values.

    • range:   index scan to select the rows in given
      range
    • index:   index tree is scanned.
    • ALL: A full table scan
  • key: the index that MySQL
    decided to use, from the possible_keys, to look up rows.
  • rows: the number of rows
    MySQL estimates it must examine to execute the query.
  • ref: columns or
    constants which are compared to key to select rows .
  • Extra: additional
    information about how MySQL resolves the query. Watch out for Extra
    values of Using filesort
    and Using temporary
    Using
    index
    means information is retrieved from the table using only
    information in the index tree without having to do an additional seek
    to read the actual row. This strategy can be used when the query uses
    only columns that are part of a single index ( style="font-weight: bold;">Covering Index).


When do you get a full table scan?

  • No WHERE condition
  • No index on any field in WHERE condition
  • When your range returns a large number of rows, i.e. too many
    records in WHERE condition
  • When optimizer sees that the condition will return > ~20% of
    the records in a table, it will prefer a scan versus many seeks
  • Pre-MySQL 5, using OR in a WHERE clause
    • now fixed with an index merge, so the optimiser can use more
      than one index to satisfy a join condition

How do you know if a scan is used?

In the EXPLAIN output, the “ style="font-weight: bold;">type” for the table/set will be
“ALL” or “index”.  “ALL” means a full table data record scan is
performed.  “index” means a full index record scan.  Avoid
this by ensuring indexes are on columns that are used in the WHERE, ON,
and GROUP BY clauses.


Scans and seeks

A seek jumps into a random
place (on disk or in memory) to fetch
data. A scan will jump to the start of the data, and sequentially read
(from
either disk or memory) until the end of the data. With large amounts of
data, sequentially scanning through contiguous data on disk or in
memory
is faster than performing many random seek operations.  But scans
can also be a sign of poor indexing, if an index contains many
rows with the same value it is not very useful.  MySQL keeps stats
about the uniqueness of values in an index in order to
estimate the rows returned (rows in
the explain output). If the rows is
greater than a certain  % of rows in the table, then MySQL will do
a scan.



Indexed columns and functions
don't mix

Indexes can quickly find the rows that match a WHERE clause, however
this works only if the index is NOT used in a function or expression in
the WHERE clause. Below is an example where an index is used :



mysql> EXPLAIN SELECT * FROM film WHERE title LIKE 'Tr%'\G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: film

         style="color: rgb(204, 0, 0); font-weight: bold;">type: range

possible_keys: idx_title

          style="color: rgb(204, 0, 0); font-weight: bold;">key: idx_title

      key_len: 767

          ref: NULL

         rows: 15

        Extra: Using where




In the example above, a  fast range "access strategy" is chosen by
the optimizer, and the index scan on title is used to winnow the query
results down.

Below is an example where an index can not be used :



mysql> EXPLAIN SELECT * FROM film WHERE  style="font-weight: bold; color: rgb(204, 0, 0);">LEFT(title,2)
= 'Tr' \G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: film

         style="font-weight: bold;"> style="color: rgb(204, 0, 0); font-weight: bold;">type: ALL

possible_keys: NULL

          key: NULL

      key_len: NULL

          ref: NULL

         rows: 951

        Extra: Using where




A slow full table scan (the ALL"access
strategy") is used because a function ( style="color: rgb(153, 0, 0);">LEFT) is operating on the title
column. Operating on an indexed column with a function (in this case
the LEFT() function) means the optimizer cannot use the index to
satisfy the query.  Typically, you can rewrite queries in order to
not operate on an
indexed column with a function.


Solving multiple problems in one
query

Here is an example of improving a query:



SELECT * FROM Orders WHERE TO_DAYS(CURRENT_DATE()) –  style="font-weight: bold;">TO_DAYS(order_created) <= 7;


First, we are operating on an indexed column (order_created)
with a function TO_DAYS – let's fix that:



SELECT * FROM Orders WHERE order_created >=  style="font-weight: bold;">CURRENT_DATE() - INTERVAL 7 DAY;




Although we rewrote the WHERE expression to remove the function on the
index, we still have a non-deterministic function CURRENT_DATE()

in the statement, which eliminates this query from being placed in the
query cache. Any time a
non-deterministic function is used in a SELECT statement, the query
cache ignores the query.  In read-intensive applications, this can
be a significant performance problem.
let's fix that:



SELECT * FROM Orders WHERE order_created >= ' style="font-weight: bold;">2008-01-11' - INTERVAL 7 DAY;




We replaced the function with a constant (probably using our
application programming language).  However, we are specifying
SELECT * instead of the actual fields we need from the table. 
What if there is a TEXT field in Orders called order_memo that we don't
need to see?  Well, having it included in the result means a
larger result set which may not fit into the query cache and may force
a disk-based temporary table.
let's fix that:



SELECT order_id, customer_id, order_total, order_created

FROM Orders WHERE order_created >= '2008-01-11' - INTERVAL 7 DAY;



A Few Things to consider for 
JPA/Hibernate  devlopers:

Lazy loading and JPA

With
JPA many-to-one and many-to-many relationships lazy load by default ,
meaning they will be loaded when the entity in the relationship is
accessed. Lazy loading is usually good, but if you need to access all
of the "many" objects in a relationship, it will cause n+1 selects
where n is the number of  "many" objects.  You can change the
relationship to be loaded eagerly as follows : style="font-weight: bold;">


public class Employee{

    @OneToMany(mappedBy = "employee", style="font-weight: bold;">fetch = FetchType.EAGER)
    private Collection
addresses;

.....
}

However you should be careful with eager loading which could cause
SELECT statements that fetch too much data. It can cause a Cartesian
product  if you eagerly load entities with several related
collections. 



If you want to temporarily override the LAZY fetch type, you could
use Fetch Join.  For example this query would eagerly load the
employee addresses: 

@NamedQueries({ @NamedQuery(name="getItEarly",
                 query="SELECT e FROM Employee e JOIN FETCH e.addresses")})

public class Employee{
.....
}

Optimistic locking and JPA

The Java Persistence API 1.0 persistence providers support optimistic
locking, where your application will get an exception whenever a
transaction tries to commit an object that was updated in the database
since this transaction began. To enable this for a entity, you need to
add a version attribute:



@Entity

public class Item {

    ...

    @Version

    private int version;

    ...

}



In Conclusion

  • Understand the storage engines
  • Keep data types small and compact
  • Understand your query execution plans with the EXPLAIN output
  • Understand the scan vs. seek choice the optimizer must make
  • Don't mix functions and indexed columns


References

High Performance
MySQL book


MySQL Pluggable
Storage Engine Architecture


href="http://dev.mysql.com/tech-resources/articles/storage-engine/part_2.html">MySQL
Storage Engine Architecture, Part 2: An In-Depth Look

Optimizing
Queries with EXPLAIN


Java Persistence with
Hibernate book


Jay Pipes blog  style="text-decoration: underline;">

Colin Charles blog

mysql performance blog

Ronald Bradford blog

Taking
JPA for a Test Drive


Pro EJB 3: Java
Persistence API


href="http://dev.mysql.com/tech-resources/articles/pro-mysql-ch6.html">Pro
MySQL, Chapter 6: Benchmarking and Profiling










Comments

JPA defines that when any of the object's tables changes the version is updated. However it is less clear on relationships. If Basic, Embedded, or a OneToOne, ManyToOne relationship changes, the version will be updated. But if OneToMany, ManyToMany changes making an update to the version may depend on the JPA provider.

nice article, thanks. I though have an issue regarding the optimistic lock. Is there a way to increment not the version number of an entity when a property of it having an OneToMany, ManyToMany or ManyToOne relation is dirty?

thanks!

Very thorough article, thanks.