Reporting queries (I will use this term here) are queries which summarize and group data over a certain period of time. For example, on a social network site we may want to know how many messages have been sent in a given period, grouped by region and status (sent, received, etc.) and ordered by the number of messages sent.
As an example I will take a table which is used to send SMS (text messages).
SQL:
select concat('+', substring(region_code, 1, 2), 'xxx') as reg, status, count(*) as cnt
from messages
where submition_date between '2009-01-01' and '2009-04-01'
group by reg, status
having cnt > 100
order by cnt desc, status
limit 100;
This query will do a range scan over submition_date and perform a filesort. There are common, well-known approaches which can be used to optimize such a query (a “covering index”, “summary tables”, an external data warehouse, etc.). Sometimes those approaches do not work or are too complex.
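For illustration, here is a minimal sketch of the first two approaches (the index name, the summary table name and the column types are assumptions, not taken from the real schema):
-- "covering" index: the range column and the grouping columns live in one index,
-- so the report can be served from the index without reading the table rows
alter table messages add index idx_report (submition_date, region_code, status);
-- "summary table": pre-aggregated counts, refreshed periodically (e.g. nightly)
create table messages_summary (
  day         date not null,
  region_code varchar(16) not null,
  status      varchar(16) not null,
  cnt         int unsigned not null,
  primary key (day, region_code, status)
);
insert into messages_summary
select date(submition_date), region_code, status, count(*)
from messages
where submition_date >= curdate() - interval 1 day
  and submition_date < curdate()
group by date(submition_date), region_code, status;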
Yet another approach is to use an external search/index solution, for example Sphinx Search (http://www.sphinxsearch.com). In this case the data will be stored in MySQL and Sphinx will be used as an external indexer/searcher, with SQL protocol support.
Using Sphinx
Starting with version 0.9.9-rc2, the Sphinx searchd daemon supports the MySQL binary network protocol and can be accessed with the regular MySQL API. For instance, the ‘mysql’ CLI client program works well. Here’s an example of querying Sphinx using the MySQL client:
$ mysql -P 3307
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 0.9.9-dev (r1734)
As Sphinx can store attributes (“fields”) and group/sort by them, it can be used for our report. Also, an application can simply connect to the Sphinx server with the MySQL protocol: the application will think it is working with MySQL (there are minor differences in Sphinx SQL, such as “@count” and support for timestamps only instead of datetime).
Here is the above query expressed in Sphinx:
mysql> select *
from messages_dw
where
submition_date > 1230793200
and submition_date < 1238569200
group by region_code
order by @count desc
limit 0,10;
10 rows in set (0.19 sec)
The same query in MySQL 5.1 runs much slower:
select region_code, count(*) as cnt
from messages_dw
where
submition_date > '2009-01-01'
and submition_date < '2009-04-01'
group by region_code
order by cnt desc
limit 0,10;
10 rows in set (14.47 sec)
2 important notes:
- For now, Sphinx can’t group by more than one field. However, we can combine the 2 fields into 1 and then group by this new field (see the query below for an example of how to do it).
- In the configuration file (in the searchd section) we need to set max_matches to a very large number (max_matches = 10000000, for example). By default, Sphinx will not generate exact counts (and other aggregates); this is done for speed. However, setting max_matches to a large number fixes this issue.
mysql> select BIGINT(region_code)*4*1024*1024*1024+status_code
as reg_status, *
from messages_dw
where date_added > 1230793200
and date_added < 1238569200
group by reg_status
order by @count desc, region_code
limit 0,10;
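The multiplier 4*1024*1024*1024 equals 2^32, so region_code lands in the upper 32 bits of the combined 64-bit value and status_code in the lower 32 bits (assuming status_code fits in 32 bits). If the original values are needed back, they can be unpacked with integer division and modulo, for example on the MySQL side; a small sketch with made-up values:
-- e.g. region_code = 49 and status_code = 2, packed as 49*2^32 + 2
select (49 * 4294967296 + 2) div 4294967296 as region_code, -- upper 32 bits => 49
       (49 * 4294967296 + 2) mod 4294967296 as status_code; -- lower 32 bits => 2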
More speed comparisons, grouping by 2 fields:
Sphinx:
mysql> select BIGINT(region_code)*4*1024*1024*1024+status_code as reg_status, * from messages_dw where date_added > 1230793200 and date_added < 1238569200 group by reg_status order by @count desc, region_code limit 0,10;
10 rows in set (0.98 sec)
MySQL:
mysql> select region_code, status+0, count(*) as cnt from messages_dw where submition_date between '2009-01-01' and '2009-04-01' group by region_code, status order by cnt desc, region_code limit 0,10;
10 rows in set (14.47 sec)
Conclusion
If you need fast ad-hoc reporting queries, Sphinx Search can be a good option.
Advantages:
- Faster sorting and grouping (which is very important for reporting queries)
- No need to use an external API for queries; Sphinx now supports the MySQL protocol
Disadvantages:
- Need to run an additional Sphinx daemon
- Need to re-index the data when it changes
Sphinx config file
source src1
{
type = mysql
sql_host = 127.0.0.1
sql_user = root
sql_pass =
sql_db = dw
sql_port = 3309 # optional, default is 3306
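# the constant 't' as content in sql_query is a dummy full-text field
# (Sphinx expects at least one field to index); the report uses only the attributes below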
sql_query = \
SELECT msg_id, region_code, status+0 as status_code, UNIX_TIMESTAMP(submition_date) AS date_added, 't' as content \
FROM messages_dw
sql_attr_uint = region_code
sql_attr_uint = status_code
sql_attr_timestamp = date_added
sql_query_info = SELECT * FROM messages_dw WHERE msg_id=$id
}
index messages_dw
{
source = src1
path = /data1/arubin/sphinx_new//var/data/test1
docinfo = extern
charset_type = sbcs
}
indexer
{
mem_limit = 32M
}
searchd
{
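# mysql41 = MySQL binary network protocol, so searchd can be queried with a regular MySQL client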
listen = localhost:3312:mysql41
log = /data1/arubin/sphinx_new//var/log/searchd.log
query_log = /data1/arubin/sphinx_new//var/log/query.log
read_timeout = 30
max_children = 30
pid_file = /data1/arubin/sphinx_new//var/log/searchd.pid
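# large max_matches so that group-by counts are exact (see the note above)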
max_matches = 10000000
seamless_rotate = 1
preopen_indexes = 0
unlink_old = 1
}
Comments
This is great news. Wonder if I could use PHP mysql functions for this? I will give it a try.
Tom,
The PHP API is available as well.
A GroupBy example is here: http://www.sphinxsearch.com/docs/current.html#api-funcgroup-groupby