Details about MSSQL FDW and query speedup ways


Introduction

Here I want to describe in details a FDW for MSSQL and several aproaches to speedup of queries over external tables. If in short: I want to say about follows:

  • differences between major versions of TDS FDW,
  • compatibility issues of FreeTDS and MSSQL.
  • approach to a query-optimization in case of foreign tables,
  • caching of an external data,
  • idea of an exotic way of speedup access to remote DB.

Installation and base configuration

In Internet presents lots of places with description of the installation and configuration process. So, I just leave here several links:

And a single note: do not use

sudo make USE_PGXS=1 install

from the tds_fdw nstallation manual, use

sudo USE_PGXS=1 checkinstall

instead.

Differences between major versions of the TDS FDW

For now (mid 2016) there is two actual versions of TDS FDW: 1.0.7 and 2.0.0-alpha1. Here is a list of differences (not full):

  • for me, in version 2.0.0 doesn't work queries with a DISTINCT clause by foreign tables (issue on GitHub);
  • appeared support of a WHERE pushdown, but still no support of JOIN, ORDER and other pushdowns; however, a WHERE pushdown has a problem in case of usage `query` option inside a definition of a foreign table;
  • appeared support of 7.4 tds_version.

Compatibility pitfalls

Until recently, TDS FDW can work with TDS version higher then 7.3. But, during writing this article appeared the issue on GitHub. After this issue was released support of tds_version 7.4. So, now in tds_fdw presence a compatibility with all actual versions of TDS.

For me, support of different versions of TDS is important due to a follows reason. Version 7.1 is good for MSSQL 2008 and below. In case of usage this version of TDS with MSSQL 2012 or higher, appear follows errors:

DB-Library error: DB #: 20017, DB Msg: Unexpected EOF from the server
ActiveRecord::LostConnection: TinyTds::Error: closed connection: ...
TinyTds::Error: Adaptive Server connection failed

This errors may appear during any access to a foreign table. No any special event, i.e. it appears randomly. This is annoying and required an additional logic of working with foreign tables. But a solution is simple: use proper version of TDS according to a version of MSSQL.

An unpleasure moment: in default repositories of Ubuntu 14.04 and 16.04 presence a package freetds_common with versions 0.91-5 and 0.91-6.1build1 respectivly. But support of 7.4 tds_version appear in FreeTDS 0.95 and higher. So, here is two solutions:

  1. compile FreeTDS from source;
  2. use an alternative PPA with a latest stable version of FreeTDS.

In second case you need to keep in mind one thing: in this PPA presented a package only for Ubuntu 14.04 (trusty), but you can use same PPA for Ubuntu 16.04 (xenial). In this case you needs to change name of a release in an sources.list. For me, a target source.list located in /etc/apt/sources.list.d/jamiewillis-freetds-trusty.list. For proper work it must contain follows:

deb http://ppa.launchpad.net/jamiewillis/freetds/ubuntu trusty main

And anyway, remember, for working with 7.4 tds_version required not only FreeTDS, but tds_fdw started from a commit 3a803c5.

Temporary solution

If, by any reasons, you can't update FreeTDS, you have a 20017 error and want to evade it, here is an solution. In case of web-application you can handle this error and restart a failed query. In case of Ruby on Rails application and ActiveRecord this can be done as follows:

def retry_mssql_operation(tries = 5)
begin
yield
rescue ActiveRecord::StatementInvalid => e
if e.message =~ /^PG::FdwUnableToCreateExecution/ && tries > 0
tries -= 1
retry
else
raise
end
end
end

But, please, do not use this one as a permanent solution.

Couple words about pushdown

For me understanding of a pushdown seemed to be not obvious. So, I want to describe this mechanism for those who meet this in a first time.

In simplest case, when we execute something like this:

SELECT column_name FROM foreign_table WHERE column_id = 42;

In fact, following actions occur:

  1. extract all data from a foreign_table on remote server into postgres;
  2. filtration of the extracted data according to conditions from a WHERE clause inside postgres.

So, if the foreign_table has 1M rows and we want just one of them, a database still extract all rows. This is a long and expensive operation. And here appears a pushdown. Depending from implementation of a foreign storage, a foreign data wrapper can construct an optimal query and reduce amount of rows for fetching. Due to different clauses (WHERE, JOIN, ORDER etc) a FDW must implement different algorithms of a pushdown. Moreover, not all foreign data storages support a pushdown. For example, for file_fdw a pushdown is meaningless, but for postgres_fdw or tds_fdw a pushdown is extremely useful.

In other words: a pushdown must be implemented inside FDW for all available clauses presented in the SQL-language. This implementation require lots of time due to lots of cases of usage. And that is why only rare FDWs has a full pushdown support.

Speedup of queries

All below descriptions has only one aim: make a work with an external data storage (MSSQL in current case) faster.

Pushdown

This useful in case of extracting a small amount of data from a big table. For now, in the latest version of the TDS FDW a pushdown works for a simple query with a WHERE condition.

For example, you have a table simple_table in an MSSQL database. This table has two columns: id and data. In this case, the definition of a foreign table can be follows:

CREATE FOREIGN TABLE mssql_table (
id integer,
custom_data varchar OPTIONS (column_name 'data'))
SERVER mssql_svr
OPTIONS (schema_name 'dbo', table_name 'simple_table', row_estimate_method 'showplan_all', match_column_names '1');

Here the first column has an identical names in PostgreSQL and in MSSQL tables: id. The second column has different name: custom_data in PostgreSQL and data in MSSQL. That is why here required an option called column_name. This option create a mapping of PostgreSQL columns with the columns from the external DB. A last option match_column_name explicitly specify a name mapping for other columns.

So, if execute a follows query:

SELECT custom_data FROM mssql_table WHERE id = 42;

FDW must process a condition and construct an equivalent query for MSSQL. This query may look like follows:

SELECT data FROM simple_table WHERE id = 42;

If you use tds_fdw with version 1.0.7  or below, a query for MSSQL will be different:

SELECT id, data FROM simple_table;

Remember: for now, this doesn't work for any JOIN and ORDER operations as well as for functions (UPPER, LOWER, MAX etc).

One more thing: how to know which query executed on an external server during an access to a foreign table? In the explain for the MySQL FDW appear follows row:

 Remote query: SELECT `id`, `customer_id`, `order_date` FROM `data`.`orders`

And this is very comfortable. In opposite in tds_fdw no anything about a remote query. And here can help logs of FreeTDS. By default, in FreeTDS logs is disabled. This behaviour can be changed in /etc/freetds/freetds.conf. Fore enable logs needs to uncomment couple rows:

;       dump file = /tmp/freetds.log
; debug flags = 0xffff

A comment character is a semicolon. Remove it and save the config.

Now all actions of FreeTDS will be logged in all details. But, in this configuration of logging all queries require more time for execution. Moreover, for me a simple SELECT query create ~300Mb of logs and a query with JOIN generate ~1.5Gb of data. This size can be reduced by proper value of a parameter `debug flags`. So, more about logging into FreeTDS can be found on official site in the corresponding section. Details about debug flags described in the table within this section.

Materialized view

Materialized view is a view + a table with data. A materialized view is useful in case of complex queries, i.e. for queries with join of internal and external tables, with functional conditions, with a full text search and so on.

So, pros and cons of materialized view is follows. Pros: a MV is a native PostgreSQL object, hence it perfwect works with other part of DBMS, i.e. a MV can be indexed independently from a underlying data source. Cons: a MV require a manual update by triggers, or by other events, or through a full recreate, consequently this require resources and produce a time lag between a data in a MV and a data in source.

For previously described foreign table, a materialized view can be defined as follows:

CREATE MATERIALIZED VIEW materialized_mssql_table AS
SELECT id, custom_data
FROM mssql_table;

Now, we have a clone of an external data into a local PostgreSQL cluster. Hence we obtain an ability for indexing (B-tree, GIN and GiST and so on), we have a statistic for this table, a detalization about execution plan and many other tasty things from PostgreSQL.

Update of data into a MV can be done as simple insert if somehow we can receive information about changes from MSSQL, or by full recreate using follows command:

REFRESH MATERIALIZED VIEW CONCURRENTLY materialized_mssql_table;

Option CONCURRENTLY make updating of a data without full lock of a materialized view. Note: CONCURRENTLY clause has several requirement; details about them described in the corresponding documentations page.

Exotic way

I didn't use this in a production and know a little. That is why, below I describe only an idea without certain pieces of a code. My aim: to provide an information about this approach for saving time of other developers, who want to find a solution for rare problems. Because, most from popular resources say "use FDW" on any questions about integration of PostgreSQL with other DBMS.

When this will be useful: in case of an inability of using of any previously described solutions due to some limitations. I.e. if you have a huge external database and you can't clone all required data into PostgreSQL, you have an optimal query to a foreign DB, you have a dynamical parameter for this query and you want to execute exactly this query. Simple example: execute a query with an CONTAINS function with a user specified parameter inside the PostgreSQL.

The idea concluded in usage of dbi-link or dblink-tds. This is an equivalent of dblink from PostgreSQL, but with support of other DBMS: PostgreSQL, MySQL, MS-SQL Server and Oracle in case of dbi-link and simple support of TDS protocol in case of dblink-tds. If in short: with help of this tools you may create a function in PostgreSQL. This function may receive arguments, construct a query for an external DB and execute it remotely. Then, this one can receive a data and process them. Resulted function can looks like pipeline-function and implement exactly that you want in the way you want.

If you have examples of using this tools and/or you have an experience in area of direct querying over foreign DBs from PostgreSQL, please, share your knowledge with the World.

Conclusion

At the moment there is only one good tool to work with MSSQL from PostgreSQL. This is tds_fdw. This one has severails disadvantages, but this project is alive and gradually develops. There appear new features, bug fixes and this is really cool. So, tds_fdw can solve most part of production tasks related with fetching data from MSSQL. Solution for other part, related with speed and performance, can provide PostgreSQL. And, finally, a small rest part of tasks related with a deeply optimization and with unusual cases of an access to MSSQL from PostgreSQL you will have to solve alone. Because documentation is missing, support of tools is over and source code of libraries is your best friend.

PostgreSQL FDW Foreign data wrapper pgsql MSSQL tds_fdw materialized view Ruby on Rails dblink-tds dbi-link freetds Microsoft SQL Server Ubuntu ActiveRecord pushdown

comments powered by Disqus