martes, 26 de julio de 2011

Auditing table changes

How often are you asked for logging changes in tables? AFAIR, every single customer i have had asked me for that... Until now, i have using table_log (http://pgfoundry.org/projects/tablelog) for that but this is less than
ideal for several reasons.

what table_log does is to create a new table for every table we track with the same columns as the original table has plus some added columns at the end of the table, this has a few problems:
  1. if we add new columns to the original table we need to contort things to add the same columns in the correct place in the log table if we don't want to lose history.
  2. if you drop a column, you should do the same on the log table or you couldn't be able to insert new records. in this case the solution is to lose some history
  3. this inserts 1 record per INSERT, 1 record per DELETE and 2 records per UPDATE (the old and new versions of the record are stored); so in heavily updated tables size can be double of the real one
  4. no support for logging truncate actions
  5. there is no simple way to determine which is the old and new versions of a single tuple, you need to match new versions with old versions using time of change, user and mode (UPDATE)
  6. there is missing information, at least ip of the client
for these reasons, i have made a little trigger to make this a little better (https://github.com/jcasanov/pg_audit).

what i have done here is:
  • use hstore, and ensure only log the values that have changed. because of that, if columns are added or removed we don't need to do anything to the logger table.
  • it inserts only 1 record per action so use less space than table_log
  • it logs TRUNCATE actions (it needs an additional trigger for that, though)
  • old and new values are on the same record so is really obvious what have changed in the tuple
  • i have added two fields that there aren't in table_log current_user and client_addr (i'm not sure how useful current_user is but surely the ip can be useful)
  • i have used only one table for all tables tracked, that way if we want to have separate tables for some we can partition by relid
i have written this in plpgsql but can think on move it to c if it seems useful

5 comentarios:

  1. This is perfect but i have one question : What if we have only one db user and many other -virtual- application users ( eg. some web application ) and want to log user actions ? It can be done on application level but looks like doing on db it's more sane.

    Regards.

    ResponderEliminar
  2. if you only have one db user then there's not much you can do.
    one option is to see the ip address field but i can't believe that will be accurate unless you can force users to always connect from the same ip address.
    the other option is an ugly hack that you can do from 9.0+, set application_name to something like "app=your_app_name user=your_app_user" or just with the user and parse that in the trigger.
    Having said that, probably if you have only one db user that user connects through an application that has predefined actions and probably logs that kind of high level info anyway.

    ResponderEliminar
  3. hello,
    thank you for your response. i thought i can create another table and write oplog ( userid, opid, recordid, time ) and do cross query when i need but using variables looks more logical.

    ResponderEliminar
  4. Actually, I have a very similar setup and find setting application_name from my app to work nicely.

    Particularly good is the fact that I get "psql" when I go in and manually change things.

    ResponderEliminar
  5. Minor quibble about "point #2" - you don't necessarily need to drop old columns from the audit table; you merely need them to be NULLable, and stop collecting them.

    Jan Wieck would have some interest in this; replication systems like Slony and Londiste need to capture analogous information in their "logging triggers," and the tradeoffs are quite similar. Jan's interested in arriving at some "deeply more efficient" representations for such, and has been keen on getting a binary-ish form that would resemble the heap tuple format that is actually stored internally.

    We'd benefit from having:
    1. A format that doesn't need SQL parsing (e.g. - does not pay for parsing when processing it)

    2. A format that doesn't need SQL parsing (e.g. - if you want to grab one column's data, you don't need to do fancy SQL re-parsing to try to figger out where it is)

    3. A format that doesn't repetitively "blabber" column names in every tuple. (hstore mayn't be the ideal thing, at least not with the use of column names as keys)

    ResponderEliminar