count(*) performance improvement ideas

count(*) performance improvement ideas

Pavan Deolasee
I am reading discussion about improving count(*) performance. I have
also seen a TODO for this.

Many people have suggested TRIGGER based solution to the slow count(*)
problem. I looked at the following link which presents the solution
neatly.

http://www.varlena.com/GeneralBits/120.php
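
Roughly, the approach there is a row-level trigger maintaining a single counter row. A minimal sketch of the idea (names here are illustrative; the article's exact schema may differ):

CREATE TABLE row_counts (
   table_name text PRIMARY KEY,
   total      bigint NOT NULL
);

CREATE OR REPLACE FUNCTION count_rows() RETURNS trigger AS
$$
   BEGIN
      IF TG_OP = 'INSERT' THEN
         UPDATE row_counts SET total = total + 1 WHERE table_name = 'my_table';
         RETURN NEW;
      ELSE   -- DELETE
         UPDATE row_counts SET total = total - 1 WHERE table_name = 'my_table';
         RETURN OLD;
      END IF;
   END;
$$
LANGUAGE 'plpgsql';

CREATE TRIGGER count_rows_trig AFTER INSERT OR DELETE ON my_table
   FOR EACH ROW EXECUTE PROCEDURE count_rows();

-- SELECT total FROM row_counts WHERE table_name = 'my_table';   -- instead of count(*)

Every writer then has to update that same counter row, which is where the problems below come from.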

But how does that really work for SERIALIZABLE transactions? If
two concurrent transactions INSERT/DELETE rows from a table,
the trigger execution of one of the transactions is bound to fail
because of concurrent access. Even for READ COMMITTED transactions,
the trigger execution would wait if the other transaction has executed
the trigger on the same table. Well, I think the READ COMMITTED case
can be handled with DEFERRED triggers, but that may require queuing up
too many triggers if there are many inserts/deletes in a transaction.

Running a trigger for every insert/delete seems too expensive. I wonder
if we can have a separate "counter" table (as suggested in the TRIGGER
based solution) and track total number of tuples inserted and deleted
in a transaction (and all the committed subtransactions). We then
execute a single UPDATE at the end of the transaction. With HOT,
updating the "counter" table should not be a big pain since all these
updates can potentially be HOT updates. Also since the update of
the "counter" table happens at the commit time, other transactions
inserting/deleting from the same user table may need to wait for a
very small period on the "counter" table tuple.

This still doesn't solve the serializable transaction problem
though. But I am sure we can figure out some solution for that case
as well if we agree on the general approach.

I am sure this must have been discussed before. So what are the
objections?

Thanks,
Pavan


--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: count(*) performance improvement ideas

Tom Lane-2
"Pavan Deolasee" <[hidden email]> writes:
> I am sure this must have been discussed before.

Indeed.  Apparently you didn't find the threads in which the idea of
having transactions enter "delta" entries was discussed?  Solves both
the locking and the MVCC problems, at the cost that you need to make
cleanup passes over the counter table fairly often.

I don't see this as material for the core DB but it would make a great
contrib module.

                        regards, tom lane


Re: count(*) performance improvement ideas

Bruce Momjian
Tom Lane wrote:

> "Pavan Deolasee" <[hidden email]> writes:
> > I am sure this must have been discussed before.
>
> Indeed.  Apparently you didn't find the threads in which the idea of
> having transactions enter "delta" entries was discussed?  Solves both
> the locking and the MVCC problems, at the cost that you need to make
> cleanup passes over the counter table fairly often.
>
> I don't see this as material for the core DB but it would make a great
> contrib module.

The TODO list has lots of details on this.

--
  Bruce Momjian  <[hidden email]>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: count(*) performance improvement ideas

Pavan Deolasee
In reply to this post by Tom Lane-2
On Wed, Mar 12, 2008 at 9:01 PM, Tom Lane <[hidden email]> wrote:
> "Pavan Deolasee" <[hidden email]> writes:
>  > I am sure this must have been discussed before.
>
>  Indeed.  Apparently you didn't find the threads in which the idea of
>  having transactions enter "delta" entries was discussed?  Solves both
>  the locking and the MVCC problems, at the cost that you need to make
>  cleanup passes over the counter table fairly often.
>

Ok. I will surely look at those threads. Hopefully HOT will go a long way
toward solving the counter table cleanup issue.

>  I don't see this as material for the core DB but it would make a great
>  contrib module.
>

I guess we will need some backend hooks to make it really work, no?
At the minimum we need to track the "deltas" at the transaction level
and be able to do some extra processing at commit time.
Maybe I should first read those threads and I will find the answers.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: count(*) performance improvement ideas

Mark Mielke
In reply to this post by Pavan Deolasee
Pavan Deolasee wrote:

> I am reading discussion about improving count(*) performance. I have
> also seen a TODO for this.
>
> Many people have suggested TRIGGER based solution to the slow count(*)
> problem. I looked at the following link which presents the solution
> neatly.
>
> http://www.varlena.com/GeneralBits/120.php
>
> But how does that really work for SERIALIZABLE transactions ? If
> two concurrent transactions INSERT/DELETE rows from a table,
> the trigger execution of one of the transactions is bound to fail
> because of concurrent access. Even for READ COMMITTED transactions,
> the trigger execution would wait if the other transaction has executed
> the trigger on the same table. Well, I think the READ COMMITTED case
> can be handled with DEFERRED triggers, but that may require queuing up
> too many triggers if there are many inserts/deletes in a transaction.
>
> Running trigger for every insert/delete seems too expensive. I wonder
> if we can have a separate "counter" table (as suggested in the TRIGGER
> based solution) and track total number of tuples inserted and deleted
> in a transaction (and all the committed subtransactions). We then
> execute a single UPDATE at the end of the transaction. With HOT,
> updating the "counter" table should not be a big pain since all these
> updates can potentially be HOT updates. Also since the update of
> the "counter" table happens at the commit time, other transactions
> inserting/deleting from the same user table may need to wait for a
> very small period on the "counter" table tuple.
>
> This still doesn't solve the serializable transaction problem
> though. But I am sure we can figure out some solution for that case
> as well if we agree on the general approach.
>
> I am sure this must have been discussed before. So what are the
> objections

If you are talking about automatically doing this for every table - I
have an objection that the performance impact seems unwarranted against
the gain. We are still talking about every insert or update updating
some counter table, with the only mitigating factor being that the
trigger would be coded deeper into PostgreSQL theoretically making it
cheaper?

You can already today create a trigger on insert that will append to a
summary table of some sort, whose only purpose is to maintain counts. At
the simplest, it is as somebody else suggested where you might have the
other table only store the primary keys with foreign key references back
to the main table for handling deletes and updates. Storing transaction
numbers and such might allow the data to be reduced in terms of size
(and therefore elapsed time to scan), but it seems complex.

If this really is a problem that must be solved - I prefer the
suggestion from the past of keeping track of live rows per block for a
certain transaction range, and any that fall within this range can check
off this block quickly with an exact count, then the exceptional blocks
(the ones being changed) can be scanned to be sure. But, it's still
pretty complicated to implement right and maintain, for what is probably
limited gain. I don't personally buy into the need to do exact count(*)
on a whole table quickly. I know people ask for it - but I find these
same people either confused, or trying to use this functionality to
accomplish some other end, under the assumption that because they can
get counts faster from other databases, therefore PostgreSQL should do
it as well. I sometimes wonder whether these people would even notice if
PostgreSQL translated count(*) on the whole table to query reltuples. :-)
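
For reference, that estimate is a one-liner against the catalog today - it is only as fresh as the last VACUUM/ANALYZE, and 'my_table' is a placeholder:

SELECT reltuples::bigint AS approx_rows FROM pg_class WHERE relname = 'my_table';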

Cheers,
mark

--
Mark Mielke <[hidden email]>



Re: count(*) performance improvement ideas

Pavan Deolasee
On Wed, Mar 12, 2008 at 9:14 PM, Mark Mielke <[hidden email]> wrote:
>
>  If you are talking about automatically doing this for every table - I
>  have an objection that the performance impact seems unwarranted against
>  the gain. We are still talking about every insert or update updating
>  some counter table, with the only mitigating factor being that the
>  trigger would be coded deeper into PostgreSQL theoretically making it
>  cheaper?
>

No, I am not suggesting that. If you read the proposal carefully, it's one UPDATE
per transaction. With HOT, I am hoping that the counter table may be
completely cached in memory and won't bloat much.

Also, we can always have a GUC (like pgstats) to control the overhead.

Thanks,
Pavan


--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: count(*) performance improvement ideas

Mark Mielke
Pavan Deolasee wrote:
> On Wed, Mar 12, 2008 at 9:14 PM, Mark Mielke [hidden email] wrote:
>>  If you are talking about automatically doing this for every table - I
>>  have an objection that the performance impact seems unwarranted against
>>  the gain. We are still talking about every insert or update updating
>>  some counter table, with the only mitigating factor being that the
>>  trigger would be coded deeper into PostgreSQL theoretically making it
>>  cheaper?
>
> No, I am not suggesting that. If you read the proposal carefully, it's one UPDATE
> per transaction. With HOT, I am hoping that the counter table may be
> completely cached in memory and won't bloat much.
>
> Also, we can always have a GUC (like pgstats) to control the overhead.

Fine - once per transaction instead of once per insert. Still, if there is overhead to this (updating a secondary summary table), does it really make sense to have it for every table? Most of my tables do not require count(*) on the whole table (actually - none of them do). For the same reason as I don't want oid, I don't think I would want "fast count" capabilities to impact my regular queries. Again, I don't think count(*) on the whole table is a particularly useful case. count(*) on particular subsets of the data may be, but of the whole table?

If you can make a secondary summary table to be used for count(*) optional, great. If using HOT makes the secondary table more efficient, great.

Cheers,
mark

-- 
Mark Mielke [hidden email]

Re: count(*) performance improvement ideas

Tom Lane-2
Mark Mielke <[hidden email]> writes:
> Fine - once per transaction instead of once per insert. Still, if there
> is overhead to this (updating a secondary summary table), does it really
> make sense to have it for every table?

We certainly wouldn't accept a patch that imposed this overhead on every
table.

One of the advantages of framing it as an explicit set of triggers is
that then you have a natural way of indicating which table(s) you want
the feature for (and are willing to pay the overhead to get it).

                        regards, tom lane


Re: count(*) performance improvement ideas

Pavan Deolasee
In reply to this post by Mark Mielke
On Wed, Mar 12, 2008 at 9:53 PM, Mark Mielke <[hidden email]> wrote:

>
>
>
>  Fine - once per transaction instead of once per insert. Still, if there is
> overhead to this (updating a secondary summary table), does it really make
> sense to have it for every table? Most of my tables do not require count(*)
> on the whole table (actually - none of them do). For the same reason as I
> don't want oid, I don't think I would want "fast count" capabilities to
> impact my regular queries. Again, I don't think count(*) on the whole table
> is a particularly useful case. count(*) on particular subsets of the data
> may be, but of the whole table?
>

ISTM that you are complaining because we never had a *fast* count(*)
and adding that now comes at a cost. Had it been there from day one with
the same overhead as we are talking about now, nobody would have
complained :-)

Anyways, your point is taken and it would be great if we can make it configurable,
if not at the table level then at least globally.


Thanks,
Pavan


--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: count(*) performance improvement ideas

Mark Mielke
Pavan Deolasee wrote:
> On Wed, Mar 12, 2008 at 9:53 PM, Mark Mielke [hidden email] wrote:
>>  Fine - once per transaction instead of once per insert. Still, if there is
>> overhead to this (updating a secondary summary table), does it really make
>> sense to have it for every table? Most of my tables do not require count(*)
>> on the whole table (actually - none of them do). For the same reason as I
>> don't want oid, I don't think I would want "fast count" capabilities to
>> impact my regular queries. Again, I don't think count(*) on the whole table
>> is a particularly useful case. count(*) on particular subsets of the data
>> may be, but of the whole table?
>
> ISTM that you are complaining because we never had a *fast* count(*)
> and adding that now comes at a cost. Had it been there from day one with
> the same overhead as we are talking about now, nobody would have
> complained :-)

Quite possibly - but, similarly, if PostgreSQL was generally slower due to such things, I might not have chosen PostgreSQL as my favoured db. :-) I think there is a reason PostgreSQL doesn't come with a fast count(*), and it's that the developers leading up to this point shared the opinion that this feature was not critical.

I somewhat disagree about my personal reaction, though. I complain about many things, even configurable things, such as LISTEN/NOTIFY. I occasionally wish I had such a function, but given its implementation, I would choose to use a non-PostgreSQL mechanism for implementation if available. Luckily, it's also "only suffer the performance cost if you CHOOSE to use it." Seems like a historically valuable trend.

> Anyways, your point is taken and it would be great if we can make it configurable,
> if not at the table level then at least globally

If table level, I might even try it out. If global level, I will not be trying it out. I'm only one person - but perhaps this will help point you in a direction favourable to many?

If you are REALLY feeling clever, the queries I would find this MOST beneficial for would include WHERE conditions. Just like indexes with WHERE conditions. If you get to this point, I think you are reaching something that will have far more universal benefit to existing applications. CREATE COUNT INDEX blah ON blah WHERE ... I would use this in many places.

Cheers,
mark

-- 
Mark Mielke [hidden email]

Re: count(*) performance improvement ideas

Gregory Stark-2
In reply to this post by Tom Lane-2

"Tom Lane" <[hidden email]> writes:

> Mark Mielke <[hidden email]> writes:
>> Fine - once per transaction instead of once per insert. Still, if there
>> is overhead to this (updating a secondary summary table), does it really
>> make sense to have it for every table?
>
> We certainly wouldn't accept a patch that imposed this overhead on every
> table.

If you look at this from the right angle it's actually a degenerate case of
materialized views. I think it would be more useful to approach it from
that direction even if it only supported a very limited set of expressions.

In an ideal world I would love to be able to do something like:

CREATE MATERIALIZED VIEW foo AS (select count(*) from bar) WITH INCREMENTAL UPDATES;

and have that automatically create both a heap to store the count and another
to store the incremental changes. To do this would require some magic to know
what "incremental changes" means for each aggregate where it's meaningful
though.

Then it would require some magic in the optimizer to recognize when a piece of
the query can be satisfied by a materialized view.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!


Re: count(*) performance improvement ideas

Mark Mielke
Gregory Stark wrote:

> In an ideal world I would love to be able to do something like:
>
> CREATE MATERIALIZED VIEW foo AS (select count(*) from bar) WITH INCREMENTAL UPDATES;
>
> and have that automatically create both a heap to store the count and another
> to store the incremental changes. To do this would require some magic to know
> what "incremental changes" means for each aggregate where it's meaningful
> though.
>
> Then it would require some magic in the optimizer to recognize when piece of
> the query can be satisfied by a materialized view.
>  

Hehe... +5 if you think it can be done in the next 12 - 24 months. :-)

Cheers,
mark

--
Mark Mielke <[hidden email]>



Re: count(*) performance improvement ideas

Stephen Denne
In reply to this post by Tom Lane-2
 

Tom Lane wrote
> "Pavan Deolasee" <[hidden email]> writes:
> > I am sure this must have been discussed before.
>
> Indeed.  Apparently you didn't find the threads in which the idea of
> having transactions enter "delta" entries was discussed?  Solves both
> the locking and the MVCC problems, at the cost that you need to make
> cleanup passes over the counter table fairly often.

I can't find any posts that directly address what I was looking for.

In my situation I have a small number of concurrent transactions with each transaction running a large number of single row inserts or deletes.

However I'm not after a fast count(*) from table, but more like a fast
        select grouping_id, count(*) from my_table group by grouping_id

I initially thought that what you meant by "having transactions enter "delta" entries" was that I have a trigger that would create a row each time it was called, writing how many records were inserted or deleted. I didn't understand how this would be much of an improvement, as each of my rows would contain either +1 or -1.

But I just realised you might mean to include a txid column in my table of deltas, and have my trigger insert or update the row where txid = txid_current()
(per grouping_id).

Is that what is recommended?

No locking problems as each transaction is only updating its own rows.

Can you clarify the lack of MVCC problems?
Do new versions of rows get created if the original version of the row was created in the current transaction?
Does this idea apply with the same efficiency in pre 8.3, non-HOT implementations?
Any advice on creating or avoiding indexes on the tables in question?

I can think of two options for performing the cleanup passes using current functionality:
1) Using Listen/Notify (a rough sketch follows this list)
Issue a notify whenever a new row is inserted for a transaction. They get delivered after the transaction commits, don't they? And not at all on rollback? Have an application listening for them, performing the aggregation & cleanup work.
2) Use a schedule, based on local knowledge of expected number of transactions over time.
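
A rough sketch of what option 1 could look like (all names here are made up, and the trigger that maintains the delta rows is assumed to exist already):

CREATE OR REPLACE FUNCTION notify_deltas_dirty() RETURNS trigger AS
$$
   BEGIN
      NOTIFY deltas_dirty;   -- queued now, delivered only after this transaction commits
      RETURN NULL;
   END;
$$
LANGUAGE 'plpgsql';

CREATE TRIGGER deltas_dirty_trig AFTER INSERT OR UPDATE ON my_delta_table
   FOR EACH STATEMENT EXECUTE PROCEDURE notify_deltas_dirty();

-- The aggregating application, in its own session, runs LISTEN deltas_dirty;
-- and performs the aggregation & cleanup pass whenever a notification arrives.
-- Duplicate notifications raised within one transaction are delivered as a single
-- event, and nothing is delivered if the transaction rolls back.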

So I'd rephrase Pavan's suggestion as a request to have post-transaction commit triggers that have access to (at least) the txid of the transaction that was committed.

Suggested syntax is to add the option "TRANSACTION" (or perhaps "COMMIT") to the CREATE TRIGGER statement:

CREATE TRIGGER name AFTER INSERT OR UPDATE OR DELETE ON table FOR EACH TRANSACTION EXECUTE PROCEDURE funcname ( arguments );

Any of the triggering actions on the specified table ensures that the function is called once if the transaction commits.
Requires a new TG_LEVEL.
TG_OP could be the first action triggered.

Would implementing this be extremely difficult due to transferring information from within the transaction to outside the transaction?
If so, perhaps I'd get the same benefit from having a trigger set up to fire pre-commit (or pre-prepare), and be a part of the transaction.
Would the locking difficulties be reduced as the lock would not be required till late in the game, and not held for long?

Regards,
Stephen Denne.


Re: count(*) performance improvement ideas

Tom Lane-2
"Stephen Denne" <[hidden email]> writes:
> I initially thought that what you meant by "having transactions enter "delta" entries" was that I have a trigger that would create a row each time it was called, writing how many records where inserted or deleted. I didn't understand how this would be much of an improvement, as each of my rows would contain either +1 or -1.

Well, ideally you'd aggregate all the deltas caused by a particular
transaction into one entry in the counting table.  Whether or not that
happens, though, the point of the concept is that some background task
aggregates all the deltas from long-gone transactions into just one base
row, and then deletes the old delta entries.  To get a valid value of
COUNT(*), what onlookers must do is SUM() the base row and delta records
from all transactions that they can "see" under MVCC rules.  The amount
of work involved is proportional to the number of recent updates, not
the total size of the underlying table.
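
To make that concrete, here is a minimal single-table sketch (all names invented for illustration; per-row ON INSERT / ON DELETE triggers on the underlying table would do the delta inserts):

CREATE TABLE foo_count (
    xid   bigint NOT NULL,   -- txid_current() of the writing transaction; 0 marks the base row
    delta bigint NOT NULL
);
INSERT INTO foo_count VALUES (0, 0);   -- one-off base row (seed with the table's current count(*) if it already has rows)

-- each committing transaction leaves behind rows like (txid_current(), +1) or (txid_current(), -1)

-- onlookers compute COUNT(*) by summing whatever rows their snapshot can see:
SELECT sum(delta) AS row_count FROM foo_count;

-- background cleanup pass: fold the deltas it can see into the base row, then drop them.
-- SERIALIZABLE keeps the UPDATE and the DELETE working from the same snapshot.
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE foo_count SET delta = (SELECT sum(delta) FROM foo_count) WHERE xid = 0;
DELETE FROM foo_count WHERE xid <> 0;
COMMIT;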

> However I'm not after a fast count(*) from table, but more like a fast
> select grouping_id, count(*) from my_table group by grouping_id

You could apply the same technique across each group id, though this
certainly is getting beyond what any built-in feature might offer.

> Can you clarify the lack of MVCC problems?

The point there is that the "right answer" actually depends on the
observer, since each observer might have a different snapshot and
therefore be able to "see" a different set of committed rows in the
underlying table.  The multiple-delta table handles this automatically,
because you can "see" a delta entry if and only if you could "see"
the underlying-table changes it represents.

> Does this idea apply with the same efficiency in pre 8.3, non-HOT implementations?

I didn't claim it was amazingly efficient in any implementation ;-).
HOT in particular is nearly useless since most rows in the count
table will never be updated, only inserted and eventually deleted.
You might get some mileage on the base row, but that'd be about it.
The count table will need frequent vacuums as well as frequent
aggregation scans.

It should beat scanning a large underlying table, but it's hardly
gonna be free.

                        regards, tom lane


Re: count(*) performance improvement ideas

Gregory Stark-2
"Tom Lane" <[hidden email]> writes:

> "Stephen Denne" <[hidden email]> writes:
>
>> However I'm not after a fast count(*) from table, but more like a fast
>> select grouping_id, count(*) from my_table group by grouping_id
>
> You could apply the same technique across each group id, though this
> certainly is getting beyond what any built-in feature might offer.

At that point you're talking about materialized views. Which makes it a whole
lot more interesting imho.

>> Does this idea apply with the same efficiency in pre 8.3, non-HOT implementations?
>
> I didn't claim it was amazingly efficient in any implementation ;-).
> HOT in particular is nearly useless since most rows in the count
> table will never be updated, only inserted and eventually deleted.
> You might get some mileage on the base row, but that'd be about it.
> The count table will need frequent vacuums as well as frequent
> aggregation scans.

It might be better not to update this delta table in normal transactional
updates. After all, the tuples you're deleting are precisely the ones that
nobody should be interested in any more. If you locked the table and magically
deleted those tuples and updated the master tuple using the global xmin
instead of your real xid, people would get the same result and you could
reclaim the space much much sooner. Locking the table kind of sucks though.
And crash recovery would be a problem.


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!


Re: count(*) performance improvement ideas

Stephen Denne
In reply to this post by Tom Lane-2
(Sorry for the repeat email Tom, I forgot the cc's)

Tom Lane wrote:

> "Stephen Denne" <[hidden email]> writes:
> > I initially thought that what you meant by "having
> > transactions enter "delta" entries" was that I have a trigger
> > that would create a row each time it was called, writing how
> > many records where inserted or deleted. I didn't understand
> > how this would be much of an improvement, as each of my rows
> > would contain either +1 or -1.
>
> Well, ideally you'd aggregate all the deltas caused by a particular
> transaction into one entry in the counting table.  

Yes, that's what I was attempting to do, but without changing the application code.

Using txid_current() can do that, so each of a large number of individual inserts or deletes within a transaction updates the same delta row for that transaction. I haven't found any references to this being a solution, and thought it was worth recording.

> > However I'm not after a fast count(*) from table, but more
> like a fast
> > select grouping_id, count(*) from my_table group by grouping_id
>
> You could apply the same technique across each group id, though this
> certainly is getting beyond what any built-in feature might offer.

Agreed. I've tried it out now, and am fairly happy with what I've got.

> > Can you clarify the lack of MVCC problems?
>
> The point there is that the "right answer" actually depends on the
> observer, since each observer might have a different snapshot and
> therefore be able to "see" a different set of committed rows in the
> underlying table.  The multiple-delta table handles this
> automatically,
> because you can "see" a delta entry if and only if you could "see"
> the underlying-table changes it represents.
>
> > Does this idea apply with the same efficiency in pre 8.3,
> non-HOT implementations?
>
> I didn't claim it was amazingly efficient in any implementation ;-).
> HOT in particular is nearly useless since most rows in the count
> table will never be updated, only inserted and eventually deleted.
> You might get some mileage on the base row, but that'd be about it.
> The count table will need frequent vacuums as well as frequent
> aggregation scans.
>
> It should beat scanning a large underlying table, but it's hardly
> gonna be free.

What I was asking about with those questions, is if a single row is inserted in transaction A, and updated 1000 times still within transaction A, then transaction A commits... does a single row version get written, or 1001, 1000 of which are removable?

I tested this with a small summary table freshly vacuum full'ed.

10 transactions, one after the other; each transaction inserted three delta rows, updated one of them three times, and updated the other two five times each.
So 3 inserts and 13 updates per transaction.
The updates only affect non-indexed fields in rows created in the same transaction.

The next vacuum full found 52 removable row versions.

I repeated the test, and got 13 removable row versions.

I repeated the test again, and got 13 removable row versions.

I repeated just one of the ten transactions, 13 removable row versions.

All inserts and updates are probably in the one page that has a fair bit of free space.
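
(For anyone wanting to reproduce this, each of those transactions was of roughly the following shape - table and column names here are placeholders, the real schema appears later in this thread:)

BEGIN;
-- three delta rows inserted for the transaction...
INSERT INTO summary_deltas (grouping_id, txid, documents) VALUES (1, txid_current(), 1);
INSERT INTO summary_deltas (grouping_id, txid, documents) VALUES (2, txid_current(), 1);
INSERT INTO summary_deltas (grouping_id, txid, documents) VALUES (3, txid_current(), 1);
-- ...then updated repeatedly within the same transaction (3 + 5 + 5 = 13 updates in total)
UPDATE summary_deltas SET documents = documents + 1
   WHERE grouping_id = 1 AND txid = txid_current();
-- (repeat for the other grouping_ids)
COMMIT;

VACUUM VERBOSE summary_deltas;   -- reports how many row versions were removable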

Is it possible to update the HOT code to re-use row versions on the same page if they were created in the same transaction?

Conclusion: making use of txid_current(), I can get single delta rows with deltas of 10000, but doing so creates 10000 dead row versions.

Regards,
Stephen Denne.


Re: count(*) performance improvement ideas

Tom Lane-2
"Stephen Denne" <[hidden email]> writes:
> What I was asking about with those questions, is if a single row is inserted in transaction A, and updated 1000 times still within transaction A, then transaction A commits... does a single row version get written, or 1001, 1000 of which are removable?

Umm ... AFAICS there is no need for an UPDATE to touch the count table
at all.  You'd only need ON INSERT and ON DELETE triggers.

                        regards, tom lane


Re: count(*) performance improvement ideas

Stephen Denne
Tom Lane wrote
> "Stephen Denne" <[hidden email]> writes:
> > What I was asking about with those questions, is if a
> > single row is inserted in transaction A, and updated 1000
> > times still within transaction A, then transaction A
> > commits... does a single row version get written, or 1001,
> > 1000 of which are removable?
>
> Umm ... AFAICS there is no need for an UPDATE to touch the count table
> at all.  You'd only need ON INSERT and ON DELETE triggers.

I'm not referring to updates of my base table... the single row inserted was referring to the delta row...

I'm trying to group together a large number of +1 or -1 deltas into a single delta per transaction.

A cut down example:

CREATE TABLE document_type_summary_detail
(
  document_type_id integer NOT NULL,
  txid bigint NOT NULL DEFAULT 0,
  documents bigint NOT NULL DEFAULT 0,
  CONSTRAINT pk_document_type_summary PRIMARY KEY (document_type_id, txid)
);

CREATE OR REPLACE FUNCTION document_count_trig()
RETURNS TRIGGER AS
$$
   DECLARE
   BEGIN
   IF TG_OP = 'INSERT' THEN
      UPDATE document_type_summary_detail set documents=documents+1 where document_type_id = NEW.document_type_id and txid=txid_current();
      IF NOT FOUND THEN
         INSERT INTO document_type_summary_detail (document_type_id,documents,txid) VALUES(NEW.document_type_id,1,txid_current());
      END IF;
      RETURN NEW;
   ELSIF TG_OP = 'DELETE' THEN
      UPDATE document_type_summary_detail set documents=documents-1 where document_type_id = OLD.document_type_id and txid=txid_current();
      IF NOT FOUND THEN
         INSERT INTO document_type_summary_detail (document_type_id,documents,txid) VALUES(OLD.document_type_id,-1,txid_current());
      END IF;
      RETURN OLD;
   END IF;
   END;
$$
LANGUAGE 'plpgsql';
       
create trigger document_count_trig before insert or delete on document for each row execute procedure document_count_trig();

--one off setup:
insert into document_type_summary_detail (document_type_id,txid,documents)
select dt.id, 0, count(d.id) from document_type dt left join document d on d.document_type_id = dt.id
group by 1,2;


--useful view:
CREATE OR REPLACE VIEW document_type_summary AS SELECT document_type_id, sum(documents) AS documents FROM document_type_summary_detail GROUP BY document_type_id;


--scheduled cleanup:
CREATE OR REPLACE FUNCTION document_type_summary_aggregate() RETURNS void AS
$$
   BEGIN
      -- ensure every document_type_id has a txid = 0 base row to fold into
      INSERT INTO document_type_summary_detail (document_type_id)
         SELECT DISTINCT document_type_id FROM document_type_summary_detail
         EXCEPT SELECT document_type_id FROM document_type_summary_detail WHERE txid = 0;
      -- fold all currently visible deltas into the base row
      UPDATE document_type_summary_detail SET documents = v.documents
         FROM document_type_summary AS v
         WHERE document_type_summary_detail.document_type_id = v.document_type_id
           AND document_type_summary_detail.txid = 0
           AND EXISTS (SELECT 1 FROM document_type_summary_detail ss
                       WHERE ss.document_type_id = document_type_summary_detail.document_type_id AND ss.txid <> 0);
      -- drop the delta rows that have just been folded in
      DELETE FROM document_type_summary_detail WHERE txid <> 0;
   END;
$$
LANGUAGE 'plpgsql';


My assumption is that this solves the "locking causes serialisation of transactions" problem as the only rows updated are those inserted by the same transaction.
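
For completeness, usage is then simply:

-- fast grouped count, correct for the caller's snapshot:
SELECT document_type_id, documents FROM document_type_summary;

-- periodic cleanup pass, e.g. from cron or any scheduler:
SELECT document_type_summary_aggregate();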

Regards,
Stephen Denne.


Re: count(*) performance improvement ideas

Tom Lane-2
"Stephen Denne" <[hidden email]> writes:
> Tom Lane wrote
>> Umm ... AFAICS there is no need for an UPDATE to touch the count table
>> at all.  You'd only need ON INSERT and ON DELETE triggers.

> I'm not referring to updates of my base table... the single row inserted was referring to the delta row...

> I'm trying to group together a large number of +1 or -1 deltas into a single delta per transaction.

Oh.  You can't realistically do that from the level of individual ON
INSERT/DELETE triggers --- as you've found out, you end up with about
the same number of dead rows anyway.  Collapsing a transaction's changes
into a single row would require keeping some intra-transaction state,
which is do-able at the C-code level but not from plpgsql.

                        regards, tom lane


Re: count(*) performance improvement ideas

Mark Mielke
In reply to this post by Tom Lane-2
Tom Lane wrote:
> "Stephen Denne" [hidden email] writes:
>> What I was asking about with those questions, is if a single row is inserted in transaction A, and updated 1000 times still within transaction A, then transaction A commits... does a single row version get written, or 1001, 1000 of which are removable?
>
> Umm ... AFAICS there is no need for an UPDATE to touch the count table
> at all.  You'd only need ON INSERT and ON DELETE triggers.

This returns to the question of whether count of the whole table is useful, or whether count of a GROUP BY or WHERE is useful. If GROUP BY or WHERE is useful, then trigger on UPDATE becomes necessary.

What is the direction here? Is it count of the whole table only? (<-- not interesting to me)

Or counts for more practical real-life examples? There I completely agree with Greg that this gets into the materialized view realm, and becomes very interesting.

In my current db project, I never count all of the rows in a table. However, I do use count(*) with GROUP BY and WHERE.

Cheers,
mark

-- 
Mark Mielke [hidden email]
