Re: [HACKERS] Cost of XLogInsert CRC calculations

Manfred Koizar
On Wed, 18 May 2005 13:50:22 +0200, I wrote:
>The most important figure is, that at MaxSpeed (/O2) 2x32 is almost
>twice as fast as CRC32 while only being marginally slower than CRC32.
                  ^^^^^
Silly typo!  That should have been:
The most important figure is, that at MaxSpeed (/O2) 2x32 is almost
twice as fast as CRC64 while only being marginally slower than CRC32.

Servus
 Manfred



Re: [HACKERS] Cost of XLogInsert CRC calculations

Mark Cave-Ayland-2
In reply to this post by Manfred Koizar

> -----Original Message-----
> From: Manfred Koizar [mailto:[hidden email]]
> Sent: 25 May 2005 20:25
> To: Manfred Koizar
> Cc: Tom Lane; Greg Stark; Bruce Momjian; Mark Cave-Ayland
> (External); [hidden email]
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

> The most important figure is, that at MaxSpeed (/O2) 2x32 is
> almost twice as fast as CRC64 while only being marginally
> slower than CRC32.
>
> Servus
>  Manfred


Hi Manfred,

Sorry about taking a while to respond on this one - the hard drive on my
laptop crashed :(. I repeated your tests on my P4 laptop with gcc 3.2.3;
the results are reproduced below:


Opt   32     32a    32b    2x32   64      64a     64b
------------------------------------------------------
O1    4.91   4.86   5.43   6.00   11.40   11.39   11.39
O2    4.96   4.94   4.69   5.18   15.86   18.75   24.73
O3    4.82   4.83   4.64   5.18   15.14   13.77   14.73

So in summary I would say:

        - Calculating a CRC64 using 2 x 32-bit ints can be 3 times as fast
          as using 1 x 64-bit int on my 32-bit Intel laptop with gcc.

        - The time difference between CRC32 and CRC64 is about 0.5s in the
          worst case shown during testing, so staying with CRC64 would not
          inflict too great a penalty.


Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk




Re: [HACKERS] Cost of XLogInsert CRC calculations

Tom Lane-2
"Mark Cave-Ayland" <[hidden email]> writes:
> Opt   32     32a    32b    2x32   64      64a     64b
> ------------------------------------------------------
> O1    4.91   4.86   5.43   6.00   11.40   11.39   11.39
> O2    4.96   4.94   4.69   5.18   15.86   18.75   24.73
> O3    4.82   4.83   4.64   5.18   15.14   13.77   14.73

Not sure I believe these numbers.  Shouldn't 2x32 be about twice as slow
as just one 32-bit CRC?

                        regards, tom lane


Re: [HACKERS] Cost of XLogInsert CRC calculations

Mark Cave-Ayland-2

> -----Original Message-----
> From: Tom Lane [mailto:[hidden email]]
> Sent: 27 May 2005 15:00
> To: Mark Cave-Ayland (External)
> Cc: 'Manfred Koizar'; 'Greg Stark'; 'Bruce Momjian';
> [hidden email]
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations
>
>
> "Mark Cave-Ayland" <[hidden email]> writes:
> > Opt   32     32a    32b    2x32   64      64a     64b
> > ------------------------------------------------------
> > O1    4.91   4.86   5.43   6.00   11.40   11.39   11.39
> > O2    4.96   4.94   4.69   5.18   15.86   18.75   24.73
> > O3    4.82   4.83   4.64   5.18   15.14   13.77   14.73
>
> Not sure I believe these numbers.  Shouldn't 2x32 be about
> twice as slow as just one 32-bit CRC?

Well, it surprised me, although Manfred's results with VC6 at /MaxSpeed show
a similar margin. The real killer is that I wrote a CRC32 routine in x86
inline assembler (which, unlike the gcc-produced version, keeps the CRC in
registers on each iteration instead of storing it in memory as part of the
current frame) and even that comes in at 6.5s....


Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk




Re: [HACKERS] Cost of XLogInsert CRC calculations

Mark Cave-Ayland-2
In reply to this post by Tom Lane-2

> -----Original Message-----
> From: Tom Lane [mailto:[hidden email]]
> Sent: 27 May 2005 15:00
> To: Mark Cave-Ayland (External)
> Cc: 'Manfred Koizar'; 'Greg Stark'; 'Bruce Momjian';
> [hidden email]
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

> Not sure I believe these numbers.  Shouldn't 2x32 be about twice as
> slow as just one 32-bit CRC?

Also, I've just quickly run Manfred's program on the Xeon Linux FC1 box I
used for my original program, and the margin is even closer:

Opt   32     32a    32b    2x32   64     64a     64b
-----------------------------------------------------
O1    2.75   2.81   2.71   3.16   3.53   3.64    7.25
O2    2.75   2.78   2.87   2.94   7.63   10.61   11.93
O3    2.84   2.85   3.03   2.99   7.63   7.64    7.71

I don't know whether gcc is just producing an inefficient CRC32 compared to
the 2x32 version, but the results seem very odd... There must be something
else we are missing?


Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk




Re: [HACKERS] Cost of XLogInsert CRC calculations

Tom Lane-2
"Mark Cave-Ayland" <[hidden email]> writes:
> I don't know whether gcc is just producing an inefficient CRC32 compared to
> 2x32 but the results seem very odd.... There must be something else we are
> missing?

I went back and looked at the code, and see that I was misled by
terminology: what we've been calling "2x32" in this thread is not two
independent CRC32 calculations, it is use of 32-bit arithmetic to execute
one CRC64 calculation.  The inner loop looks like

    while (__len-- > 0)
    {
        int        __tab_index = ((int) (__crc1 >> 24) ^ *__data++) & 0xFF;

        __crc1 = crc_table1[__tab_index] ^ ((__crc1 << 8) | (__crc0 >> 24));
        __crc0 = crc_table0[__tab_index] ^ (__crc0 << 8);
    }

whereas a plain CRC32 looks like

    while (__len-- > 0)
    {
        int        __tab_index = ((int) (crc >> 24) ^ *__data++) & 0xFF;

        crc = crc_table[__tab_index] ^ (crc << 8);
    }

where the crc variables are uint32 in both cases.  (The true 64-bit
calculation looks like the latter, except that the crc variable is
uint64, as is the crc_table, and the >> 24 becomes >> 56.  The "2x32"
code is an exact emulation of the true 64-bit code, with __crc1 and
__crc0 holding the high and low halves of the 64-bit crc.)

In my tests the second loop is about 10% faster than the first on an
Intel machine, and maybe 20% faster on HPPA.  So evidently the bulk of
the cost is in the __tab_index calculation, and not so much in the table
fetches.  This is still a bit surprising, but it's not totally silly.

Based on the numbers we've seen so far, one could argue for staying
with the 64-bit CRC, but changing the rule we use for selecting which
implementation code to use: use the true 64-bit code only when
sizeof(unsigned long) == 64, and otherwise use the 2x32 code, even if
there is a 64-bit unsigned long long type available.  This essentially
assumes that the unsigned long long type isn't very efficient, which
isn't too unreasonable.  This would buy most of the speedup without
giving up anything at all in the error-detection department.

Alternatively, we might say that 64-bit CRC was overkill from day one,
and we'd rather get the additional 10% or 20% or so speedup.  I'm kinda
leaning in that direction, but only weakly.

Comments?

                        regards, tom lane


Re: [HACKERS] Cost of XLogInsert CRC calculations

Bruce Momjian-2
Tom Lane wrote:
> Alternatively, we might say that 64-bit CRC was overkill from day one,
> and we'd rather get the additional 10% or 20% or so speedup.  I'm kinda
> leaning in that direction, but only weakly.

Yes, I lean in that direction too since the CRC calculation is showing
up in our profiling.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  [hidden email]               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: [HACKERS] Cost of XLogInsert CRC calculations

Mark Cave-Ayland-2
In reply to this post by Tom Lane-2

> -----Original Message-----
> From: Tom Lane [mailto:[hidden email]]
> Sent: 27 May 2005 17:49
> To: Mark Cave-Ayland (External)
> Cc: 'Manfred Koizar'; 'Greg Stark'; 'Bruce Momjian';
> [hidden email]
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

> I went back and looked at the code, and see that I was misled by
> terminology: what we've been calling "2x32" in this thread is
> not two independent CRC32 calculations, it is use of 32-bit
> arithmetic to execute one CRC64 calculation.

Yeah, I did find the terminology a little confusing until I looked at the
source itself. It doesn't make much sense publishing numbers if you don't
know their meaning ;)

> Based on the numbers we've seen so far, one could argue for
> staying with the 64-bit CRC, but changing the rule we use for
> selecting which implementation code to use: use the true
> 64-bit code only when sizeof(unsigned long) == 8 (i.e., 64 bits), and
> otherwise use the 2x32 code, even if there is a 64-bit
> unsigned long long type available.  This essentially assumes
> that the unsigned long long type isn't very efficient, which
> isn't too unreasonable.  This would buy most of the speedup
> without giving up anything at all in the error-detection department.

All our servers are x86-based Linux boxes with gcc, so if a factor-of-2
speedup in CPU cost is the minimum improvement we get as a result of this
thread, then I would be very happy.

> Alternatively, we might say that 64-bit CRC was overkill from
> day one, and we'd rather get the additional 10% or 20% or so
> speedup.  I'm kinda leaning in that direction, but only weakly.

What would you need to persuade you either way? I believe that disk drives
use CRCs internally to verify that the data has been read correctly from
each sector. If the majority of errors come from disk failures, then a
corrupt sector would have to pass both the drive CRC *and* the PostgreSQL
CRC in order for an XLog entry to be considered valid. I would have thought
the chances of that happening are reasonably small, and so even with CRC32
this kind of corruption can be detected fairly reliably.

In the case of an OS crash we could argue that there may be a partially
written sector on the disk, in which case again one or both of the drive
CRC and the PostgreSQL CRC would be incorrect, and so this condition could
also be reliably detected using CRC32.

As far as I can tell, the main impact of this change would be a reduced
ability to detect multiple random bit errors, which is more the type of
error I would expect to occur in RAM (alpha particles etc.). How often
would that be likely to occur? I believe that different generator
polynomials have different characteristics that can make them more or less
sensitive to particular types of error. Perhaps Manfred can tell us the
generator polynomial that was used to create the lookup tables?


Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk




Re: [HACKERS] Cost of XLogInsert CRC calculations

Tom Lane-2
"Mark Cave-Ayland" <[hidden email]> writes:
>> Alternatively, we might say that 64-bit CRC was overkill from
>> day one, and we'd rather get the additional 10% or 20% or so
>> speedup.  I'm kinda leaning in that direction, but only weakly.

> What would you need to persuade you either way? I believe that disk drives
> use CRCs internally to verify that the data has been read correctly from
> each sector. If the majority of the errors would be from a disk failure,
> then a corrupt sector would have to pass the drive CRC *and* the PostgreSQL
> CRC in order for an XLog entry to be considered valid. I would have thought
> the chances of this being able to happen would be reasonably small and so
> even with CRC32 this can be detected fairly accurately.

It's not really a matter of backstopping the hardware's error detection;
if we were trying to do that, we'd keep a CRC for every data page, which
we don't.  The real reason for the WAL CRCs is as a reliable method of
identifying the end of WAL: when the "next record" doesn't checksum you
know it's bogus.  This is a nontrivial point because of the way that we
re-use WAL files --- the pages beyond the last successfully written page
aren't going to be zeroes, they'll be filled with random WAL data.
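
To illustrate the idea (with a hypothetical record layout and a bitwise
CRC32 - these are not the real xlog.c structures or replay loop), the
end-of-WAL logic amounts to something like:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical record header, for illustration only. */
    typedef struct
    {
        uint32_t xl_len;   /* payload length */
        uint32_t xl_crc;   /* CRC of the payload */
    } RecHdr;

    /* Bitwise (table-free) CRC32, polynomial 0x04C11DB7. */
    static uint32_t
    crc32_buf(const unsigned char *p, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;

        while (len-- > 0)
        {
            crc ^= (uint32_t) *p++ << 24;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u
                                          : (crc << 1);
        }
        return crc;
    }

    /* Scan records one after another; the first record whose checksum
     * fails is taken to be the end of valid WAL.  This is what makes it
     * safe to recycle files that are still full of old WAL data. */
    static size_t
    find_end_of_wal(const unsigned char *wal, size_t walsize)
    {
        size_t off = 0;

        while (off + sizeof(RecHdr) <= walsize)
        {
            RecHdr hdr;

            memcpy(&hdr, wal + off, sizeof(hdr));
            if (off + sizeof(hdr) + hdr.xl_len > walsize)
                break;
            if (crc32_buf(wal + off + sizeof(hdr), hdr.xl_len) != hdr.xl_crc)
                break;          /* "next record" doesn't checksum: stop */
            /* ... replay the record here ... */
            off += sizeof(hdr) + hdr.xl_len;
        }
        return off;             /* byte offset where valid WAL ends */
    }

    int
    main(void)
    {
        unsigned char wal[64];
        RecHdr hdr = { 5, 0 };

        memset(wal, 0xAA, sizeof(wal));        /* "recycled" garbage */
        hdr.xl_crc = crc32_buf((const unsigned char *) "hello", 5);
        memcpy(wal, &hdr, sizeof(hdr));
        memcpy(wal + sizeof(hdr), "hello", 5); /* one valid record */

        printf("valid WAL ends at offset %zu\n",
               find_end_of_wal(wal, sizeof(wal)));
        return 0;
    }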

Personally I think CRC32 is plenty for this job, but there were those
arguing loudly for CRC64 back when we made the decision originally ...

                        regards, tom lane


Re: [HACKERS] Cost of XLogInsert CRC calculations

Greg Stark-3

Tom Lane <[hidden email]> writes:

> It's not really a matter of backstopping the hardware's error detection;
> if we were trying to do that, we'd keep a CRC for every data page, which
> we don't.  The real reason for the WAL CRCs is as a reliable method of
> identifying the end of WAL: when the "next record" doesn't checksum you
> know it's bogus.  This is a nontrivial point because of the way that we
> re-use WAL files --- the pages beyond the last successfully written page
> aren't going to be zeroes, they'll be filled with random WAL data.

Is the random WAL data really the concern? It seems like a more reliable way
of dealing with that would be to just accompany every WAL entry with a
sequential id and stop when the next id isn't the correct one.

I thought the problem was that if the machine crashed in the middle of writing
a WAL entry you wanted to be sure to detect that. And there's no guarantee the
fsync will write out the WAL entry in order. So it's possible that the end
(and beginning) of the WAL entry will be there but the middle will still be
unwritten.

The only truly reliable way to handle this would require two fsyncs per
transaction commit which would be really unfortunate.

> Personally I think CRC32 is plenty for this job, but there were those
> arguing loudly for CRC64 back when we made the decision originally ...

So given the frequency of database crashes and WAL replays, if having one
failed replay every few million years is acceptable, I think 32 bits is more
than enough. Frankly, I think 16 bits would be enough.

--
greg



Re: [HACKERS] Cost of XLogInsert CRC calculations

Tom Lane-2
Greg Stark <[hidden email]> writes:
> Tom Lane <[hidden email]> writes:
>> It's not really a matter of backstopping the hardware's error detection;
>> if we were trying to do that, we'd keep a CRC for every data page, which
>> we don't.  The real reason for the WAL CRCs is as a reliable method of
>> identifying the end of WAL: when the "next record" doesn't checksum you
>> know it's bogus.

> Is the random WAL data really the concern? It seems like a more reliable way
> of dealing with that would be to just accompany every WAL entry with a
> sequential id and stop when the next id isn't the correct one.

We do that, too (the xl_prev links and page header addresses serve that
purpose).  But it's not sufficient given that WAL records can span pages
and therefore may be incompletely written.

> The only truly reliable way to handle this would require two fsyncs per
> transaction commit which would be really unfortunate.

How are two fsyncs going to be better than one?

                        regards, tom lane


Re: [HACKERS] Cost of XLogInsert CRC calculations

Greg Stark-3
Tom Lane <[hidden email]> writes:

> > Is the random WAL data really the concern? It seems like a more reliable way
> > of dealing with that would be to just accompany every WAL entry with a
> > sequential id and stop when the next id isn't the correct one.
>
> We do that, too (the xl_prev links and page header addresses serve that
> purpose).  But it's not sufficient given that WAL records can span pages
> and therefore may be incompletely written.

Right, so the problem isn't that there may be stale data that would be
indistinguishable from real data. The problem is that the real data may be
partially there but not all there.

> > The only truly reliable way to handle this would require two fsyncs per
> > transaction commit which would be really unfortunate.
>
> How are two fsyncs going to be better than one?

Well, you fsync the WAL entry, and only when that's complete do you flip a
bit marking the WAL entry as committed and fsync again.

Hm, you might need three fsyncs, one to make sure the bit isn't set before
writing out the WAL record itself.

--
greg



Re: [HACKERS] Cost of XLogInsert CRC calculations

Tom Lane-2
Greg Stark <[hidden email]> writes:
> Tom Lane <[hidden email]> writes:
>>> Is the random WAL data really the concern? It seems like a more reliable way
>>> of dealing with that would be to just accompany every WAL entry with a
>>> sequential id and stop when the next id isn't the correct one.
>>
>> We do that, too (the xl_prev links and page header addresses serve that
>> purpose).  But it's not sufficient given that WAL records can span pages
>> and therefore may be incompletely written.

Actually, on reviewing the code I notice two missed bets here.

1. During WAL replay, we aren't actually verifying that xl_prev matches
the address of the prior WAL record.  This means we are depending only
on the page header addresses to make sure we don't replay stale WAL data
left over from the previous cycle of use of the physical WAL file.  This
is fairly dangerous, considering the likelihood of partial write of a
WAL page during a power failure: the first 512-byte sector(s) of a page
may have been updated but not the rest.  If an old WAL record happens to
start at exactly the sector boundary then we lose.

2. We store a separate CRC for each backup block attached to a WAL
record.  Therefore the same torn-page problem could hit us if a stale
backup block starts exactly at an intrapage sector boundary --- there is
nothing guaranteeing that the backup block really goes with the WAL
record.

#1 seems like a pretty critical, but also easily fixed, bug.  To fix #2
I suppose we'd have to modify the WAL format to store just one CRC
covering the whole of a WAL record and its attached backup block(s).

I think the reasoning behind the separate CRCs was to put a limit on
the amount of data guarded by one CRC, and thereby minimize the risk
of undetected errors.  But using the CRCs in this way is failing to
guard against exactly the problem that we want the CRCs to guard against
in the first place, namely torn WAL records ... so worrying about
detection reliability seems misplaced.

The odds of an actual failure from case #2 are fortunately not high,
since a backup block will necessarily span across at least one WAL page
boundary and so we should be able to detect stale data by noting that
the next page's header address is wrong.  (If it's not wrong, then at
least the first sector of the next page is up-to-date, so if there is
any tearing the CRC should tell us.)  Therefore I don't feel any need
to try to work out a back-patchable solution for #2.  But I do think we
ought to change the WAL format going forward to compute just one CRC
across a WAL record and all attached backup blocks.  There was talk of
allowing compression of backup blocks, and if we do that we could no
longer feel any certainty that a page crossing would occur.
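
For point #1, the missing check is conceptually tiny. A simplified sketch
(cut-down, hypothetical record shapes - not the real XLogRecord/XLogRecPtr
declarations):

    #include <stdbool.h>
    #include <stdint.h>

    /* Cut-down stand-ins for XLogRecPtr and the record header. */
    typedef struct
    {
        uint32_t xlogid;    /* log file # */
        uint32_t xrecoff;   /* byte offset within the log file */
    } RecPtr;

    typedef struct
    {
        RecPtr xl_prev;     /* start address of the previous WAL record */
        /* ... remaining header fields and payload ... */
    } RecHdr;

    /* During replay, before trusting a record, verify that its back-link
     * points at the record just applied.  Stale data left over from a
     * recycled WAL file will normally fail this test even when the page
     * header address happens to look right. */
    static bool
    xl_prev_ok(const RecHdr *rec, RecPtr prev_start)
    {
        return rec->xl_prev.xlogid == prev_start.xlogid &&
               rec->xl_prev.xrecoff == prev_start.xrecoff;
    }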

Thoughts?

                        regards, tom lane


Re: [HACKERS] Cost of XLogInsert CRC calculations

Manfred Koizar
In reply to this post by Mark Cave-Ayland-2
On Tue, 31 May 2005 12:07:53 +0100, "Mark Cave-Ayland"
<[hidden email]> wrote:
>Perhaps Manfred can tell us the generator
>polynomial that was used to create the lookup tables?

X^32 + X^26 + X^23 + X^22 + X^16 + X^12 + X^11 + X^10 + X^8 + X^7 + X^5 + X^4 + X^2 + X + 1

-> http://www.opengroup.org/onlinepubs/009695399/utilities/cksum.html

Or google for "04c11db7".
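
For reference, here is a minimal sketch of how a lookup table is derived
from that generator (the textbook MSB-first construction, not code copied
from the PostgreSQL sources):

    #include <stdint.h>
    #include <stdio.h>

    #define POLY 0x04C11DB7u    /* the generator polynomial above */

    int
    main(void)
    {
        uint32_t table[256];

        for (uint32_t i = 0; i < 256; i++)
        {
            uint32_t r = i << 24;   /* byte enters at the top of the register */

            for (int bit = 0; bit < 8; bit++)
                r = (r & 0x80000000u) ? (r << 1) ^ POLY : (r << 1);
            table[i] = r;
        }

        for (int i = 0; i < 256; i++)
            printf("0x%08X,%s", table[i], (i % 6 == 5) ? "\n" : " ");
        printf("\n");
        return 0;
    }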

Servus
 Manfred


Re: [HACKERS] Cost of XLogInsert CRC calculations

Simon Riggs
In reply to this post by Tom Lane-2
On Tue, 2005-05-31 at 12:27 -0400, Tom Lane wrote:

> Greg Stark <[hidden email]> writes:
> > Tom Lane <[hidden email]> writes:
> >>> Is the random WAL data really the concern? It seems like a more reliable way
> >>> of dealing with that would be to just accompany every WAL entry with a
> >>> sequential id and stop when the next id isn't the correct one.
> >>
> >> We do that, too (the xl_prev links and page header addresses serve that
> >> purpose).  But it's not sufficient given that WAL records can span pages
> >> and therefore may be incompletely written.
>
> Actually, on reviewing the code I notice two missed bets here.
>
> 1. During WAL replay, we aren't actually verifying that xl_prev matches
> the address of the prior WAL record.  This means we are depending only
> on the page header addresses to make sure we don't replay stale WAL data
> left over from the previous cycle of use of the physical WAL file.  This
> is fairly dangerous, considering the likelihood of partial write of a
> WAL page during a power failure: the first 512-byte sector(s) of a page
> may have been updated but not the rest.  If an old WAL record happens to
> start at exactly the sector boundary then we lose.

Hmmm. I seem to recall asking myself why xl_prev existed if it wasn't
used, but passed that by. Damn.

> 2. We store a separate CRC for each backup block attached to a WAL
> record.  Therefore the same torn-page problem could hit us if a stale
> backup block starts exactly at an intrapage sector boundary --- there is
> nothing guaranteeing that the backup block really goes with the WAL
> record.
>
> #1 seems like a pretty critical, but also easily fixed, bug.  To fix #2
> I suppose we'd have to modify the WAL format to store just one CRC
> covering the whole of a WAL record and its attached backup block(s).
>
> I think the reasoning behind the separate CRCs was to put a limit on
> the amount of data guarded by one CRC, and thereby minimize the risk
> of undetected errors.  But using the CRCs in this way is failing to
> guard against exactly the problem that we want the CRCs to guard against
> in the first place, namely torn WAL records ... so worrying about
> detection reliability seems misplaced.
>
> The odds of an actual failure from case #2 are fortunately not high,
> since a backup block will necessarily span across at least one WAL page
> boundary and so we should be able to detect stale data by noting that
> the next page's header address is wrong.  (If it's not wrong, then at
> least the first sector of the next page is up-to-date, so if there is
> any tearing the CRC should tell us.)  Therefore I don't feel any need
> to try to work out a back-patchable solution for #2.  But I do think we
> ought to change the WAL format going forward to compute just one CRC
> across a WAL record and all attached backup blocks.  There was talk of
> allowing compression of backup blocks, and if we do that we could no
> longer feel any certainty that a page crossing would occur.
>
> Thoughts?

PreAllocXLog was already a reason to have somebody prepare new xlog
files ahead of their being used. Surely the right solution here is to
have that agent prepare fresh/zeroed files before they are required.
That way no stale data can ever occur, and both of these bugs go
away....

Fixing that can be backpatched so that the backend that switches files
can do the work, rather than bgwriter [ or ?].

Best Regards, Simon Riggs



Re: [HACKERS] Cost of XLogInsert CRC calculations

Tom Lane-2
Simon Riggs <[hidden email]> writes:
> Hmmm. I seem to recall asking myself why xl_prev existed if it wasn't
> used, but passed that by. Damn.

I couldn't believe it'd been overlooked this long, either.  It's the
sort of thing that you assume got done the first time :-(

> PreAllocXLog was already a reason to have somebody prepare new xlog
> files ahead of them being used. Surely the right solution here is to
> have that agent prepare fresh/zeroed files prior to them being required.

Uh, why?  That doubles the amount of physical I/O required to maintain
the WAL, and AFAICS it doesn't really add any safety that we can't get
in a more intelligent fashion.

                        regards, tom lane


Re: [HACKERS] Cost of XLogInsert CRC calculations

Simon Riggs
On Tue, 2005-05-31 at 22:36 -0400, Tom Lane wrote:
> Simon Riggs <[hidden email]> writes:
> > Hmmm. I seem to recall asking myself why xl_prev existed if it wasn't
> > used, but passed that by. Damn.
>
> I couldn't believe it'd been overlooked this long, either.  It's the
> sort of thing that you assume got done the first time :-(

Guess it shows how infrequently PostgreSQL crashes and recovers.

> > PreAllocXLog was already a reason to have somebody prepare new xlog
> > files ahead of them being used. Surely the right solution here is to
> > have that agent prepare fresh/zeroed files prior to them being required.
>
> Uh, why?  That doubles the amount of physical I/O required to maintain
> the WAL, and AFAICS it doesn't really add any safety that we can't get
> in a more intelligent fashion.

OK, I agree that the xl_prev linkage is the more intelligent way to go.

If I/O is a problem, then surely you will agree that PreAllocXLog is
still required and should not be run by a backend? That's going to show up
as a big response time spike for that user.

That's the last bastion - the other changes are gonna smooth response
times right down, so can we do something with PreAllocXLog too?

Best Regards, Simon Riggs



Re: [HACKERS] Cost of XLogInsert CRC calculations

Mark Cave-Ayland-2
In reply to this post by Tom Lane-2

> -----Original Message-----
> From: Tom Lane [mailto:[hidden email]]
> Sent: 31 May 2005 17:27
> To: Greg Stark
> Cc: Mark Cave-Ayland (External); 'Manfred Koizar'; 'Bruce
> Momjian'; [hidden email]
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

> The odds of an actual failure from case #2 are fortunately
> not high, since a backup block will necessarily span across
> at least one WAL page boundary and so we should be able to
> detect stale data by noting that the next page's header
> address is wrong.  (If it's not wrong, then at least the
> first sector of the next page is up-to-date, so if there is
> any tearing the CRC should tell us.)  Therefore I don't feel
> any need to try to work out a back-patchable solution for #2.
>  But I do think we ought to change the WAL format going
> forward to compute just one CRC across a WAL record and all
> attached backup blocks.  There was talk of allowing
> compression of backup blocks, and if we do that we could no
> longer feel any certainty that a page crossing would occur.

I must admit I didn't realise that an XLog record consists of a number of
backup blocks which are also separately CRC'd. I've been through the source,
and while the XLog code is reasonably well commented, I couldn't find a
README in the transam/ directory that explains the thinking behind the
current implementation - is this something that was discussed on the mailing
lists way back in the mists of time?

I'm still a little nervous about dropping down to CRC32 from CRC64, and so
was just wondering what the net saving would be from using one CRC64 across
the whole WAL record. For example, if an insert or an update uses 3 backup
blocks, would this one change alone reduce the CPU usage to one third of its
original value? (Something tells me that this is probably not the case, as I
imagine you would have picked this up a while back.) In my view, having a
longer CRC is like buying a holiday with insurance - you pay the extra cost
knowing that should anything happen you have something to fall back on.
However, the hard part here is determining a reasonable balance between the
cost and the risk.


Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk




Re: [HACKERS] Cost of XLogInsert CRC calculations

Tom Lane-2
In reply to this post by Simon Riggs
Simon Riggs <[hidden email]> writes:
> If I/O is a problem, then surely you will agree that PreAllocXLog is
> still required and should not be run by a backend?

It is still required, but it isn't run by backends --- it's fired off
during checkpoints.  I think there was some discussion recently about
making it more aggressive about allocating future segments; which
strikes me as a good idea.

                        regards, tom lane
