Valter Douglas Lisb=C3=B4a Jr. wrote:
> Hello all, I have a perl script thats load a entire day squid log to
a=20
> postgres table. I run it at midnight by cronjob and turns off the
indexes=
=20
> before do it (turning it on after). The script works fine, but I want
to=
=20
> change this to a diferent approach.
>
> I'd like to insert on the fly the log lines, so long it be generated to
h=
ave=20
> the data on-line. But the table has some indexes and the load of lines
is=
=20
> about 300.000/day, so the average inserting is 3,48/sec. I think this
cou=
ld=20
> overload the database server (i did not test yet), so if I want to
create=
a=20
> no indexed table to receive the on-line inserting and do a job moving
all=
=20
> lines to the main indexed table at midnight.
>
> My question is, Does exists a better solution, or this tatic is a good
wa=
y to=20
> do this?
The average matters less than the peak. Unless your traffic is even=20
24x7, your rate will be higher. If your log is concentrated in an
8-hour=20
workday, your average daytime rate will be closer to 10/second with=20
peaks that are much higher. You might consider some form of buffering=20
between the Squid log and the database to avoid blocking. Your current=20
method has the advantage of moving the database workload to off-hours.
Instead of moving data, you might look into partitioning your data. How=20
long do you keep your logs actively available in PostgreSQL? I know one=20
company that partitions their log data into months (parent table with=20
child table for each month). They keep 12-months of data live so they=20
rotate through the child tables. At the start of a month, that month's=20
table is truncated. Modify as appropriate for your load - perhaps a=20
partition (child-table) for each day. Or a current-day child-table that=20
is migrated into a main-table nightly. Either way you can make it
appear=20
that the parent-table is an up-to-date complete table.
You will need to do some reading on table partitioning if you go this=20
route. Pay special attention to the requirements needed to optimize
queries.
You might also want to check your stats tables to make sure the indexes=20
you currently maintain are actually used by your queries and remove any=20
that are unnecessary to reduce index-maintenance overhead.
Another possible technique would be to have a nightly process that=20
creates partial-indexes. One set of indexes would cover all data prior=20
to midnight and the other set all data after midnight. Depending on the=20
nature of your "real-time" vs. historical queries, these might even be=20
different indexes. You will have to tweak your queries to make use of=20
your indexes but your live data won't have to update your "historical"=20
indexes. Warning: the date-constraint in the partial index must be=20
static - you can't do something like "...where squidlog_timestamp >=20
current_date...". Your nightly process will be creating new indexes=20
with a new date-constraint. You might even be able to get away with=20
having no indexes on the current-day's data and just recreate
historical=20
indexes nightly (similar to your no-index with nightly-insert).
But don't try the above till you determine you have a problem. On
modest=20
3-year-old non-dedicated (also running file-storage, rsync backup,=20
mail...) hardware with basic SATA RAID1 we are handling a similar load=20
without strain.
Cheers,
Steve
--=20
Sent via pgsql-general mailing list (pgsql-general@[EMAIL PROTECTED]
)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


|