
[#1011280] If table has indexes I lose data


Date: 2012-11-16 14:40
Priority: 3
State: Closed
Submitted by: Ali POUYA (alipouya)
Assigned to: Nobody (None)
Category: None
Group: None
Resolution: None

Summary: If table has indexes I lose data

Detailed description
Hi,
I downloaded pg_bulkload-3.1.2.tar.gz on Linux (kernel 2.6), compiled it, and used it successfully. Thanks.

I load a 36-million-line CSV file into an empty table.
If the table has no indexes, I get all 36M lines in the table.
If it has only one index, I get fewer (around 21M).
If it has two indexes, I get fewer still (around 18M).
I know that there are no duplicates in the file.
The report is OK in all cases (pg_bulkload says it loaded 36M lines).

Below you find some details.
I can send more details if required.

Can somebody help me?
Thanks a lot

================================
More information:
System:
Linux xxxxxxx 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 x86_64 x86_64 GNU/Linux

Control File :
OUTPUT = public.t1 # [<schema_name>.]table_name
INPUT = /var/log/savis/export_corrige.csv # Input data location (absolute path)
TYPE = CSV # Input file type
#QUOTE = "\"" # Quoting character
#ESCAPE = \ # Escape character for Quoting
DELIMITER = "," # Delimiter
PARSE_ERRORS = -1
DUPLICATE_ERRORS = -1
CHECK_CONSTRAINTS=NO
ON_DUPLICATE_KEEP=NEW
WRITER=PARALLEL
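
For reference, an invocation with this control file would look roughly like the line below (the control file path and database name are placeholders, not values from the report):

pg_bulkload /path/to/bulkload.ctl -d mydb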

Followup

Message
Date: 2013-04-09 13:49
Sender: Ali POUYA

Hi Takashi,
I tested the new version successfully with both PostgreSQL 9.1.6 and 9.2.4. It works fine!
The time ratio between the parallel and direct modes is around 3 to 4, which makes a real difference for large volumes of data.

Thank you very much for this beautiful work.
I will be an advocate for pg_bulkload.
This bug can be closed.
Best regards
Date: 2013-04-08 08:10
Sender: Takashi Ohnishi


Hi.

Sorry for the late reply, and thanks for your bug report.

We fixed the bug and released it as new version 3.1.3.
I hope this version will help you.
Date: 2013-02-11 12:29
Sender: Ali POUYA

I encounter the problem in two different environments with different parameters: PostgreSQL 9.1.3 and 9.2.3.
In the attached archive (bug_1011280.tar.gz) you will find the settings for each environment.
The version of pg_bulkload is always the same: 3.1.2. I compile the source and use it on Linux Fedora with a 2.6 kernel on a 64-bit architecture.

In the archive you find also a self contained test case which reproduces the problem.
To reproduce the problem, proceed as follows:

1) Deploy the archive bug_1011280.tar.gz
2) Adapt the file environment to yours
3) Run . ./environment
4) Run ./prepare.sh: this prepares the table public.foo with two indexes and the CSV file 10000000.csv containing ten million lines.
5) Run ./test.sh. At the end you should find ten million rows in the table public.foo, but you will find fewer (around 4 million); a sample check is sketched after this list.

You can retry test.sh several times if you wish.
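
A minimal way to check the discrepancy after step 5 (these commands are only a sketch assuming the file layout described above and default psql connection settings; they are not part of the attached scripts):

wc -l 10000000.csv                              # input file: 10000000 lines
psql -c "SELECT count(*) FROM public.foo;"      # loaded rows: around 4 million instead of 10000000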

***

More details:
If you drop the indexes, the problem does not occur!
Also, if you work in DIRECT mode instead of PARALLEL mode, the problem does not occur.
When checking the code I find that there are fewer TupleParserRead() invocations than write_queue() invocations (there are only as many as there are rows actually found in the table).

Maybe this is a problem of concurrent access to the queue structure and its data?
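
As an illustration of the DIRECT-mode workaround mentioned above (only a sketch; the rest of the control file stays as in the original report), the change is the single WRITER line:

#WRITER=PARALLEL
WRITER=DIRECT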
Date: 2012-11-16 14:43
Sender: Ali POUYA

Hi,
I forgot to give the PostgreSQL version. It is 9.1.3.
Ali Pouya

Attachments:
bug_1011280.tar.gz

Changes:

Field        Old Value                Date              By
status_id    Open                     2013-04-11 04:27  oonishitk
close_date   None                     2013-04-11 04:27  oonishitk
File Added   791: bug_1011280.tar.gz  2013-02-11 12:29  alipouya