Import Incremental Data to Data Mart


Import Incremental Data to Data Mart: import data into the VooltDB database on a regular schedule according to specified conditions.

When execute"Incremental Import Data to Data Mart", the data will be imported to the VooltDB database. For example to illustrate that because the data in the database may change, users can create incremental data import tasks to monitor the data in a specific data set at 8 o'clock every day, so that the user can make timely processing.

 

The steps to create a new "Import Incremental Data to Data Mart" task are as follows:

1. In the task area of a new job or on the New Task screen, select the task type "Incremental to Data Mart", as shown in the figure below.

(Figure: incremental import type)

There are two incremental import methods: "normal increment" and "dynamic incremental update"; the default is "normal increment". "Normal increment" imports the full data or additional data into the data mart. "Dynamic incremental update" can not only update the historical data in the data mart according to the configured time, but also add new data to the data mart.

2. Data set: select the data set to be imported into the data mart; this field is required. You can click "Refresh Data Set" to synchronize the data set list. When the incremental import method is "normal increment", the user can select any data set in the drop-down list; when the method is "dynamic incremental update", the user can only choose data sets whose queries can be pushed down to the database for execution.

(Figure: data set name)

 

 

3. In version 8.5, data precision handling was improved so that there are no precision problems when calculating over large amounts of data, and the number of precision digits can be configured. Click the Edit Format button on the right side of the data set to display the double and float data columns in the data set, and set the storage precision for the mart folder. Different mart folders can have different precisions. The default precision for double and float columns is 4, as shown in the figures below.

(Figure: Edit Format)

 

(Figure: setting the edit format)
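As a hypothetical illustration of the storage precision (assuming the precision value is the number of decimal digits kept, which is not stated explicitly): with the default precision of 4, a raw double value of 3.141592 would be stored as 3.1416, while a mart folder configured with precision 2 would store it as 3.14.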

 

4. Enter the name of the data mart folder to be generated by the incremental import, or select an existing data mart folder; this field is mandatory. When the user enters or selects an existing data mart folder, the system shows a duplication prompt based on whether Append is checked, to prevent the user from unknowingly overwriting an existing data mart folder and deleting the existing data mart files, which would lead to data loss. As shown in the figure below:

 

(Figure: folder already exists prompt)

 

5. Set the name prefix for the folder and files generated by "Import Incremental Data to Data Mart".

(Figure: folder and file prefix)

6. Set other attributes of normal increment.

(Figure: import data properties)

 

Yonghong Desktop only supports Append, Split, and Set Meta Tag.

[Key] Set the primary key to accelerate queries; the primary key column cannot contain NULL values.

[Dimension table] When the user checks Dimension table, for star-schema data in a distributed system (one large table and several small tables), the data in the small tables can be replicated to each VooltDB node, so that a map-side join can be executed to improve data reading and processing speed.

[Append] When the user checks Append, newly generated data files are appended to the folder without deleting the existing data files. For example, the user creates an incremental import task that collects the data of a data set at 8 o'clock every day and generates a data mart file: the file generated on the first day is test0, the file generated on the second day is test1, and so on. When the user does not check Append and the mart folder already exists in the current system, a new data mart folder is created to replace it. For example, with the same daily task, the files generated on the first day are placed in folder0; on the second day, a new mart folder folder0 is generated to overwrite the existing folder and its files.

[Sharding Column] When the user checks Sharding Column, the data column is stored using consistent hashing. When two data sets are imported into the mart incrementally with sharding columns checked, and the number and types of the sharding columns match, using the sharding columns as the join condition allows a map-side join when the data sets are combined.

[Split] Splits by column and imports the data into the mart in parallel. When the user checks Split, the split dialog box pops up automatically. The split types are "Split by Average" and "Split by Group". For "Split by Average", only one split column can be chosen, and the split number must be an integer. The default is automatic splitting, with the number of splits input by the user; the default is 4, as shown in the figure below.

(Figure: split settings 1)

When automatic splitting is turned off, the user can manually input split values. Manual splitting groups the data according to the split values, and each group of data is stored in zb files; the number of zb files corresponds to the split number. Pay attention to the input format: if the number of split values entered falls outside the range [n-1, n+1] (where n is the split number), clicking the OK button shows the prompt that custom split values can only be set within the range [n-1, n+1], and the extra lines of split values must be deleted.

As shown in the figure below, the Double column is split according to Double < 3; 3 ≤ Double ≤ 8; Double > 8; and Double is empty, with the split number set to 2.

(Figure: split settings 2)

"Split by Group" does not take a split number, and multiple split columns can be selected. In order not to impact import efficiency, it is suggested that the number of split columns be no more than 10. "Split by Group" automatically generates Meta information to facilitate data filtering in the data mart: the key in Meta is the name of the split column, and the value in Meta is the corresponding value of the split column, as in the example below. When the user does not check Split, the data mart files are generated according to the system default settings.
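For instance (the column name and values are illustrative): splitting by group on a column named region with the values East and West generates mart files tagged with the Meta key region and the value East or West, so a data set built on the mart can filter on region directly.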

[Setmeta] In version 8.6, the setMeta attribute was added to incremental data import. It is used to label specified columns of the data set entering the mart. It also solves the problem that, when the SQL load of the data set is slow, using group splitting to apply Meta tags makes the job run too long.

Before version 8.6, data sets entering the mart could only be labeled via group splitting. However, due to the splitting mechanism, when the SQL execution time of the data set itself is long, splitting increases resource consumption and prolongs the job's run time. From version 8.6 on, the setMeta attribute solves this problem. The restriction for using setMeta is the same as for group splitting (total number of data rows / number of distinct values in the split column > 262144 rows), and the effect is consistent with group splitting. However, when the SQL run time is short, setMeta does not accelerate the job the way group splitting does; only when the SQL run time is long does setMeta run faster than group splitting. A worked example of the restriction follows.
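As an illustrative check of this restriction (the row counts are made-up figures): a data set with 10485760 rows whose split column has 10 distinct values gives 10485760 / 10 = 1048576 rows per value, which exceeds 262144, so setMeta (or group splitting) can be applied; if the column instead had 100 distinct values, 10485760 / 100 = 104857.6 falls below the threshold and setMeta cannot be used.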

7. Set the properties of dynamic incremental update, as shown in the following figure:

(Figure: dynamic incremental update)

[Update by column] The update-by column can only be of date or timestamp type, and it is required. The user can select an original column of the data set or create a new calculated column, but only one new calculated column can be created. After the calculated column is created, it can be edited. You can click "Refresh Update by Columns" to synchronize the columns of the data set.

[Date level] The date level includes year, quarter, month, week, and day. The default is year, and it is required. After the date level is selected, the mart files are generated according to the selected date level.

[Update by time range] Set the start date and end date of the mart data to be updated; this is required. The start date and end date can be chosen from the built-in parameters today, yesterday, this week, last week, this month, last month, this quarter, last quarter, this year, and last year, or a specific date can be selected.

[Update periodically] Set the time period over which the mart data is updated; the default is from 3 years ago to today, and it is required.

[More advanced settings] After expanding More advanced settings, users can enable page splitting, which is unchecked by default. When the database used by the data set supports paging, and the data set is large and needs to be split into the mart quickly, page splitting can be used. After page splitting is checked, the data is sorted by the primary key column in descending order and paged at 1000000 rows per page by default; the primary key column defaults to the update-by column. Users can set the number of rows per page and the primary key column as needed; a brief example follows. Databases that support paging include: MySQL, Oracle, DB2, SQL Server, GBase 8a, Kylin, Sybase IQ, and HANA.
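For example (the row count here is illustrative): with the default page size of 1000000 rows per page, a data set of 3500000 rows is read in four pages, three full pages of 1000000 rows plus a final page of 500000 rows.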

8. Add filter conditions. Click the [Edit] button to open the filter dialog box, click to add a filter condition, and select the column on which the filter should be set, as shown below. The system performs the incremental import only on data that meets the filter criteria.

(Figure: filter settings 1)

(Figure: filter settings 2)

9. Users can control the incremental data import through scripts. Scripts can set the name of the data mart folder, pass parameters, control appending, and set properties of the data mart files. When the task runs, the script has the highest priority. A combined example follows the table below.

Script statement | Description | Example
folder | Create a data mart folder | folder="CloudTest";
append | Whether to append | append=true;
setMeta | Set attributes for data mart files | setMeta("data",new Date(2012,9,19)); // the date attribute of the data mart file is 2012-9-19, which can be used in the filter conditions of the data set built on the mart
param | Pass a parameter | param["market"]="East"; // the mart covers only the data for East; the parameter market comes from the corresponding data set

10. When the job type is "Import Incremental Data to Data Mart" and the job runs successfully, once the result shown on the current job interface and the history job interface is "successful", an icon appears that opens the dialog box listing the data files imported into the mart.

(Figure: task run result)

The system automatically adds time information to the data that is successfully imported into the mart, in the format "job trigger time_job name"; for example, a job named Job1 is tagged with its trigger time followed by "_Job1". For the data set built on the new data mart, all data imported into the mart can be filtered via this Meta information.

11. When selecting a data set on this page, the recommended maximum number of columns for a data set entering the mart is 50; exceeding this value triggers a prompt that the import may be slow, as shown below:

(Figure: recommendation prompt)

 

 

Delete Incremental Data

On the job status page, the user is allowed to delete the data most recently imported into the mart while keeping the previously imported data. If the user imports data into the mart consecutively in Append mode and a data error occurs during the second import, the user can find the corresponding job execution record on the historical job status page, open the task run result dialog box, delete the erroneous imported data, and keep the rest.

(Figure: delete file)