Monday, 30 October 2023

Sustainability - Selecting the right database technology to optimize carbon footprint

I have written and presented a lot on time series data management in the past. Mostly, I focused on reducing storage, cost, and development effort, and on improving performance. However, revisiting the topic from a sustainability angle gives an interesting perspective.

Sustainability in IT is directly related to how efficiently you manage data: primarily, how efficiently you process data, store data, and transfer data over the network. Selecting the right database technology can directly influence the first two.

In this blog, I would like to talk about handling time series data efficiently, which can help drastically reduce the carbon footprint by cutting storage requirements by up to 70% and speeding up processing by as much as 30 times.

Often, a time series database is misunderstood as a NoSQL database, but unfortunately very few people know that time series is a specific technology in its own right, all about how time series data can be stored and processed efficiently.

So let us understand this time series world and how selecting the right database technology can help drastically reduce the carbon footprint.

Who generates time series data?

Traditionally, time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communication engineering, and broadly across applied science and engineering domains that involve temporal measurements. With the emergence of the Internet of Things (IoT) and the proliferation of connected devices, more and more time series data is generated via sensors. Hence, irrespective of the domain, time series data is generated almost everywhere: Capital Markets, Energy and Utilities, Telecommunications, Manufacturing, Logistics, Scientific Research, Intelligent Transportation and many more.

What does it look like?

Time series data has an internal structure that differs from relational data. Many applications need to store data at frequent intervals, which requires massive storage capacity. For these reasons, it is not sufficient to manage time series data using the traditional relational approach of storing one row for each time series entry. That approach drastically increases storage and processing requirements, and with them the carbon footprint.
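To make the contrast concrete, here is a minimal sketch of the traditional relational approach, where every reading becomes a full row. The table and column names are hypothetical, chosen only for illustration:

```sql
-- Traditional relational approach: one row per reading.
-- The meter_id and the index entry on (meter_id, read_time) are
-- stored over and over for every single reading, which is where
-- much of the storage overhead comes from.
create table meter_readings
(
   meter_id  integer,
   read_time datetime year to fraction(5),
   kwh       float,
   primary key (meter_id, read_time)
);

-- At 15-minute intervals, 1 million meters add roughly
-- 96 million such rows per day, each repeating the key.
```

A time series database instead stores one row per meter and packs the readings inside it, so the key and index overhead is paid once per meter rather than once per reading.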


Informix TimeSeries handles it efficiently:

IBM's Informix TimeSeries feature provides a solution to this problem with breakthrough technology. The Informix TimeSeries feature is a combination of a TimeSeries data type and a large set of built-in analytical functions. How it manages time series data can be understood from my decade-old article published in DBTA Magazine: Managing Time Series Data with Informix - Database Trends and Applications (dbta.com)

It can reduce storage requirements by more than 50% and improve performance by orders of magnitude. With the integration of the TimeSeries feature with NoSQL/JSON and in-memory data warehouse capabilities, it can handle heterogeneous and unstructured time series data and run real-time analytics at the speed of thought. The capability to store data at up to hertz frequency further extends its reach across industries. The rolling window feature eases the challenge of periodically purging humongous volumes of data. Moreover, because of the way TimeSeries is structured, one can keep inserting millions of records and performance remains consistent without any database tuning activities, which drastically reduces processing requirements.


For more details and more solutions, refer to the Redbook I co-authored: Solving Business Problems with Informix TimeSeries (ibm.com)

Conclusion

Time series doesn’t necessarily mean NoSQL data. It has its own structure, and if the right database technology, such as Informix TimeSeries, is chosen to handle this data, one can drastically reduce the carbon footprint.


Disclaimer: The postings on this site are my own and don't necessarily represent IBM's position, strategies or opinions

Monday, 14 March 2016

Power of Expression Based Virtual Table Interface

It's been a long time since I last posted a blog; I have been busy with customer engagements. I thought of sharing a few good time series use cases that I handled using the Expression-Based Virtual Table Interface (EVTI).

All the examples below explain different queries based on the following table structure:

create row type series
(
   ts datetime year to fraction(5),
   val1 float,
   val2 float,
   val3 float
);

create row type float4
(
   ts datetime year to fraction(5),
   f_val1 float,
   f_val2 float,
   f_val3 float,
   f_val4 float
);

create table ops_data
(
   meter_id integer,
   data timeseries(series)
);
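Before the queries, here is a hedged sketch of how such a table might be populated. The calendar name 'ts_1min', the container name 'cnt1', the meter_id, and the sample values are assumptions for illustration, not part of the original setup:

```sql
-- Initialize an irregular time series for one meter (assumes a
-- calendar named 'ts_1min' and a container named 'cnt1' exist).
insert into ops_data values
(101, TSCreateIrr('ts_1min','2016-01-01 00:00:00.00000',0,0,0,'cnt1'));

-- Append a reading with PutElem: the element is a row of the
-- 'series' row type defined above.
update ops_data
set data = PutElem(data,
    row('2016-01-01 00:01:00.00000'::datetime year to fraction(5),
        10.5, 20.5, 30.5)::series)
where meter_id = 101;
```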



- As a simple use case, we need a real-time aggregate of 1-minute data rolled up to 15 minutes, with the avg, max, first, and last value of val1:

execute procedure TSCreateExpressionVirtualTab ('ops_data_agg15',
'ops_data','aggregateby("avg(val1),max(val1),last(val1),first(val1)","cal_15min",series,0,$ts_begin_time, $ts_end_time)::timeseries(float4)','float4');


- val1 might have negative values, hence we need to run the aggregates on the absolute value of val1:

execute procedure TSCreateExpressionVirtualTab ('ops_data_agg15',
'ops_data','aggregateby("avg(val1),max(val1),last(val1),first(val1)","cal_15min",abs(series),0,$ts_begin_time, $ts_end_time)::timeseries(float4)','float4');


- We need the (current - previous) value of the average of val1, along with the max, last, and first values:

execute procedure TSCreateExpressionVirtualTab ('ops_data_agg15',
'ops_data','apply("$f_val1 - tsprevious($f_val1),$f_val2,$f_val3,$f_val4", aggregateby("avg(val1),max(val1),last(val1),first(val1)","cal_15min",abs(series),0,$ts_begin_time, $ts_end_time)::timeseries(float4))','float4');


A few things to note in the above query:
- The apply function is used, and the time series created by the aggregateby function is passed as an argument to apply.
- The tsprevious function is used to keep track of the previous record, which is then subtracted from the current value.
- The output of the aggregateby function is type cast to timeseries(float4); hence the column names used in apply are the names available in the float4 row type.
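Once created, the expression-based virtual table can be queried like any ordinary table. A minimal sketch, assuming the meter_id value and date range are placeholders and that the $ts_begin_time and $ts_end_time markers are bound from the conditions on the timestamp column:

```sql
-- Query the 15-minute aggregates for one meter over one day.
-- The meter_id and the date range are illustrative values; the
-- column list follows the float4 row type defined earlier.
select meter_id, ts, f_val1, f_val2, f_val3, f_val4
from   ops_data_agg15
where  meter_id = 101
and    ts >= '2016-01-01 00:00:00.00000'
and    ts <  '2016-01-02 00:00:00.00000';
```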

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Monday, 12 January 2015

Informix TimeSeries - Some Best Practices for Beginners

I have been working with quite a few partners and customers who are using Informix TimeSeries. I realized that there is a set of instructions which is common and very useful for all new users to get the best out of a TimeSeries implementation; let us call them best practices. Hence, I thought of writing this blog, which will be useful both for those I am working with and for those I don't know who are using Informix TimeSeries. These tips will at least get things done right on the first go; later you can tune your server or queries as per your setup. The basic assumption is that you are using the Informix Enterprise or Advanced Enterprise edition on a multi-core server. If you are using a lower edition, fragmenting the table and setting PDQPRIORITY won't be applicable.

A) Regarding containers: you can create multiple containers across the same dbspaces, but try to keep the dbspace that holds a container's metadata different for each container.
Example:
execute procedure TSContainerCreate('ts_tab1', 'data8dbs_1','ts_row',1024, 1024,
'2012-01-01 00:00:00.00000'::datetime year to fraction(5),'month',0,0,'data8dbs1,data8dbs2',1024,1024,0)

execute procedure TSContainerCreate('ts_tab2', 'data8dbs_2','ts_row',1024, 1024,
'2012-01-01 00:00:00.00000'::datetime year to fraction(5),'month',0,0,'data8dbs3,data8dbs4',1024,1024,0)

B) Create the timeseries table, say ts_tab, with first and next extent sizes of 1024:
create table ts_tab
  (
    device_id varchar(32),
    meter_data timeseries(ts_row),
    primary key (device_id) 
  ) fragment by round robin in dbs1, dbs2
extent size 1024 next size 1024 lock mode row;

C) While initializing the time series, you can assign containers in a round-robin fashion to each time series, like:
insert into ts_tab values('tag1', TSCreateIrr('ts_1min','2014-10-01 00:00:00.00000',0,0,0,'ts_tab1'));
insert into ts_tab values('tag2', TSCreateIrr('ts_1min','2014-10-01 00:00:00.00000',0,0,0,'ts_tab2'));
insert into ts_tab values('tag3', TSCreateIrr('ts_1min','2014-10-01 00:00:00.00000',0,0,0,'ts_tab1'));
insert into ts_tab values('tag4', TSCreateIrr('ts_1min','2014-10-01 00:00:00.00000',0,0,0,'ts_tab2'));
And so on.
This is just an example; use the appropriate calendar, origin, etc. as per your setup.

Ensure you run “update statistics high for table ts_tab” after initializing all the time series and before starting the data upload.

D) As far as the onconfig is concerned, I observed that setting the following parameters gives excellent performance. I have used the onconfig from a setup with 100K meter IDs. These values might differ for your setup, but this section gives you an idea of the parameters to look at:

Parameter            Value     Remarks
LOGSIZE              102400    Just changing the parameter in onconfig won't help; you have to create new log files with this size and delete the older ones.
PHYSFILE             4194304   Create a dbspace of 4 GB with the default page size and use the onparams command to create a new PHYSFILE of 4 GB. You have to take a level-0 backup after this step; you can run "ontape -s -L 0 -t /dev/null" to take a dummy backup.
VP_MEMORY_CACHE_KB   256
CKPTINTVL            600
RTO_SERVER_RESTART   0
STACKSIZE            128
BUFFERPOOL           size=**,buffers=??,lrus=8,lru_min_dirty=80,lru_max_dirty=90

#?? - Set this value to 2.5 * the number of meters.
#** - The page size of the dbspace(s) in which the containers are created.

MULTIPROCESSOR       1
VPCLASS              cpu,num=??,noage      #Change ?? to the number of cores you have on your machine.
SINGLE_CPU_VP        0

E) Set PDQPRIORITY to 60 in your environment or in your SQL sessions.
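A quick sketch of setting it at the session level; the value 60 follows the recommendation above:

```sql
-- At the session level, before running heavy TimeSeries queries:
SET PDQPRIORITY 60;
```

In the environment, the equivalent is exporting the PDQPRIORITY variable (e.g., PDQPRIORITY=60) before starting the client; the session-level statement overrides the environment value for that session.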

F) While querying data from the time series, set:
set isolation to dirty read;
set lock mode to wait 10;
<after this you can add your select query>
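For instance, a minimal sketch of a query that might follow those settings; the device_id value 'tag1' is illustrative:

```sql
-- Fetch the raw time series for one device, after setting
-- dirty read isolation and the lock wait mode above.
select device_id, meter_data
from   ts_tab
where  device_id = 'tag1';
```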

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Friday, 3 October 2014

Informix is the only commercial database available on HP Moonshot

The new Moonshot servers ship with Canonical’s Ubuntu Linux pre-installed, as well as a software stack from Nginx that includes memcached and other web programs. IBM’s Informix is also offered as an option - the only commercial database currently available for Moonshot.

http://www.pcworld.com/article/2688912/hps-moonshot-server-now-packs-64bit-arm-chips.html

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Friday, 12 September 2014

Solving Business Problems with Informix TimeSeries - A Redbook

This IBM Redbooks® publication is for people who want to implement a solution that revolves around time-based data. It gives you the information that you need to get started and be productive with Informix TimeSeries.

Table of contents

Chapter 1. Overview of IBM Informix
Chapter 2. A use case for the Informix TimeSeries feature
Chapter 3. Defining your TimeSeries environment
Chapter 4. Implementing Informix TimeSeries
Chapter 5. Querying TimeSeries data
Chapter 6. Managing the ecosystem
Appendix A. Reference material
Appendix B. Enterprise historian database example
Appendix C. Distribution grid monitoring enabler


http://www.redbooks.ibm.com/abstracts/sg248021.html

Author(s)