Design Features of a Well-Managed Talend Project

Let me summarize 8 years of working with Talend Data Integration into a few tips:

All Context parameters keep their values in the database.

Skip the drama with file paths across different operating systems. Keep context values in an easy-to-edit database table. My favorite is MySQL with a Navicat client.

Ideally, the system ships with sensible default values that are then overridden by the appropriate context values.
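Here is a rough sketch of what such a context table could look like, with defaults overlaid by environment-specific values. It uses Python's built-in sqlite3 purely as a stand-in for MySQL, and the table name `etl_context`, its columns, and the `DEFAULT` pseudo-environment are my own assumptions for illustration, not anything Talend ships.

```python
import sqlite3


def create_context_table(conn):
    # Hypothetical layout: one row per (environment, parameter name).
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS etl_context (
            env   TEXT NOT NULL,   -- 'DEFAULT', 'DEV', 'QA', 'PROD'
            name  TEXT NOT NULL,
            value TEXT,
            PRIMARY KEY (env, name)
        )
        """
    )


def load_context(conn, env):
    """Return the default values overlaid with the values for one environment."""
    context = {}
    # Defaults are read first; rows for the requested environment overwrite them.
    for layer in ("DEFAULT", env):
        rows = conn.execute(
            "SELECT name, value FROM etl_context WHERE env = ?", (layer,)
        )
        context.update(dict(rows.fetchall()))
    return context


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_context_table(conn)
    conn.executemany(
        "INSERT INTO etl_context VALUES (?, ?, ?)",
        [
            ("DEFAULT", "db_host", "localhost"),
            ("DEFAULT", "batch_size", "500"),
            ("QA", "db_host", "qa-db.example.internal"),  # made-up host name
        ],
    )
    print(load_context(conn, "QA"))
```

In a real job the same lookup would typically run at the start of the job and feed the job's context variables; editing a value is then a one-row update in your database client.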

All the logging goes to tables that correspond to tLogCatcher, tStatCatcher, and tFlowMeterCatcher.

You’re going to need a reliable database no matter what logging system you use, so you may as well leverage that database for quality logging that, again, doesn’t fail because of file paths.

Old log content gets purged at least every 30 days. Someone in your IT hierarchy is going to convince you that you need execution history beyond 30 days for some Lean Six Sigma management best practice, but no one will ever actually dig through it.
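A purge like this can be a few lines run on a schedule. The sketch below is one way to do it, again with sqlite3 standing in for the real database. The table names are hypothetical, and I am assuming each log table has a `moment` timestamp column stored as `YYYY-MM-DD HH:MM:SS` text; adjust to whatever your catcher tables actually look like.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical table names for the three catcher outputs.
LOG_TABLES = ("talend_logs", "talend_stats", "talend_flowmeter")


def purge_old_logs(conn, now=None, keep_days=30):
    """Delete log rows older than keep_days and return rows deleted per table."""
    reference = now or datetime.now()
    cutoff = (reference - timedelta(days=keep_days)).strftime("%Y-%m-%d %H:%M:%S")
    deleted = {}
    for table in LOG_TABLES:
        # Table names come from the fixed tuple above, never from user input.
        cursor = conn.execute(f"DELETE FROM {table} WHERE moment < ?", (cutoff,))
        deleted[table] = cursor.rowcount
    conn.commit()
    return deleted
```

Because the timestamps are stored in a sortable text format, a plain string comparison against the cutoff is enough here; with a native DATETIME column in MySQL the comparison would work on the column type directly.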

Create only one context variable, ENV, that dictates which environment the job is running in.

Why? I once destroyed a week of testing because one of my DEV jobs accidentally used QA settings and it was still able to connect to the dev resources. Use the ENV variable to pull the right values from your context table.
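To make the single-switch idea concrete, here is a small sketch of resolving that one value and refusing to run when it is missing or unknown. The variable name `ETL_ENV` and the set of allowed environments are assumptions for the example; the point is only that nothing else in the job gets to decide which environment it is in.

```python
import os

# Assumed set of environments; use whatever your shop actually has.
ALLOWED_ENVS = {"DEV", "QA", "PROD"}


def resolve_env(environ=None):
    """Read the one environment switch and fail loudly if it is not valid."""
    environ = os.environ if environ is None else environ
    env = environ.get("ETL_ENV", "").strip().upper()
    if env not in ALLOWED_ENVS:
        # No silent fallback: a job with an unknown environment should not run.
        raise ValueError(
            f"ETL_ENV must be one of {sorted(ALLOWED_ENVS)}, got {env!r}"
        )
    return env
```

The returned value is then the only key used to look up connection settings in the context table, so a job cannot end up with a mix of DEV and QA values.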

Let the database do its job.

Staging tables exist for a reason.  If you’re doing a massive join between two data feeds there is a very good chance that staging the feeds in the same db and using ELT components (vs. standard ETL components) will dramatically increase performance.
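The pattern looks roughly like this: land both feeds in staging tables, then let the database do the join and aggregation in a single INSERT ... SELECT instead of pulling rows into the job. This is a sketch with sqlite3 as a stand-in and made-up table and column names (`stg_orders`, `stg_customers`, `sales_by_region`); it shows the shape of the approach, not how Talend's ELT components generate their SQL.

```python
import sqlite3


def stage_and_join(conn, orders, customers):
    """Load two feeds into staging tables and join them inside the database."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS stg_orders (
            order_id INTEGER, customer_id INTEGER, amount REAL);
        CREATE TABLE IF NOT EXISTS stg_customers (
            customer_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE IF NOT EXISTS sales_by_region (
            region TEXT PRIMARY KEY, total REAL);
        DELETE FROM stg_orders;
        DELETE FROM stg_customers;
        DELETE FROM sales_by_region;
        """
    )
    # Step 1: bulk-load each feed as-is into its staging table.
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", orders)
    conn.executemany("INSERT INTO stg_customers VALUES (?, ?)", customers)
    # Step 2: the join and aggregation run entirely in the database engine.
    conn.execute(
        """
        INSERT INTO sales_by_region (region, total)
        SELECT c.region, SUM(o.amount)
        FROM stg_orders AS o
        JOIN stg_customers AS c ON c.customer_id = o.customer_id
        GROUP BY c.region
        """
    )
    conn.commit()
    return dict(conn.execute("SELECT region, total FROM sales_by_region"))
```

The job only moves data in and issues SQL; the rows being joined never travel back through the job's memory, which is where the performance gain usually comes from.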

For work that runs in batch mode, keep your ETL Control table in order.

Make small jobs that do something specific.  Control what executes next in an ETL control table. If you aren’t using an established tool for coordinating and scheduling execution of jobs then use an ETL control table.  
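A bare-bones version of an ETL control table might look like the sketch below. The table name `etl_control`, its columns, and the status values are all hypothetical, and sqlite3 again stands in for the real database. Each small job is represented by a plain callable so the example stays self-contained.

```python
import sqlite3


def create_control_table(conn):
    # Hypothetical layout: one row per small job, ordered by run_order.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS etl_control (
            job_name  TEXT PRIMARY KEY,
            run_order INTEGER NOT NULL,
            enabled   INTEGER NOT NULL DEFAULT 1,
            status    TEXT NOT NULL DEFAULT 'PENDING'  -- PENDING, DONE, FAILED
        )
        """
    )


def next_job(conn):
    """Return the next enabled, pending job name, or None when the batch is done."""
    row = conn.execute(
        """
        SELECT job_name FROM etl_control
        WHERE enabled = 1 AND status = 'PENDING'
        ORDER BY run_order
        LIMIT 1
        """
    ).fetchone()
    return row[0] if row else None


def run_batch(conn, runners):
    """Run pending jobs in order; runners maps job_name to a callable.

    Stops at the first failure so later jobs never run on bad data.
    """
    executed = []
    while (job := next_job(conn)) is not None:
        try:
            runners[job]()
            status = "DONE"
        except Exception:
            status = "FAILED"
        conn.execute(
            "UPDATE etl_control SET status = ? WHERE job_name = ?", (status, job)
        )
        conn.commit()
        executed.append((job, status))
        if status == "FAILED":
            break
    return executed
```

Reordering, disabling, or re-running a job is then a row edit in the control table rather than a change to any job's design.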

Skip the giant pyramid of child jobs.

I’ve never seen anything good come from embedding job scheduling logic in the definition of a parent job. Just schedule the jobs with an ETL control table or use the scheduling features of TAC (Talend Administration Center).

If Google has a well-defined approach to a data management task then at least read up on it and mimic it.

“Google’s mission is to organize the world’s information and make it universally accessible and useful.” It is an extremely safe bet that if you can use Google’s services for master data, it will cost far less in time and errors than maintaining that data yourself. The most obvious example is Google Maps or Google address data. Talend offers components to pull cleansed addresses.