Everyday getting better: June 2015

Wednesday, June 24, 2015

XML Streaming tip while using JAXB

Streaming to XML files is generally preferable to building and writing a full DOM.

If you are writing a large XML file and the child elements are not all written in one stretch or one method call, writing individual fragments is efficient and practical.

Sample xml may look like:

<MyRoot>
<Child id="1" type="a"/>
<Child id="2" type="a"/>
<Child id="3" type="a"/>
.
.
.
<Child id="1001" type="b"/>
<Child id="1002" type="b"/>
<Child id="1003" type="b"/>
.
.
.
</MyRoot>

Let us say <Child> elements of type "a" are added by method1() and elements of type "b" are added by method2(). The following snippet lets each method write independently to the same XML file in fragments, i.e. without the XML declaration or an enclosing root element.

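import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Marshaller;

// Employee is assumed to be a JAXB-annotated class (@XmlRootElement).
public void method1() {

    Employee emp1 = new Employee();
    emp1.setId(100);
    emp1.setFirstName("John");
    emp1.setLastName("Macy");
    emp1.setAge(29);

    Employee emp2 = new Employee();
    emp2.setId(88);
    emp2.setFirstName("Linda");
    emp2.setLastName("Stuard");
    emp2.setAge(25);

    // Open the file once, in append mode, so every fragment is added to it.
    // Marshalling straight to a File would overwrite the file on each call.
    try (OutputStream out = new FileOutputStream("C:\\myFile.xml", true)) {
        JAXBContext jaxbContext = JAXBContext.newInstance(Employee.class);
        Marshaller jaxbMarshaller = jaxbContext.createMarshaller();

        jaxbMarshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
        // JAXB_FRAGMENT suppresses the XML declaration, so each object can be
        // written as a stand-alone fragment inside a larger document.
        jaxbMarshaller.setProperty(Marshaller.JAXB_FRAGMENT, true);

        jaxbMarshaller.marshal(emp1, out);
        jaxbMarshaller.marshal(emp2, out);
    } catch (JAXBException | IOException e) {
        e.printStackTrace();
    }

}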

After this code runs, myFile.xml looks as follows (without a root element):

<Employee>
<Id>100</Id>
<FirstName>John</FirstName>
<LastName>Macy</LastName>
  <Age>29</Age>
</Employee>
<Employee>
<Id>88</Id>
<FirstName>Linda</FirstName>
<LastName>Stuard</LastName>
  <Age>25</Age>
</Employee>
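
For comparison, the idea of several methods contributing children to one root element can also be sketched with the JDK's StAX XMLStreamWriter. This is a different technique from the JAXB fragment approach above, not a replacement for it. The class and method names are hypothetical, and the element names are taken from the sample XML at the top of this post.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch: two methods stream <Child> elements into one shared writer.
class ChildStreamSketch {

    // Writes one empty <Child id=".." type=".."/> element per id in [fromId, toId].
    static void writeChildren(XMLStreamWriter out, int fromId, int toId, String type)
            throws XMLStreamException {
        for (int id = fromId; id <= toId; id++) {
            out.writeEmptyElement("Child");
            out.writeAttribute("id", Integer.toString(id));
            out.writeAttribute("type", type);
        }
    }

    static void method1(XMLStreamWriter out) throws XMLStreamException {
        writeChildren(out, 1, 3, "a");
    }

    static void method2(XMLStreamWriter out) throws XMLStreamException {
        writeChildren(out, 1001, 1003, "b");
    }

    // The caller owns the root element; each method only adds its own children.
    static String buildDocument() {
        StringWriter buffer = new StringWriter();
        try {
            XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(buffer);
            out.writeStartElement("MyRoot");
            method1(out);
            method2(out);
            out.writeEndElement();
            out.flush();
            out.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildDocument());
    }
}
```

Here the writer goes to an in-memory buffer only to keep the sketch self-contained; for a large file you would pass a file-backed Writer or OutputStream instead.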

Monday, June 22, 2015

Performing SVN Merges using Eclipse plugin

Most projects configure automated merges, which make sure that changes on your feature branch are forwarded to the required future branch or trunk. Sometimes this does not happen: a merge conflict arises, so the automated merge cannot be performed, or earlier revisions of the same file are still pending, so your merge waits in the queue. Automated forward merges may also not be configured at all. In such cases you have to merge manually: resolve any conflicts, merge your changes into the future branch or trunk, and commit.

In the Eclipse IDE, right-click the project or folder on which you want to perform the merge and select Team->Merge from the context menu, as shown in the screenshot below.

This brings up the merge wizard shown below, from which you can choose the type of merge you want to perform. Each merge type is explained below:

Merge a range of revisions

Manually record merge information (block one or more revisions)

This is done by choosing the 5th option in the merge wizard, as shown in the screenshot below. The subsequent steps are almost the same as in the flow explained above: choose the branch and path to merge from, select the revision you want to block, and choose Finish. This updates the merge information recorded in the file's SVN properties, and you can then commit the file. The revision is now blocked from being auto-merged into the SVN branch, and it no longer holds up subsequent pending merges.

Please feel free to ask questions in the comments section, and to suggest any specific area that I have missed.

Saturday, June 13, 2015

Hadoop Misc

Here is a sample that uses the new MapReduce API.

Regression using Apache Math API

`Simple Linear Regression`

In simple linear regression, a dependent variable y is predicted from one predictor variable x.

y = intercept + slope * x

also written as y = b * x + A, where b is the slope and A is the intercept

x - the independent (predictor) variable, which forms the design matrix

y - the dependent or response variable
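
As a worked example of the equation above, here is a minimal least-squares sketch. It is written in plain Java rather than with the Apache math library, so that it runs without extra dependencies. The class name, method name and data points are all made up for illustration.

```java
// Hypothetical sketch: ordinary least squares fit of y = intercept + slope * x.
class SimpleRegressionSketch {

    // Returns {intercept, slope} for the best-fitting straight line.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0; // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[] {intercept, slope};
    }

    public static void main(String[] args) {
        // Illustrative data lying exactly on the line y = 1 + 2x.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] line = fit(x, y);
        System.out.println("y = " + line[0] + " + " + line[1] + " * x"); // prints "y = 1.0 + 2.0 * x"
    }
}
```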

`Multiple Linear Regression`

In multiple regression, the dependent variable is predicted from two or more predictor variables.

With two predictor variables the equation is y = b1 * x1 + b2 * x2 + A

The values of b (b1 and b2) are sometimes called “regression coefficients” and sometimes “regression weights.”

In matrix form: Y = X * b + u
where Y is an n-vector of observations of the dependent variable (the regressand), X is an [n,k] matrix whose k columns are called regressors, b is a k-vector of regression parameters and u is an n-vector of error terms or residuals.

Here X[n,k] holds n observations (rows) of k independent variables (columns), and Y is an [n,1] matrix/array.

A large coefficient of determination (R²) suggests that a linear model fits the data well: R² = 1 indicates that the fitted model explains all the variability in y, while R² = 0 indicates no linear relationship.
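
To make the two extremes concrete, here is a small sketch that computes R² as 1 - SS_res / SS_tot for a set of fitted values. The class name and the sample numbers are hypothetical.

```java
// Hypothetical sketch: coefficient of determination for a set of fitted values.
class RSquaredSketch {

    // R^2 = 1 - SS_res / SS_tot, where SS_res sums the squared residuals and
    // SS_tot sums the squared deviations of the observations from their mean.
    static double rSquared(double[] observed, double[] fitted) {
        double mean = 0;
        for (double v : observed) {
            mean += v;
        }
        mean /= observed.length;

        double ssRes = 0;
        double ssTot = 0;
        for (int i = 0; i < observed.length; i++) {
            ssRes += (observed[i] - fitted[i]) * (observed[i] - fitted[i]);
            ssTot += (observed[i] - mean) * (observed[i] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }

    public static void main(String[] args) {
        double[] y = {3, 5, 7, 9};
        double[] perfectFit = {3, 5, 7, 9}; // model reproduces every observation
        double[] meanOnly = {6, 6, 6, 6};   // model always predicts the mean of y
        System.out.println(rSquared(y, perfectFit)); // prints 1.0
        System.out.println(rSquared(y, meanOnly));   // prints 0.0
    }
}
```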

Statistics Terminology related to Regression

Dependence is any statistical relationship between two random variables or two sets of data.

Correlation refers to any of a broad class of statistical relationships involving dependence.

Variance measures how far a set of numbers is spread out. (A variance of zero indicates that all the values are identical.)
A non-zero variance is always positive: A small variance indicates that the data points tend to be very close to the mean (expected value) and hence to each other, while a high variance indicates that the data points are very spread out from the mean and from each other.

The square root of variance is called the standard deviation. The variance is one of several descriptors of a probability distribution.

Covariance is a measure of how much two random variables change together.
If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the smaller values, i.e., the variables tend to show similar behavior, the covariance is positive.
In the opposite case, when the greater values of one variable mainly correspond to the smaller values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative.

The sign of the covariance therefore shows the tendency in the linear relationship between the variables.
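
The definitions above can be sketched in a few lines of Java. This version assumes the population form (dividing by n rather than n - 1). The class name and sample data are invented for illustration.

```java
// Hypothetical sketch: population variance, standard deviation and covariance.
class SpreadSketch {

    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Population covariance: average product of the deviations from each mean.
    static double covariance(double[] x, double[] y) {
        double meanX = mean(x);
        double meanY = mean(y);
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += (x[i] - meanX) * (y[i] - meanY);
        }
        return sum / x.length;
    }

    // The variance is the covariance of a variable with itself.
    static double variance(double[] values) {
        return covariance(values, values);
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(variance(data));            // prints 4.0
        System.out.println(Math.sqrt(variance(data))); // standard deviation, prints 2.0

        double[] x = {1, 2, 3};
        double[] rising = {2, 4, 6};  // moves with x: positive covariance
        double[] falling = {6, 4, 2}; // moves against x: negative covariance
        System.out.println(covariance(x, rising) > 0);  // prints true
        System.out.println(covariance(x, falling) < 0); // prints true
    }
}
```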

Reference:
www.wikipedia.org

HBase for beginners

A good starting point for understanding the basic concepts of HBase: http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

Hadoop Cluster Maintenance Tips

If disk space is full and you are getting resource availability errors, try deleting old distributed cache files and old job logs at the paths given below; this usually frees up enough space.

/var/log/hadoop-0.20-mapreduce/userlogs

./mapred/local/taskTracker/cloudera/distcache
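
Before deleting anything it can help to see which files are actually old. The sketch below only lists candidates and deletes nothing. The 7-day threshold, the class name and the use of the first path above as a default are my assumptions, not Hadoop settings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: list (not delete) regular files older than a given age.
class OldFileFinder {

    // Returns every regular file under root whose last-modified time is older than maxAge.
    static List<Path> findOlderThan(Path root, Duration maxAge) {
        Instant cutoff = Instant.now().minus(maxAge);
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> lastModified(p).isBefore(cutoff))
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Instant lastModified(Path p) {
        try {
            return Files.getLastModifiedTime(p).toInstant();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Default to the job log path from this post; pass another directory as an argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "/var/log/hadoop-0.20-mapreduce/userlogs");
        if (!Files.isDirectory(root)) {
            System.out.println("Not a directory: " + root);
            return;
        }
        // The 7-day age threshold is an arbitrary example value.
        findOlderThan(root, Duration.ofDays(7)).forEach(System.out::println);
    }
}
```

Review the printed list first, and only then remove the files by hand or with your usual cleanup tooling.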

MPP (Massively Parallel Processing)/Greenplum DB

MPP (Massively Parallel Processing)/Greenplum made easy: http://dwarehouse.wordpress.com/2012/12/28/introduction-to-massively-parallel-processing-mpp-database/#comment-127

MapReduce and MPP

An interesting article: "MapReduce and MPP: Two sides of the Big Data coin?"

MongoDB Schema Design Rules

Always pre-join (embed) entities rather than joining them at query time, as is common in an RDBMS. In MongoDB the only way to join at run time is through application logic, which is costly and clumsy.
This strategy also serves the purpose of the constraints that are available in an RDBMS. The mantra is: pre-join (embed) at the schema level.
Although MongoDB (as of this writing) does not have multi-document transactions, operations on a single document are atomic, so embedded data can be updated atomically and read as a consistent view.
For a one-to-many relationship where the "many" side is very large, linking collections is recommended. The same is recommended for many-to-many relationships.

Influencers for when to embed and when to link

Frequency of access: link data that is accessed less often, to reduce the working set size of your application.

Size of items: link if the combined size of the documents is larger than 16MB, the maximum document size.

Atomicity of the data: embed data that must be updated atomically.

Pre-join sample: A Product Catalog record

{
  sku: "00e8da9b",
  type: "Audio Album",
  title: "A Love Supreme",
  description: "by John Coltrane",
  asin: "B0000A118M",

  shipping: {
    weight: 6,
    dimensions: {
      width: 10,
      height: 10,
      depth: 1
    },
  },

  pricing: {
    list: 1200,
    retail: 1100,
    savings: 100,
    pct_savings: 8
  },

  details: {
    title: "A Love Supreme [Original Recording Reissued]",
    artist: "John Coltrane",
    genre: [ "Jazz", "General" ],
        ...
    tracks: [
      "A Love Supreme Part I: Acknowledgement",
      "A Love Supreme Part II - Resolution",
      "A Love Supreme, Part III: Pursuance",
      "A Love Supreme, Part IV-Psalm"
    ],
  },
}

Hive Partitions - How they look

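Consider a Hive internal (managed) table named xx with two partitioning columns, c and d. The underlying Hive data file for one partition is stored at the path below and looks as follows. The partition values appear only in the directory names (c=1/d=1), not in the file contents.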

File Path: /user/hive/warehouse/bt.db/xx/c=1/d=1/partitiontest

File Contents:

1,a
2,b
3,c
4,d
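
The directory layout can be sketched as a small helper that builds the partition directory from its parts. The class and method names are hypothetical, and the sketch assumes a non-default database stored under the warehouse root as <database>.db, as in the path above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: derive a managed table's partition directory from its parts.
class HivePartitionPathSketch {

    // Each partition column contributes one "column=value" directory level,
    // in the order in which the partition columns were declared.
    static String partitionDir(String warehouseRoot, String database, String table,
                               Map<String, String> partitionValues) {
        StringBuilder path = new StringBuilder(warehouseRoot)
                .append('/').append(database).append(".db")
                .append('/').append(table);
        for (Map.Entry<String, String> entry : partitionValues.entrySet()) {
            path.append('/').append(entry.getKey()).append('=').append(entry.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the partition columns in declaration order (c, then d).
        Map<String, String> partition = new LinkedHashMap<>();
        partition.put("c", "1");
        partition.put("d", "1");
        System.out.println(partitionDir("/user/hive/warehouse", "bt", "xx", partition));
        // prints /user/hive/warehouse/bt.db/xx/c=1/d=1
    }
}
```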

Hadoop Cluster Troubleshooting

Node health is bad, with reason: Heart beat failure
The CM agent is likely not running on that machine, perhaps because it was not configured to start at boot time. You should check and start it if it's not running.

service cloudera-scm-agent status
service cloudera-scm-agent start

How to take the NameNode out of safe mode

hadoop dfsadmin -safemode leave

Code Merge Mantra - Handy tip for Developers

Whether you are writing code for a product or an application, multiple versions and releases will be involved, and your organization will be using a version control system such as SVN. If you are working on a branch slated for release R1, which precedes release R2 in the timeline, your changes have to make their way into the R2 branch as well.

Usually an auto merge is configured from the R1 branch to the R2 branch, so that every change made in R1 is automatically forwarded to R2. This can have one of three outcomes:

1. The forward merge succeeds, so everything is all right.
2. The forward merge fails because of merge conflicts, and you are flooded with continuous mails asking you to correct the conflict, perform a manual merge in the R2 branch and resolve the merge failure.
3. Your merge waits for other forward merges from R1 to R2 to complete. These are commits made in R1 before yours.

The last two cases (2 and 3) are painful and should always be avoided. In the worst case your R1 change never makes its way into R2, which is a time bomb that can go off during the R2 release.

Here I explain how to avoid and overcome forward merge failures. Even though we want everything to be automated, in real life we sometimes have to take the manual route, and that applies to my suggestion too. After you have committed your changes into R1, wait for 5 minutes and check whether they have been merged automatically into the R2 branch. If not, don't wait: merge your changes into R2 manually.

How to make Manual Forward Merge

Start a manual merge with your source control merge tool. It asks which branch to merge from, in this case R1, and then lists the revisions available to merge. Select your latest revision, perform the merge and commit your changes. This ensures that you won't get any more merge failure mails and that your changes have made their way into the latest branch, R2.

Tips when auto merges are configured

Regular commits are generally recommended once your code compiles and does not break any other code, but when auto merges are configured your source control system becomes a double-edged sword. Every commit may lead to the scenarios explained above and cause more disturbance and work for you and for others (such as the configuration management team and other developers). So the best practice is to commit your changes only when you have finished testing and are sure that no more changes are required on the same piece of code. This reduces the number of additional manual forward merges.
