Everyday getting better

Hadoop Cluster Maintenance Tips

If the disk is full and you are getting resource availability errors, try deleting old distributed cache files and old job logs at the paths given below. This usually frees up enough space.

/var/log/hadoop-0.20-mapreduce/userlogs

./mapred/local/taskTracker/cloudera/distcache
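The cleanup can be scripted. Below is a minimal sketch, not a vetted maintenance tool: the 7-day age threshold, the function names and the dry-run default are my own assumptions. It only lists what it would delete unless you pass dry_run=False.

```python
import os
import time


def find_old_files(root, max_age_days):
    """Return paths of regular files under root not modified in max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                    old.append(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return old


def clean(root, max_age_days=7, dry_run=True):
    """Count (dry_run=True) or delete (dry_run=False) old files.

    Returns the total size in bytes of the files affected.
    """
    total = 0
    for path in find_old_files(root, max_age_days):
        try:
            size = os.path.getsize(path)
            if not dry_run:
                os.remove(path)
            total += size
        except OSError:
            continue
    return total


if __name__ == "__main__":
    # Path taken from the post; the threshold of 7 days is an assumption.
    root = "/var/log/hadoop-0.20-mapreduce/userlogs"
    reclaimable = clean(root, max_age_days=7, dry_run=True)
    print(f"{root}: {reclaimable} bytes in files older than 7 days")
```

Run it in dry-run mode first and check the number it reports before deleting anything. Make sure no running job still needs the files.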

MPP (Massively Parallel Processing)/Greenplum DB

MPP (Massively Parallel Processing)/Greenplum made easy: http://dwarehouse.wordpress.com/2012/12/28/introduction-to-massively-parallel-processing-mpp-database/#comment-127

MapReduce and MPP

An interesting article: "MapReduce and MPP: Two sides of the Big Data coin?"

MongoDB Schema Design Rules

Always pre-join (embed) related entities rather than trying to join them at query time, as is common in an RDBMS. MongoDB has traditionally offered no server-side join (the $lookup aggregation stage arrived only in version 3.2), so a run-time join has to be done in application logic, which is costly and clumsy.
This strategy also serves the purpose of the constraints available in an RDBMS. The mantra is: pre-join (embed) at the schema level.
Even without multi-document transactions, operations on a single document are atomic, so pre-joined (embedded) data can still be updated atomically and read consistently.
In a one-to-many relationship where the "many" side is very large, linking collections is recommended. The same is recommended for many-to-many relationships.
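To make the difference concrete, here is a small sketch that uses plain Python dicts as stand-ins for collections. No MongoDB driver is involved, and the function and field names are hypothetical. The linked design needs two lookups and a join in application code; the embedded design needs one lookup.

```python
# Linked design: two "collections", joined in application code.
artists = {1: {"_id": 1, "name": "John Coltrane"}}
albums_linked = [{"_id": 10, "title": "A Love Supreme", "artist_id": 1}]


def album_view_linked(album_id):
    """Two lookups plus a manual join, as a linked schema requires."""
    album = next(a for a in albums_linked if a["_id"] == album_id)
    artist = artists[album["artist_id"]]  # second lookup
    return {"title": album["title"], "artist": artist["name"]}


# Embedded (pre-joined) design: the artist lives inside the album document.
albums_embedded = [
    {"_id": 10, "title": "A Love Supreme", "artist": {"name": "John Coltrane"}}
]


def album_view_embedded(album_id):
    """One lookup; the data is already joined at the schema level."""
    album = next(a for a in albums_embedded if a["_id"] == album_id)
    return {"title": album["title"], "artist": album["artist"]["name"]}
```

Both functions return the same view. With a real database, each extra lookup in the linked version is a separate round trip.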
Factors that influence when to embed and when to link

Frequency of access

Embed data that is read together; link data that is rarely accessed, to reduce the working set size of your application.

Size of items

If the combined size of the documents would be larger than 16 MB (MongoDB's maximum document size), link instead of embedding.

Atomicity of the data
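The size factor can be checked roughly before you commit to a schema. The sketch below is an approximation under stated assumptions: it measures the JSON encoding rather than the real BSON size, and the 0.5 headroom factor and function names are my own choices, not MongoDB guidance.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024  # the 16 MB per-document limit


def approx_size_bytes(doc):
    """Rough estimate: UTF-8 length of the JSON encoding (not exact BSON size)."""
    return len(json.dumps(doc).encode("utf-8"))


def should_link(parent, children, headroom=0.5):
    """Suggest linking when parent plus embedded children would exceed
    headroom * 16 MB. The headroom factor is an arbitrary assumption that
    leaves room for the document to grow."""
    combined = dict(parent, children=children)
    return approx_size_bytes(combined) > headroom * MAX_DOC_BYTES
```

For example, should_link({"sku": "00e8da9b"}, reviews) tells you whether embedding a hypothetical reviews list in the product document is getting close to the limit.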

Pre-join sample: A Product Catalog record

{
  sku: "00e8da9b",
  type: "Audio Album",
  title: "A Love Supreme",
  description: "by John Coltrane",
  asin: "B0000A118M",

  shipping: {
    weight: 6,
    dimensions: {
      width: 10,
      height: 10,
      depth: 1
    },
  },

  pricing: {
    list: 1200,
    retail: 1100,
    savings: 100,
    pct_savings: 8
  },

  details: {
    title: "A Love Supreme [Original Recording Reissued]",
    artist: "John Coltrane",
    genre: [ "Jazz", "General" ],
        ...
    tracks: [
      "A Love Supreme Part I: Acknowledgement",
      "A Love Supreme Part II - Resolution",
      "A Love Supreme, Part III: Pursuance",
      "A Love Supreme, Part IV-Psalm"
    ],
  },
}

Hive Partitions - How they look

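Consider a Hive internal (managed) table named xx with two partitioning columns, c and d. The underlying data file sits at the path given below. Note that the partition column values are not stored in the file itself; they are encoded in the directory names (c=1/d=1).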

File Path: /user/hive/warehouse/bt.db/xx/c=1/d=1/partitiontest

File Contents:

1,a
2,b
3,c
4,d
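The layout above can be read back with a few lines of code. This is only an illustration of how the partition values come from the path rather than the file. The column names id and name are hypothetical (the post does not name the table's regular columns), and the sketch ignores the escaping Hive applies to special characters in partition values.

```python
def partition_values(path):
    """Extract partition columns from 'col=value' path components, in order."""
    parts = {}
    for component in path.strip("/").split("/"):
        if "=" in component:
            col, _, value = component.partition("=")
            parts[col] = value
    return parts


def read_rows(file_text, path, columns):
    """Combine each comma-delimited line with the partition values from the path."""
    extra = partition_values(path)
    rows = []
    for line in file_text.splitlines():
        if line.strip():
            row = dict(zip(columns, line.split(",")))
            row.update(extra)  # partition columns are appended to every row
            rows.append(row)
    return rows


path = "/user/hive/warehouse/bt.db/xx/c=1/d=1/partitiontest"
rows = read_rows("1,a\n2,b\n3,c\n4,d\n", path, columns=["id", "name"])
```

Every row read from this file carries c=1 and d=1, which is what a SELECT on the table would show for this partition.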

Hadoop Cluster Troubleshooting

Node health is bad, with reason: Heart beat failure
The Cloudera Manager (CM) agent is likely not running on that machine, perhaps because it was not configured to start at boot time. Check its status and start it if it is not running.

service cloudera-scm-agent status
service cloudera-scm-agent start
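If you want to script the two commands above, a sketch could look like the one below. It assumes that "service ... status" returns exit code 0 when the agent is running, which is the usual init-script convention but is not something I have verified for every Cloudera version. The run parameter is there so the logic can be tried without touching a real service.

```python
import subprocess

SERVICE = "cloudera-scm-agent"


def ensure_running(run=subprocess.call):
    """Start the agent if `service ... status` reports a non-zero exit code.

    `run` takes an argument list and returns the exit code.
    """
    if run(["service", SERVICE, "status"]) == 0:
        return "already running"
    rc = run(["service", SERVICE, "start"])
    return "started" if rc == 0 else f"start failed (exit {rc})"
```

On a real node, calling ensure_running() with no arguments runs the commands through subprocess.call and normally needs root privileges.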

How to take the NameNode out of safe mode

hdfs dfsadmin -safemode leave
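A slightly more careful variant checks the current state first. The sketch below assumes the hdfs launcher is on the PATH and that "-safemode get" prints a line containing "Safe mode is ON" when safe mode is active. The function names are my own.

```python
import subprocess


def _run(cmd):
    """Run a command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def leave_safe_mode(run=_run):
    """Leave safe mode only if `-safemode get` reports it is ON.

    Returns True if a leave command was issued, False otherwise.
    """
    status = run(["hdfs", "dfsadmin", "-safemode", "get"])
    if "Safe mode is ON" not in status:
        return False  # nothing to do
    run(["hdfs", "dfsadmin", "-safemode", "leave"])
    return True
```

Keep in mind that the NameNode may be in safe mode for a good reason, such as missing blocks, so forcing it out does not fix the underlying problem.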

Code Merge Mantra - Handy tip for Developers

Whether you are writing code for a product or an application, there will be multiple versions and releases involved, and your organization will most likely use a version control system such as SVN. If you are working on a branch slated for release R1, which precedes release R2 in the timeline, then your changes have to make their way into the R2 branch as well.

In most cases an auto merge is configured from the R1 branch to the R2 branch, so that every change made in R1 is automatically forwarded to R2. This can have one of three outcomes:

1. The forward merge succeeds, and everything is all right.
2. The forward merge fails because of merge conflicts, and you are flooded with mails asking you to resolve the conflict by performing a manual merge in the R2 branch.
3. Your merge waits on other forward merges from R1 to R2 to complete. These are commits made in R1 before yours.

The last two cases (2 and 3) are painful and should be avoided at all times. In the worst case your R1 change never makes its way into R2, which is a time bomb that can go off during the R2 release.

Here I am going to explain how to avoid and overcome forward merge failures. We would like everything to be automated, but in real life we sometimes have to take the manual route, and that applies to my suggestion too. After you have committed your changes to R1, wait for 5 minutes and check whether they have been merged into the R2 branch automatically. If not, don't wait any longer: merge your changes into R2 manually.

How to Perform a Manual Forward Merge

Use your source control merge tool to start the manual merge. It asks which branch to merge from, in this case R1, and then lists the revisions available to merge. Select your latest revision, perform the merge, and commit the result. This ensures that you won't get any more merge failure mails and that your changes have made their way into the latest branch, R2.
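If your version control system is SVN, the same check and merge can be done from the command line. The sketch below is one possible way, not the only one: the repository URLs are hypothetical placeholders, and it relies on "svn mergeinfo --show-revs eligible" to list the R1 revisions that have not yet reached R2.

```python
import subprocess

# Hypothetical repository URLs; substitute your own branches.
R1_URL = "https://svn.example.com/repo/branches/R1"
R2_URL = "https://svn.example.com/repo/branches/R2"


def _svn(args):
    """Run an svn subcommand and return its stdout as text."""
    result = subprocess.run(["svn"] + args, capture_output=True, text=True, check=True)
    return result.stdout


def pending_revisions(source=R1_URL, target=R2_URL, svn=_svn):
    """Revisions on source not yet merged into target, e.g. ['r101', 'r105'].

    svn marks partially merged revisions with a trailing '*'; it is stripped here.
    """
    out = svn(["mergeinfo", "--show-revs", "eligible", source, target])
    return [token.rstrip("*") for token in out.split()]


def manual_merge_command(revision, source=R1_URL, wc_path="."):
    """Build the cherry-pick merge command to run inside an R2 working copy."""
    return ["svn", "merge", "-c", revision.lstrip("r"), source, wc_path]
```

If your revision still shows up in pending_revisions() after the wait, run the command from manual_merge_command() in an R2 working copy, resolve any conflicts, and commit.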

Tips when auto merges are configured

Committing regularly, as soon as your code compiles and does not break any other code, is generally recommended. But when auto merges are configured, your source control system becomes a double-edged sword: every commit can lead to the scenarios explained above and cause more disturbance and work for you and for others (such as the configuration management team and other developers). So the best practice is to commit your changes only when you have finished testing and are sure that no more changes are required to the same piece of code. This reduces the number of additional manual forward merges.

Saturday, June 13, 2015
