- Software: A major part of the software component of our cloud is HDFS, a distributed, Java-based file system capable of handling a large number of nodes storing petabytes of data. On top of the file system runs the MapReduce engine, which consists of a JobTracker and the TaskTrackers it manages. Client applications submit MapReduce jobs to this engine, and the JobTracker attempts to place the work near the data by pushing tasks out to the available TaskTracker nodes in the cluster. We have devised a number of programming projects for our courses based on this infrastructure, and with funding from our current projects we are making the following enhancements to the software infrastructure to support research and education in assured cloud computing (illustrative code sketches for the engine and for several of the enhancements follow the list):
- Handle encrypted sensitive data: Sensitive data, ranging from medical records to credit card transactions, needs to be stored using encryption techniques for additional protection. Currently, HDFS does not support secure and efficient query processing over encrypted data, and we are addressing this limitation in our research (a sketch of client-side encryption of data bound for HDFS appears after this list).
- Semantic web data management: Viable solutions are needed to improve the performance and scalability of queries against semantic web data such as RDF (Resource Description Framework). The number of RDF datasets keeps growing, yet the problem of storing billions of RDF triples and querying them efficiently has not been solved. At present HDFS has no built-in support for storing and retrieving RDF data, and we have addressed this limitation (a sketch of a simple triple-pattern query over RDF stored in HDFS follows the list).
- Fine-grained access control: HDFS does not provide fine-grained access control. Yahoo recently released a version of HDFS that provides access control lists. Unfortunately, for many applications, such as assured information sharing, access control lists are not sufficient, and more complex policies need to be supported. This limitation is being addressed in our current work.
- Strong authentication: The Yahoo version of HDFS supports network authentication protocols such as Kerberos for user authentication, together with encryption of data transfers. However, some assured information sharing scenarios also require a public key infrastructure (PKI) for digital signature support (a signing sketch follows the list).
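To make the job-submission path concrete, and as a template for the course programming projects mentioned above, the following is a minimal word-count job written against Hadoop's MapReduce API in the JobTracker/TaskTracker setting described earlier. It is a generic illustrative sketch rather than one of our research prototypes; exact class names and constructors vary slightly across Hadoop releases, and the input and output HDFS paths are supplied on the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts word occurrences in text files stored in HDFS. */
public class WordCount {

  /** Map phase: emit (word, 1) for every token in the input split assigned to this task. */
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  /** Reduce phase: sum the counts collected for each word. */
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    // The client configures the job and submits it; the JobTracker then schedules
    // map tasks on TaskTracker nodes close to the HDFS blocks they read.
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job of this form is packaged into a jar and launched with the hadoop jar command, with both arguments naming directories in HDFS.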
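For the encrypted-data item, the sketch below only illustrates the setting, not our query-processing solution: a client encrypts a sensitive record with AES before it is written, so the DataNodes store ciphertext only. It uses the standard javax.crypto and Hadoop FileSystem APIs; the key handling is deliberately simplified (a real deployment would obtain keys from a key-management service and persist the IV alongside the ciphertext), and the path /secure/records.enc is made up for the example.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Encrypts a record on the client side so that HDFS only ever stores ciphertext. */
public class EncryptedHdfsWriter {
  public static void main(String[] args) throws Exception {
    // Generate a throwaway AES key; in practice the key would come from a key manager.
    KeyGenerator keyGen = KeyGenerator.getInstance("AES");
    keyGen.init(128);
    SecretKey key = keyGen.generateKey();

    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.ENCRYPT_MODE, key); // a random IV is generated here

    // Connect to the file system named in the Hadoop configuration (HDFS on our clusters).
    FileSystem fs = FileSystem.get(new Configuration());

    // Wrap the HDFS output stream so that only ciphertext reaches the DataNodes.
    try (CipherOutputStream out =
             new CipherOutputStream(fs.create(new Path("/secure/records.enc")), cipher)) {
      out.write("patient-id=42,diagnosis=...".getBytes("UTF-8"));
    }
    // cipher.getIV() must be stored as well, or the record cannot be decrypted later.
  }
}
```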
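For the semantic web item, the sketch below conveys the flavor of querying RDF stored in HDFS: triples are kept as plain N-Triples text, and a map-only job evaluates a single SPARQL-style pattern (?s &lt;predicate&gt; ?o) in parallel by emitting the subject and object of every triple whose predicate matches a URI passed in through the job configuration. This is an illustration of the idea rather than our actual storage and query layout; the configuration key rdf.predicate and the class names are invented for the example.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Evaluates one triple pattern (?s <predicate> ?o) over N-Triples files stored in HDFS. */
public class TriplePatternFilter {

  public static class PredicateMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String wantedPredicate;

    @Override
    protected void setup(Context context) {
      // The predicate URI to match is passed in through the job configuration.
      wantedPredicate = context.getConfiguration().get("rdf.predicate");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // An N-Triples line has the form: <subject> <predicate> <object> .
      String[] parts = line.toString().split("\\s+", 3);
      if (parts.length == 3 && parts[1].equals(wantedPredicate)) {
        String object = parts[2].replaceAll("\\s*\\.\\s*$", ""); // drop the trailing " ."
        context.write(new Text(parts[0]), new Text(object));     // emit a (?s, ?o) binding
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("rdf.predicate", args[2]); // e.g. <http://xmlns.com/foaf/0.1/knows>
    Job job = new Job(conf, "triple pattern filter");
    job.setJarByClass(TriplePatternFilter.class);
    job.setMapperClass(PredicateMapper.class);
    job.setNumReduceTasks(0); // map-only: each mapper writes its bindings directly
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // N-Triples data in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // bindings written here
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Joining several such patterns would require additional MapReduce stages, which is exactly where the scalability questions raised above arise.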
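For the strong-authentication item, the sketch below shows the basic signing flow that a PKI would anchor in an assured information sharing exchange: the sender signs the data with its private key, and the receiver verifies the signature with the sender's public key. Only the standard java.security API is used, and the throwaway RSA key pair stands in for the keys and certificates that a real PKI would issue.

```java
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

/** Signs shared data with a private key and verifies it with the matching public key. */
public class SignedSharing {
  public static void main(String[] args) throws Exception {
    // Generate a throwaway RSA key pair; a real deployment would use PKI-issued credentials.
    KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
    gen.initialize(2048);
    KeyPair pair = gen.generateKeyPair();

    byte[] document = "coalition report, releasable to partner B".getBytes("UTF-8");

    // Sender side: sign the document with the private key.
    Signature signer = Signature.getInstance("SHA256withRSA");
    signer.initSign(pair.getPrivate());
    signer.update(document);
    byte[] signature = signer.sign();

    // Receiver side: verify the signature with the sender's public key.
    Signature verifier = Signature.getInstance("SHA256withRSA");
    verifier.initVerify(pair.getPublic());
    verifier.update(document);
    System.out.println("signature valid: " + verifier.verify(signature));
  }
}
```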
- Hardware: At UTD we already have substantial hardware to support our research and education in assured cloud computing. Our current hardware includes four major clusters with different configurations. The first cluster is very small and generally serves as our test cluster; it consists of 4 nodes, each with a Pentium-IV processor, an 80 GB hard drive, and 1 GB of main memory. The second cluster is housed in the Security Analysis and Information Assurance Lab (SAIAL), which provides lab support, and has a total of 22 nodes, all of them commodity-class hardware on which Hadoop runs. This cluster has a mixed collection of hardware: 7 nodes each have a Pentium-IV processor, 360 GB of disk space, and 4 GB of main memory, and the remaining 15 nodes each have a Pentium-IV processor, about 290 GB of disk space, and 4 GB of main memory. The third cluster is also housed in the SAIAL and consists of 10 nodes, each with a Pentium-IV processor, 500 GB of disk space, and 4 GB of main memory. All of these nodes are connected to one another via a 48-port Cisco switch on an internal network, and on each cluster only the master node is accessible from the public network. The fourth cluster to which we have access is the Open Cirrus testbed from HP Labs. We have also incorporated 2 solid-state disks into the existing clusters.