…On AWS, in High Availability, Auto scalable and Multi AZ support.
Back in 2013 we (at Alfresco) released an AWS CloudFormation template that allows you to deploy an Alfresco Enterprise cluster in Amazon Web Services, and I talked about it here.
Today, I’m proud to announce that we have rewritten that template to make it work with our new, modern stack and version of Alfresco One (5.1). We also put a lot of effort into making this new template an Alfresco One Reference Architecture: it is highly available, it uses our latest automation tools such as Chef-Alfresco, and it builds on our tuning experience and on the architecture and security best practices we learned during our latest benchmarks. In addition, to make it faster to deploy, we are using the official Alfresco One AMI published in the AWS Marketplace.
I’d like to mention some features you will find in this new template:
All Alfresco and Index nodes will be placed inside a Virtual Private Cloud (VPC).
Each Alfresco and Index node will be in a separate Availability Zone (same Region).
We use Alfresco One 126.96.36.199 with Alfresco Office Services and the Google Docs plugin.
All configuration is done automatically using Chef-Alfresco; you don’t really need to know Chef to make this work.
An Elastic Load Balancer instance with “sticky” sessions based on the Tomcat JSESSIONID.
Shared content store is in an S3 bucket.
MySQL database on RDS instances in Multi-AZ mode.
We use a pre-baked AMI: our official Alfresco One AMI published in the AWS Marketplace, based on CentOS 7.2, with an all-in-one configuration that we reconfigure automatically to fit this architecture and save time.
Auto-scaling rules that will add extra Alfresco and Index nodes when certain performance thresholds are reached.
HTTPS access to Alfresco Share is not enabled by default, but everything is in place to enable it.
As a result of this deployment you will get this environment:
In the video below you can see a quick demo of how to deploy this infrastructure with just a few minutes of user intervention. Isn’t it cool? Do you know how much time you save doing it this way? You can also set up production and test environments in exactly the same way: faster, easier and cheaper!
The main requirement on the shared storage is being able to cross-mount the storage between the Alfresco servers. Whether this is done via a NAS or a SAN is partly a decision about which technology your organization’s IT department can best support. Faster storage has a positive effect on the performance of the system; Alfresco recommends a throughput of 200 MB/sec.
NAS allows us to mount the content store via NFS or CIFS on all Alfresco servers, and they are able to read/write the same file system at the same time. The only real requirement is that the OS on which Alfresco is installed supports NFS (which is any Linux box actually). NFS tends to be cheaper and easier, but is not the fastest option. It is typically sufficient, though.
SAN is typically faster and more reliable, but obviously more expensive and complex (dedicated hardware and configuration requirements). In order to read/write from all Alfresco servers from/to the SAN, special file system types are necessary. On Red Hat we use GFS2; other Linux flavors use OCFS or one of many alternatives.
You may be wondering whether multiple Alfresco servers writing to the same LUN could result in corruption (especially in header files). NAS (NFS/CIFS) takes care of that issue by itself; with a SAN, however, the file system must be managed properly to allow reads and writes from multiple servers. From the Alfresco standpoint, you don’t have to worry about this with either the SAN or the NAS approach, because Alfresco manages the I/O so that no collisions or corruption occur.
Note: If using a SAN, ensure the file system is managed properly to allow for read/write from multiple servers.
I also wanted to share this presentation, which I gave internally some time ago and which I think is still useful.
Alfresco Office Services is the new implementation made for Alfresco One (formerly Enterprise) that allows a user to “Edit Online” a document with MS Office straight from Alfresco Share and provides a fully compatible SharePoint repository. This new implementation replaces the existing VTI (Microsoft Office SharePoint Protocol Support) already in Alfresco Community.
With Alfresco Office Services (AOS) you can access Alfresco directly from your Microsoft Office applications. This means that you can browse, open, and save Microsoft Office files (Word, PowerPoint, and Excel) in Alfresco without the need to access Alfresco through Chrome, Firefox, or another web browser. (See the official documentation here.) With AOS you can also connect Alfresco as a network drive or shared folder.
The main differences between this new AOS and the existing VTI are:
The embedded Jetty server has been removed, so there is no need to use port 7070.
You can select the document type when saving a document from MS Office to Alfresco and fill in the type properties within MS Office.
AOS is part of the core, so there is no need to install an additional AMP.
A new ROOT.war and _vti_bin.war have to be deployed (both are included in Alfresco One). If you are upgrading from a previous version, please check this information.
Here is a quick demo about how it works and how to use it:
If you often read this blog, you may already know what Alfresco is and how it works. As per the Alfresco Wiki: A Web Script is simply a service bound to a URI which responds to HTTP methods such as GET, POST, PUT and DELETE. While using the same underlying code, there are broadly two kinds of Web Scripts: data and presentation Web Scripts.
The book shows the reader what they need to know to be a web script developer: understanding the Alfresco web script framework and how it works, its components and architecture, writing a web script from scratch, the types and options of web scripts with their components, how to use them from third party applications (which is very interesting in order to integrate Alfresco with other systems), embedding Java in web scripts (also known as Java-backed web scripts), and using web scripts with JavaScript as well. It also covers all deployment options, debugging and troubleshooting, and the very important Maven options available for web script deployments.
I liked this book because it goes from very foundational information to really deep concepts, so if you are looking to start learning web scripts from scratch and go beyond, it is a good single point of reference. This is a pure web scripts book. If you are looking for a book updated for 5.0, this is not your book, because it doesn’t cover Aikau, but remember that it covers the most important topics to start working with the different flavours of web scripts. And after all, it is aimed at both beginners and advanced developers.
During the last few years, I have seen dozens of Alfresco installations in production without any kind of tuning. That makes me think that 1) nobody cares about performance or 2) nobody cares about documentation or 3) both of them!
I know people prefer to read a blog post instead of the official product documentation. Alfresco has improved its official documentation A LOT, and most of the information provided below can be found there, but I want to point out some tips that EVERYONE has to take into account before going live with an Alfresco environment. Remember, it’s easy: Tuning = Live, No Tuning = Dead.
Tuning the Alfresco side:
Increase number of concurrent connections to the DB in alfresco-global.properties
# Number below has to be the maxThreads value + 75
Increase number of threads that Tomcat will use in server.xml – section 8080, 8443 and 8009 in case you use AJP
Adjust the amount of memory you want to assign to Alfresco in setenv.sh or ctl.sh (which is the default one):
export CATALINA_OPTS=" -Xmx16G -Xms16G"
In JAVA_OPTS, make sure you have the “-server” flag, which gives 1/3 of the memory to new objects. Do not use “-XX:NewSize=” unless you know what you are doing: Solr creates many new objects and will need more than 1G in production.
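The sizing rule in the comment above (pool size = Tomcat maxThreads + 75) is plain arithmetic, so here is a minimal Python sketch of it. The function name and the example value of 200 threads are illustrative assumptions, not values taken from a real installation; the resulting number is what you would set as the pool maximum (in Alfresco this is typically the db.pool.max property).

```python
def db_pool_max(max_threads: int, headroom: int = 75) -> int:
    """Return a DB pool size following the 'maxThreads + 75' rule of thumb."""
    if max_threads <= 0:
        raise ValueError("max_threads must be positive")
    return max_threads + headroom


if __name__ == "__main__":
    # 200 is a hypothetical maxThreads value, used only for illustration.
    print(db_pool_max(200))
```

If you raise maxThreads in server.xml later, recompute the pool size so the two values stay in step.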
I’m not very good with Windows, so I will cover only a few tips for Linux:
Change the limits in /etc/security/limits.conf for the user who is running your app server, for example “tomcat”:
tomcat soft nofile 4096
tomcat hard nofile 65535
If you start Alfresco with a su -c option from /etc/init.d/, on Ubuntu you have to uncomment the pam_limits.so line in /etc/pam.d/su; for sessions that go through login (for example by ssh) it is uncommented by default. On RedHat/CentOS this line has to be uncommented in /etc/pam.d/system-auth.
Your storage throughput should be greater than 200 MB/sec and this can be checked by:
# hdparm -t /dev/sda
Timing buffered disk reads: 390 MB in 3.00 seconds = 129.85 MB/sec
Allow more concurrent requests by editing /etc/sysctl.conf
A full server reboot is a good preventive measure before going live: it confirms that all needed services start on their own in case of a contingency, and it shows whether we left something out of the configuration.
Remember, this is ONLY A SHORTLIST; you can do much more depending on your use case. Reading the documentation and taking our official training will be helpful, so take advantage of the fact that we have been polishing our training materials lately.
The Alfresco security check list is a list of elements to check before going live with an Alfresco installation in a production environment. This check list is part of the Alfresco Security Best Practices Guide, but I wanted to give it its own post in case you missed it (things that happen due to the 30+ pages of the guide).
As part of the work I’m doing for the upcoming Alfresco Summit, where I will be talking about my favorite topic, “Security and Alfresco”, I have written a few lines about Alfresco node deletion, how it works and why it is important to take it into account in terms of security control.
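The two Linux checks above (open-file limits and storage throughput) can be scripted before going live. Below is a minimal Python sketch, assuming it runs as the same user as the app server and that you paste in the output of `hdparm -t` yourself; the function names are made up for this example, and the thresholds are the ones quoted in this post.

```python
import re
import resource

# Matches the "= 129.85 MB/sec" tail of an `hdparm -t` result line.
HDPARM_RATE = re.compile(r"=\s*([0-9]+(?:\.[0-9]+)?)\s*MB/sec")


def parse_hdparm_rate(output: str) -> float:
    """Extract the MB/sec figure from `hdparm -t` output."""
    match = HDPARM_RATE.search(output)
    if match is None:
        raise ValueError("no MB/sec figure found in hdparm output")
    return float(match.group(1))


def throughput_ok(output: str, minimum_mb_s: float = 200.0) -> bool:
    """True if the measured rate is greater than the recommended minimum."""
    return parse_hdparm_rate(output) > minimum_mb_s


def limit_at_least(value: int, wanted: int) -> bool:
    """True if an rlimit value is unlimited or at least `wanted`."""
    return value == resource.RLIM_INFINITY or value >= wanted


def nofile_ok(soft: int, hard: int,
              want_soft: int = 4096, want_hard: int = 65535) -> bool:
    """Compare open-file limits with the limits.conf values shown above."""
    return limit_at_least(soft, want_soft) and limit_at_least(hard, want_hard)


if __name__ == "__main__":
    # Limits of the current process; only meaningful when run as the app user.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("nofile limits ok:", nofile_ok(soft, hard))
    sample = "Timing buffered disk reads: 390 MB in 3.00 seconds = 129.85 MB/sec"
    print("throughput ok:", throughput_ok(sample))
```

With the sample hdparm line from this post, the throughput check fails, because 129.85 MB/sec is below the 200 MB/sec recommendation.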
I just wanted to clarify how Alfresco works when a content item is deleted and also how content deletion works in Records Management (RM). Basic content deletion is already very well explained in this Ixxus blog post, but there are some differences in the database schema between Alfresco 4.1 and 4.2 worth noting; for example, the alf_node table has a field named ‘node_deleted’ in versions 4.0 and earlier.
To develop a deep knowledge of Alfresco security, and also of how to configure Alfresco backup and disaster recovery, you first need to understand how the Alfresco repository manages the lifecycle of a content item.
Node creation:
When a node is created, regardless of how it is uploaded or created in Alfresco (via the API, web UI, FTP, CIFS, etc.), Alfresco will do the following:
Metadata properties are stored in the database in the logical store workspace://SpacesStore (alf_node, alf_content_url among others).
The file itself is stored and renamed as .bin under alf_data/contentstore/YYYY/MM/DD/hh/mm/url-id-of-the-file.bin
Next, depending on the indexing engine you chose, its index entries are created within Lucene (alf_data/lucene-indexes/workspace/SpacesStore) or Solr (alf_data/solr/workspace/SpacesStore).
Finally, in most cases, a content thumbnail is created as a child of the file created.
There are two phases to node deletion:
Phase 1- A user or admin deletes a content item (sending it to the trashcan):
When someone deletes a content item, the content and its children (e.g. thumbnails) are moved (archived) within the DB from workspace://SpacesStore to archive://SpacesStore. Nothing else happens in the DB.
The actual content “.bin” file remains in the same location inside the contentstore directory.
Finally, the indexes are moved from the existing location to the corresponding archive location: Lucene (alf_data/lucene-indexes/archive/SpacesStore) or Solr (alf_data/solr/archive/SpacesStore), depending on your index engine selection.
NOTE: A deleted node stays in the trashcan FOREVER, unless the user or admin either empties the trashcan or recovers the file. This default behavior can be changed by using third party modules that empty the trashcan automatically on a custom schedule. See below for more information on these modules.
The trashcan may be found at these locations:
Alfresco Share: User -> My Profile -> Trashcan (the admin user sees all users’ deleted files; since 4.2 all users can also see and restore their own deleted files).
Alfresco Explorer: User Profile -> Manage Deleted Items (for all users).
Phase 2- Any user or admin (or trashcan cleaner) empties the trashcan:
That means the content is marked as an “orphan” and, after a pre-determined amount of time elapses, the orphaned content item is moved from the alf_data/contentstore directory to the alf_data/contentstore.deleted directory.
Internally, at DB level, a timestamp (Unix format) is added to the alf_content_url.orphan_time field, and an internal process called contentStoreCleanerJobDetail checks how long the content has been orphaned. If it has been orphaned for more than 14 days (system.content.orphanProtectDays option), the .bin file is moved to contentstore.deleted. Finally, another process, nodeServiceCleanupJobDetail, purges all of its references in the database, and once the index knows the node has been removed, the indexes are purged as well.
NOTE: Alfresco will never delete content in the alf_data/contentstore.deleted folder. It has to be deleted manually or by a scheduled job configured by the system administrator.
By default, the contentStoreCleanerJobDetail runs every day at 4AM. It checks the age of each orphan node and, if it exceeds system.content.orphanProtectDays (14 days), moves the content to contentstore.deleted.
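The age check performed by the content store cleaner can be pictured with a few lines of Python. This is only a sketch of the rule as described in this post, not Alfresco’s code: the function name is hypothetical, and it takes datetime objects rather than the raw alf_content_url.orphan_time value so that it makes no assumption about how the timestamp is stored.

```python
from datetime import datetime, timedelta


def ready_for_contentstore_deleted(orphan_time: datetime,
                                   now: datetime,
                                   protect_days: int = 14) -> bool:
    """True once content has been orphaned for more than protect_days.

    protect_days mirrors system.content.orphanProtectDays (14 by default).
    """
    return now - orphan_time > timedelta(days=protect_days)


if __name__ == "__main__":
    orphaned = datetime(2014, 1, 1, 4, 0)
    print(ready_for_contentstore_deleted(orphaned, datetime(2014, 1, 10, 4, 0)))
    print(ready_for_contentstore_deleted(orphaned, datetime(2014, 1, 16, 4, 0)))
```

Because the job runs once a day, content becomes eligible on the first run after the protection period has fully elapsed.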
Additionally, the nodeServiceCleanupJobDetail runs every day at 9PM and purges information related to deleted nodes from the database.
Now that we understand how Alfresco works by default, let’s learn how to modify Alfresco’s behavior in order to clean the trashcan automatically:
There are several third party modules to achieve this, but I recommend the Alfresco Trashcan Cleaner by Alfresco’s very own Rui Fernandes. It can be found at https://code.google.com/p/alfresco-trashcan-cleaner/.
Once the amp is installed, you can use this sample configuration by copying it to alfresco-global.properties:
The options above configure the cleaner to run every hour on the half hour. It removes content from the trashcan and marks it as orphaned if it has been in the trashcan for more than 7 days, in batches of 1000 deletions every time it runs. To delete from the trashcan without waiting for any grace period, set the trashcan.daysToKeep property value to -1.
Can I configure Alfresco to avoid using contentstore.deleted and ensure it really deletes a file after the trashcan is cleaned?
Yes, this is possible by setting system.content.eagerOrphanCleanup=true in alfresco-global.properties. Once the trashcan is emptied, the file will not be moved to contentstore.deleted but will be deleted from the file system (contentstore). After that, nodeServiceCleanupJobDetail will purge any related information from the database. Content with the sys:temporary aspect behaves the same way.
So, what is the recommended configuration for a production server?
This is something you have to figure out based on your backup and disaster recovery strategy. See my Alfresco Summit presentation and white paper here: http://blyx.com/2013/12/04/my-talk-about-alfresco-backup-and-recovery-tool-in-the-alfresco-summit/.
If you have a proper backup strategy, you can offer your users a grace period of 30 days to recover their own deleted documents from the trashcan and, after the grace period, delete them simultaneously from the trashcan and the filesystem. This can be achieved by installing the previously mentioned trashcan-cleaner and with this configuration in alfresco-global.properties:
trashcan.daysToKeep=30
system.content.eagerOrphanCleanup=true
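As a rough illustration of how such a grace period plays out, the selection rule of a trashcan cleaner can be sketched in Python. This is not the module’s code: the function name, the dictionary input and the oldest-first ordering are assumptions made for the example; the 30-day grace period, the batch size of 1000 and the meaning of -1 come from the description in this post.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def select_for_purge(archived_at: Dict[str, datetime],
                     now: datetime,
                     days_to_keep: int = 30,
                     batch_size: int = 1000) -> List[str]:
    """Return ids of trashcan items whose grace period has expired.

    archived_at maps a node id to the time it was sent to the trashcan.
    A negative days_to_keep (for example -1) means no grace period at all.
    Oldest items come first (an assumption); at most batch_size per run.
    """
    if days_to_keep < 0:
        expired = list(archived_at.items())
    else:
        cutoff = now - timedelta(days=days_to_keep)
        expired = [(node, when) for node, when in archived_at.items()
                   if when < cutoff]
    expired.sort(key=lambda item: item[1])
    return [node for node, _ in expired[:batch_size]]


if __name__ == "__main__":
    trash = {
        "report.docx": datetime(2013, 12, 1),
        "draft.txt": datetime(2014, 1, 30),
    }
    print(select_for_purge(trash, now=datetime(2014, 1, 31)))
```

With eagerOrphanCleanup enabled, the items selected this way would also disappear from the file system instead of moving to contentstore.deleted.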
And what about Alfresco Records Management, does it work in the same way? How does record destruction work?
In the Records Management world you don’t tend to delete documents as often as in Document Management. When a content item is deleted from the RM file plan, it is considered to be a regular delete operation. This is rarely used and only done by RM admins when there is some justifiable reason, such as correcting a mistake that requires a record to be removed.
The only difference is that the deleted record by-passes the archive store, hence it never goes to the trashcan, it is marked as orphan once it is deleted. Then it will be moved to contentstore.deleted after orphanProtectDays or it is truly deleted if eagerOrphanCleanup is set as true.
Destruction of a record works in the same way as removal of a record, except that it by-passes the archive and immediately triggers the clean-up (eagerOrphanCleanup) process, so the content does not stay in the file system contentstore or contentstore.deleted.
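To keep the three cases apart (regular document delete, RM record delete, RM record destroy), here is a small Python decision sketch based only on the behavior described in this post. The enum, the function names and the returned strings are hypothetical labels for illustration, not Alfresco APIs.

```python
from enum import Enum


class Removal(Enum):
    DOCUMENT_DELETE = "document delete"  # regular delete in Share or Explorer
    RM_DELETE = "record delete"          # record removed from the RM file plan
    RM_DESTROY = "record destroy"        # destroy step of a disposition schedule


def goes_to_trashcan(removal: Removal) -> bool:
    """Only regular document deletes pass through the archive store."""
    return removal is Removal.DOCUMENT_DELETE


def binary_destination(removal: Removal, eager_orphan_cleanup: bool) -> str:
    """Where the .bin file ends up once the content is orphaned."""
    if removal is Removal.RM_DESTROY or eager_orphan_cleanup:
        return "deleted from contentstore"
    return "moved to contentstore.deleted after orphanProtectDays"


if __name__ == "__main__":
    for removal in Removal:
        print(removal.value, goes_to_trashcan(removal),
              binary_destination(removal, eager_orphan_cleanup=False))
```

The sketch makes the key difference visible: destroy always removes the binary, whereas the two delete operations depend on the eagerOrphanCleanup setting.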
As far as the meta-data goes, there are two options. In the first, all the meta-data (and hence the node itself) is completely deleted. In the alternative, all the content is cleaned out but the node remains with only its meta-data (called ghosting). In Alfresco RM versions before 2.2 this was a global configuration value (rm.ghosting.enabled=true); in 2.2 it can be defined on the destroy step of the disposition schedule: “Maintain record metadata after destroy”.
Some final words on content deletion:
As we have seen, Alfresco offers different ways to delete content. It is important to remember that even if Alfresco completely deletes content, such as when using the destroy option in RM or eagerOrphanCleanup, Alfresco will not wipe the removed content from the physical storage, so it can still be recovered by file system recovery tools. Wiping a deleted content item varies depending on multiple factors, from file system type to hardware configuration. If you want to guarantee a real physical wipe of a file in your file system, third party software must be used to “zero out” the corresponding disk sectors. The specific tools depend on the operating system type, hardware, etc.
Thanks to my colleagues at Alfresco Kevin Dorr, Roy Wetherall for the Records Management section and Luis Sala for the document syntax review.
If you are not aware of what IFTTT is, I recommend you take a look at https://ifttt.com/wtf and then come back here to continue reading this blog post.
Here is a brief demo of this integration; more details and configuration steps are below.
Once you know what “if THIS then THAT” is, I want to explain how I have made a seamless integration with Alfresco using some very straightforward recipes that send information to Alfresco in the THAT (action) part of the recipe.
Since there is no Alfresco channel in IFTTT (yet), the data flows from almost any channel to Alfresco using “Send an email from GMAIL” to the Alfresco inbound email service (to a folder). In other words, this article is about how to send multiple kinds of data from several IFTTT channels to Alfresco through the inbound email feature built into Alfresco.
In this screenshot you can see a self-explanatory example:
When I like a picture on Instagram, it is sent to Alfresco. Once it is in Alfresco, we have a world of possibilities like transformations, workflows, publication, alerts, etc.
What do we need to get this working? Here is a list of steps to get this ready to go:
1- Enable your Inbound Email service in Alfresco:
For Alfresco One 4.2 this is very easy by using the new Admin Console http://localhost:8080/alfresco/service/enterprise/admin/admin-inboundemail. Explanation below.
For Alfresco Community refer to here http://docs.alfresco.com/community/concepts/community-videos-12.html and here http://docs.alfresco.com/community/concepts/email-inboundsmtp-props.html
As you can see in the screenshot above, I have made some changes to allow only emails from @blyx.com and from @alfresco.com; anyone inside Alfresco who is a member of the EVERYONE group can send emails to a folder with the email alias aspect. My server runs on Linux with a non-root user, which is why I set port 1025; I have a port redirect to listen on port 25 from the internet. Examples of port redirects are here: http://docs.alfresco.com/community/tasks/fileserv-CIFS-useracc.html.
In the example I have created a folder called “Drafts” with the aspect Aliasable (Email):
Edit this folder’s properties and add a new value for the Alias property, in my case drafts, which will be the email address alias of this folder, like [email protected] (alias + @ + server FQDN). I don’t have to create an MX DNS record because I’m using the FQDN.
Now, I’m ready to send an email from an existing Alfresco user (and with permissions to create content) to Alfresco, in my case [email protected] is the user toni in Alfresco.
2- Create an IFTTT recipe like the one shown in the video above.
3- Enjoy thousands of ways to add content to your Alfresco!
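If you want to test step 1 before wiring up IFTTT, you can send a message to the folder alias yourself. The Python sketch below builds the address as alias + @ + server FQDN and sends it over SMTP. The host name, sender address and function names are hypothetical placeholders; port 1025 matches the setup described in this post, and the sender must be the email address of an Alfresco user who is allowed to create content in the folder.

```python
import smtplib
from email.message import EmailMessage


def folder_address(alias: str, server_fqdn: str) -> str:
    """Email address of an aliased folder: alias + @ + server FQDN."""
    return f"{alias}@{server_fqdn}"


def build_message(sender: str, alias: str, server_fqdn: str,
                  subject: str, body: str) -> EmailMessage:
    """Build a plain-text email addressed to the aliased folder."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = folder_address(alias, server_fqdn)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_to_alfresco(msg: EmailMessage, host: str, port: int = 1025) -> None:
    """Deliver the message to the Alfresco inbound SMTP listener."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    # alfresco.example.com and the sender are placeholders, not real values.
    message = build_message("toni@example.com", "drafts",
                            "alfresco.example.com",
                            "Test from Python", "Hello Alfresco")
    print(message["To"])
    # send_to_alfresco(message, "alfresco.example.com")  # needs a live server
```

If the message is accepted, it should show up as new content in the aliased folder, which is the same path the IFTTT Gmail action uses.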