Alfresco feedback

Below is a transcript of the email from Jon Cox re: our link proposal.

----
From: Jon Cox [mailto:jon.cox@alfresco.org]
Sent: Thu 1/11/2007 5:07 PM
To: Kevin Cochrane
Subject: (forw) Re: MIT, link management & wikis

> Here's my proposal draft description thing of a possible Link
> Management System for Alfresco WCM:
> --------------------------------------
>
> Link Management System
> The WCM should provide a way to manage intra-site links.

I see link management as spanning the following areas:

     o Updating links that appear in pages
     o Updating the server's interpretation of URIs
     o Detecting & fixing problems

> If an HTML document within a managed site contains a link to another document
> within the site, the WCM should take care of this, not the content
> author. If a document is moved within the site or renamed, any links
> to it should be updated automatically.

   Sometimes when you move an asset, the intention is purely
   administrative/organizational; however, it's also very
   common for the rename to have semantic significance.
   Typically, you only want to "fix" links in the first
   case. Even there, to be scalable, it's important to
   allow the user/admin to make the proper tradeoff regarding
   when/how the "fixup" takes place.   In some situations,
   the best approach is a batch template recompilation process;
   other times, you want to do it incrementally via a
   level of indirection every time a link is clicked.

   For scalability, when the batch template recompilation
   method is used, you typically don't want to force
   the regeneration every time a change is made.   Also,
   you don't want to force *everything* to be fixed up
   whenever *something* is fixed up.   Sites can contain
   huge numbers of pages, so in order to strike the right
   balance between ease of development and runtime efficiency
   on the end-user-facing site, the key is to relax
   constraints within the low level ops, but then provide
   automatable high-level features that can be triggered
   by explicit user actions and/or by customizable workflows.

   As for runtime-resolved/computed links, these are nicest in
   areas where click-traffic is lower and you want a lot
   of flexibility to do stuff like page failover or search/heuristic
   driven page resolution (e.g.: turning certain links into
   something like Google's "I'm Feeling Lucky" button... only
   it looks like a normal bookmarkable link, not a button).

   It's my intention to provide rich support for both,
   and for a fair amount of mixing/matching between
   these extremes (e.g.: pre-compiled pages should be
   able to contain links that are resolved on-the-fly).
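A minimal sketch of that mix, assuming a made-up marker syntax: `{{static:ID}}` links are baked in at batch-compile time, while `{{dynamic:ID}}` links are left as indirect URLs that are looked up on every click. The names `LinkTable`, `compile_page`, and the `/resolve/` path are hypothetical and are not part of Alfresco.

```python
import re

# Hypothetical marker syntax: {{static:ID}} is fixed up at compile time,
# {{dynamic:ID}} is left as an indirect URL resolved on every click.
MARKER = re.compile(r"\{\{(static|dynamic):([\w-]+)\}\}")


class LinkTable:
    """Maps stable link ids to the current path of their target."""

    def __init__(self):
        self._paths = {}

    def set_path(self, link_id, path):
        self._paths[link_id] = path

    def resolve(self, link_id):
        """Click-time resolution: one lookup per request."""
        return self._paths[link_id]


def compile_page(source, table):
    """Batch step: bake static links in, route dynamic ones via /resolve/."""
    def substitute(match):
        kind, link_id = match.groups()
        if kind == "static":
            return table.resolve(link_id)
        return "/resolve/" + link_id
    return MARKER.sub(substitute, source)


table = LinkTable()
table.set_path("products", "/store/products.html")
table.set_path("news", "/news/2007/01.html")
page = ('<a href="{{static:products}}">Products</a> '
        '<a href="{{dynamic:news}}">News</a>')
compiled = compile_page(page, table)
```

In this sketch, moving the "news" target only requires a `set_path` call, whereas moving "products" leaves `compiled` stale until the page is recompiled. That is the tradeoff described above.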

A URI may be intended to denote a wide array of
different things. For example:

[1] A named location
(whatever happens to be on *this* splash page)

[2] A link to a specific version of an immutable asset
(an objid of a document within a DM system)

[3] A link to an immutable asset within an immutable closure
(a permalink within an archived edition of a website)

     [4] A function callsite.
         (server-parsed wiki terms, explicit GET URIs)

Automating the management of URIs that are created with different
semantics implies preserving whatever semantics were intended.

Case [1] is typically maintained automatically via templates
and validated by a link checker.

Cases [2] and [3] require that you *never* change things.
A system that updates these links is violating their semantics.

   Case [4] is most commonly seen in wikis, and behavior-driven sites
   (e.g.: links take you to different places, depending on some
   combination of the server's state, cookies, referer headers, etc).

   I'm assuming you would like to create some links of type [4],
   and update their interpretation within a server/servlet,
   rather than by modifying their literal "byte representation"
   within web pages. Is that correct?
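One way to make the four cases explicit is to tag each link with its intended semantics and let tooling consult the tag before touching anything. The sketch below is only an interpretation of the cases above; the names `LinkKind`, `may_rewrite_bytes`, and `resolved_at_request_time` are hypothetical.

```python
from enum import Enum


class LinkKind(Enum):
    NAMED_LOCATION = 1      # [1] whatever is at this location now
    IMMUTABLE_VERSION = 2   # [2] a specific version of an immutable asset
    ARCHIVED_PERMALINK = 3  # [3] an immutable asset in an immutable closure
    FUNCTION_CALLSITE = 4   # [4] interpreted by the server on each request


def may_rewrite_bytes(kind):
    """True if tooling may change the literal URI text inside a page.

    Assumption: only case [1] is regenerated via templates; cases [2]
    and [3] must never change, and case [4] changes its interpretation
    on the server rather than its text in the page.
    """
    return kind is LinkKind.NAMED_LOCATION


def resolved_at_request_time(kind):
    """True if the server decides the destination on each request."""
    return kind is LinkKind.FUNCTION_CALLSITE


rewritable = [kind.name for kind in LinkKind if may_rewrite_bytes(kind)]
```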

> This may represent a large shift in the way the Alfresco WCM
> product is implemented,

No, I've been planning for stuff related
to this topic for the past year.

> or it may be just a module which could be easily turned on
> or off.

Yes.

   It's probably not a matter of turning on/off a global
   switch, but instead just figuring out where you'd
   like such links to appear, how you'd like to see
   them denoted within pages so that server-side logic
   can distinguish them from other kinds of links,
   and how you'd like to manage this system.

> The following is a sketch of how this might work. Keep in mind that
> this is not the only way to accomplish this goal, and that details may
> be slightly or even wildly inaccurate.

There really are a lot of ways to go here.
After our upcoming release, I'll have a chance
to share more of the details of what's been
lurking in my notebook.

All feedback/input is welcome.
It's gratifying to see that you're excited about this stuff,
and might be interested in playing around with a hybrid
compile-time / runtime system. I think there's a huge
amount we can all learn about the best way to package
the developer, admin, and end-user controls.

> ---
>
> The WCM needs to maintain a set of Link Objects. Each Link would
> contain the following properties:
>
>   name: a unique human-readable name by which this Link can be
>     identified.
>   target: the target document of this resource, if the resource is
>     an intra-site document managed by the WCM, or null, if it is an
>     external document not managed by the WCM.
>   url: the location of the resource
>   title: the title of the document at the URL
>   dependencies: a list of documents managed by the WCM which refer
>     to this Link.
>
> Sometimes Links will be created automatically. For example, whenever a
> new document is created in the WCM, it will generate an associated Link
> (with the document as its target property). The properties of this
> Link are also maintained automatically. If the document is renamed or
> moved, this would trigger a change in the Link's url. If the document
> is retitled, this would trigger a change in the Link's title.

It sounds like you're talking about a few different
things here. We do plan on allowing you to
regenerate pages automatically when the XML and/or
template that created them is modified.

Roughly, the way this will work is for the system to
determine the transitive closure of all things that
need regeneration within a specified region of the
system, then do a topological sort to figure out the minimal
set of actions required to achieve the regeneration
(after identifying any dependency loops).
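The closure-then-sort idea can be sketched with Python's standard `graphlib` module (Python 3.9+). This is an illustration under assumptions, not Alfresco's implementation: the function name `regeneration_plan`, the shape of the `depends_on` map, and the file names are all made up for the example.

```python
from graphlib import CycleError, TopologicalSorter


def regeneration_plan(depends_on, changed, region):
    """Return the items in `region` to regenerate, in a safe order.

    depends_on maps each generated item to the inputs it is built from.
    """
    # Invert the map: input -> items built from it.
    dependents = {}
    for item, inputs in depends_on.items():
        for source in inputs:
            dependents.setdefault(source, set()).add(item)

    # Transitive closure of everything affected by the changed inputs.
    affected, stack = set(), list(changed)
    while stack:
        node = stack.pop()
        for item in dependents.get(node, ()):
            if item in region and item not in affected:
                affected.add(item)
                stack.append(item)

    # Topological sort restricted to affected items; a dependency loop
    # makes static_order() raise CycleError.
    graph = {item: set(depends_on[item]) & affected for item in affected}
    return list(TopologicalSorter(graph).static_order())


depends_on = {
    "products.html": ["products.xml", "page.ftl"],
    "index.html": ["products.html", "page.ftl"],
    "about.html": ["about.xml", "page.ftl"],
}
plan = regeneration_plan(depends_on, {"products.xml"}, set(depends_on))

try:
    regeneration_plan({"a.html": ["b.html", "x.xml"], "b.html": ["a.html"]},
                      changed={"x.xml"}, region={"a.html", "b.html"})
    loop_found = False
except CycleError:
    loop_found = True
```

Here only `products.html` and then `index.html` are regenerated; `about.html` is untouched because nothing it depends on changed.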

As for introducing a dynamic level of indirection in links,
that's fine... but in the end you still have object metadata
and/or relational queries and "business logic" to maintain,
performance, reliability and archiveability concerns, etc.
Again, there's no single best answer for all applications.
This is why I favor a highly modular approach.

Whichever compile/run-time choices are made, detecting
and fixing problems is critical. Therefore, we'll want
to make it very easy to hook in whatever sort of crawler
you want into the system (plus have something reasonable
available out-of-the-box).

There are a few features any self-respecting link validator
should have, such as controlling the depth, scope, timeouts,
mime-specific rules, flexible reporting, scriptability,
the ability to gather performance metrics, etc. It would
be nice to know what's most important to you... preferably
in the form of a prioritized list.
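To make two of those controls concrete, here is a small sketch of depth and scope limits in a link checker. It crawls an in-memory map of pages rather than a real site, so timeouts, mime rules, and metrics are left out; `check_links` and its parameters are hypothetical names.

```python
from collections import deque


def check_links(pages, start, max_depth=2, scope="/"):
    """Breadth-first crawl of an in-memory site; returns broken links.

    pages maps a path to the list of paths it links to. Pages up to
    max_depth hops from start are visited.
    """
    broken, seen = [], {start}
    queue = deque([(start, 0)])
    while queue:
        path, depth = queue.popleft()
        for target in pages.get(path, []):
            if not target.startswith(scope):
                continue  # out of scope: not checked, not followed
            if target not in pages:
                broken.append((path, target))
            elif target not in seen and depth < max_depth:
                seen.add(target)
                queue.append((target, depth + 1))
    return broken


pages = {
    "/index.html": ["/a.html", "/missing.html", "http://example.com/"],
    "/a.html": ["/b.html"],
    "/b.html": ["/gone.html"],
}
shallow = check_links(pages, "/index.html", max_depth=1)
deep = check_links(pages, "/index.html", max_depth=2)
```

The shallow run reports only the broken link on the start page; the deeper run also reaches `/b.html` and reports its broken link.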

> There will also be a Link creation user interface, which is accessible
> during content authoring. If a user wants to refer to an intra-site
> document which has not yet been created, this UI will collect the
> required information and would create both the Link and the target
> document. Also, if the user wants to refer to an external document
> (that is, link.target == null), the UI will collect the required
> information and would create just the Link object.

There is a *lot* of overlap between link management and wikis.

Again, here's my take on link management:

     o Updating links (typically via regeneration)
     o Updating the interpretation of URIs
     o Detecting & fixing problems

My definition of a wiki is general, but still fairly mainstream:

      A wiki is just a facade over query URLs & other markup;
      the intention of the facade is to make in-context editing,
      inter-page linking, and preview/publish operations as easy
      as possible.

It sounds to me as though you want link management *and* a wiki.

Wikis have languished in large organizations because there is
no real management system behind them, no good way to repurpose
content, import and reformat content automatically, etc.
With Alfresco, many of these key features could be implemented
in a forms-driven, sandboxed/virtualized way. Getting the
details right is hard & careful work, but well worth it.

Because of the intimate relationship between how wikis *should*
work, and the kinds of link management options that we're
seeing a clear demand for, contemplating these topics together
is very important. Obviously, some link management operations
are applicable to content that has nothing to do with
wikis at all, but envisioning what you'd really *like* to do
with a wiki can tell you a lot about what you'd like to do
with managing links.

> ---
>
> Documents managed by the WCM may refer to Links within their content.
> The files stored in a managed website are not necessarily the exact
> files to be previewed or published. They may be blueprints from which
> the published files are built. This could be accomplished by allowing
> all managed documents to be stored as, say, Freemarker templates, which
> may assume that appropriate Link variables exist:

They can be templates and their associated forms;
thus your "wiki" interface might really be a
fully general form interface with the ability to
insert certain kinds of markup like paragraphs
and lists via form controls that are selectable
from a menu. The raw output of the form could
be well-structured XML that's presented via
freemarker, xslt, or whatever you'd like.
The output of the XML rendering engine could
be pretty much anything too... from static html
to a JSP, or some other on-the-fly server-parsed
format of your choosing.

> For example, the document "something.html" would be stored with
> embedded FTL tags:
>
> To find out more about our products, see
> <a href='${links["Products"].url}'>${links["Products"].title}</a>.

For big sites, it's usually best for the bulk of the pages
to be precompiled via a templating system that allows
the majority of users to enter well-structured data only.

This keeps the data in a form that allows it to be
re-purposed by other data rendering engines,
should the need arise to do structured queries
on it, or have the data appear in other output
formats (e.g.: .pdf, .doc, with rss feeds, etc.).

> The exact methods of doing the templating could be hidden from the
> content author, who instead is presented with an editor UI allowing the
> selection of or creation of Links.

Users should have a UI for entering content.
I see links as just another kind of content.

   From this perspective, anything we do to help the user create
   special kinds of links needs to be thought about from within
   the context of the overall forms-driven UI.

   The implication here is that our current forms interface
   needs to be a lot more general. It needs to allow
   a form author to insert arbitrary widgets. Currently
   we just supply a tool that autogenerates forms from XSDs.
   That's a nice thing but it's not enough. We know.

   Ideally, I think we should let you configure any sort
   of custom form with the main Alfresco GUI, and make it
   easy for you to cause the main GUI to bring up *your*
   form again when the captured data is re-edited.

> If the content author selects an
> existing Link, the underlying template document would include the
> necessary markup, but the content author would not need to know about
> it. If the content author creates a new Link, the WCM would
> additionally create an associated target document if one is required.
> The content author should be allowed but not required to be involved in
> the details of the templating.

Transparency lends itself very nicely to location-based links.
Function-style links could still be virtualized via referer headers
within a servlet. All of this is also neatly orthogonal to
dependency manager mediated regeneration for pages that include
"hard coded" links/layout, and content that has been sucked in
from elsewhere (e.g.: XML files).

The nice thing about doing this on top of Alfresco's production
model & layered repository is that you can have staging hierarchies.

Alfresco provides a virtualized, "sandboxed" development environment,
so when we think about links (whether the hard-coded, or dynamically
resolved variety), there's always an eye toward doing what we
can to enable parallel/team contribution.

Therefore, I'm keen on avoiding singletons when it comes
to how links are resolved. For example, I see resolution via
a straight database lookup followed by retrieving a blob
(as is typical of most wikis) as a throwback to the days
that preceded content management. I'm certain we can
improve on this. Concretely, in a normal wiki, the choice
is "preview or publish to everybody". Using Alfresco's
layers/sandboxes, custom forms (and possibly special-purpose
mime type parsers in the live rendering environment),
you could create a system that let you publish to a limited
group, and then promote/deploy via workflow upon approval.
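The layering idea can be illustrated with Python's `collections.ChainMap`, where reads fall through to lower layers and writes land only in the top one. This is an analogy, not Alfresco's repository API; `staging`, `alice_sandbox`, and `promote` are hypothetical names.

```python
from collections import ChainMap

# Lower layer: what everybody currently sees.
staging = {"/index.html": "v1 home", "/products.html": "v1 products"}

# Upper layer: starts empty and sees staging through the layer below.
alice_sandbox = {}
alice_view = ChainMap(alice_sandbox, staging)

# A write through the view lands in the sandbox only.
alice_view["/products.html"] = "v2 products (draft)"
staging_before = staging["/products.html"]


def promote(sandbox, target):
    """Copy approved sandbox changes down into the target layer."""
    target.update(sandbox)
    sandbox.clear()
```

Until `promote(alice_sandbox, staging)` is called (for example, by a workflow step after approval), only readers of `alice_view` see the draft, so resolving a link depends on which layer stack the reader is looking through.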

Obviously, some of this is a bit forward-looking because our
workflow system and forms-based systems need some further
enhancements, and if there's custom markup for links in certain
forms the best grammar & server-side parsing is still TBD.

I've been very quiet about all this stuff because we're trying to
get a release out the door, but it's affirming in many ways to see
you going down some of the same paths. When I have more time,
I'll try to flesh out my plans in more detail. Comments/feedback
are greatly appreciated.

> ---
>
> Documents declared to be templates in this way would then need to be
> transformed into their final static form by the WCM prior to previewing
> or publishing.

    Or totally on-the-fly.
    Again, it depends on the particulars.
    These are the sorts of trade-offs I'll be looking at.

> Continuing with the Freemarker example, this
> transformation would generate the corresponding html file:
>
> To find out more about our products, see
> <a href='http://oursite.com/store/products.html'>Oursite's Products</a>.
>
> This transformation step could be triggered by preview or publish, in
> which case Link dependencies may not need to be maintained. Only at
> preview/publish time would the Links actually be resolved. This is the
> easiest way to maintain consistency, but may cause previewing and
> publishing to be slow.
>
> Or, transformation could be triggered immediately upon document
> creation, which would presumably add the document to the dependency
> lists of the referenced Links. When any of a document's referenced
> Links are subsequently updated, the document would be re-transformed.
> This pushes the load onto document creation. Publishing is quick, but
> Link updates are slow.
>
> A third, hybrid option is where the transform takes place upon document
> creation, and updating a Link would only cause all the dependencies to
> be -marked- for future retransformation. This future retransformation
> could happen during spare cycles or, at the latest, upon
> preview/publish.

   You could do it more lazily than this.
   On the first parse, update a table in a database.
   Cache the result of the parse just like you'd cache
   a compiled JSP (and invalidate it and the corresponding
   table entry in the same way). Don't bother computing
   lists of inbound/outbound links until actually asked.

   This way, you could do it incrementally during
   development, and/or do batch compilation and deploy
   both the pages & their caches at publish time
   (again, like JSPs).
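A rough sketch of that lazy scheme follows, with an in-memory dict standing in for the database table and a modification stamp standing in for whatever invalidation signal the real system would use. The class `LinkCache`, its methods, and the simple `href` regex are assumptions made for the example.

```python
import re

# Deliberately simple: matches double-quoted href attributes only.
HREF = re.compile(r'href="([^"]+)"')


class LinkCache:
    """Caches each file's outbound links, keyed by a modification stamp."""

    def __init__(self):
        self._entries = {}  # path -> (stamp, outbound links)
        self.parses = 0     # counts real parses, to show cache hits

    def outbound(self, path, stamp, text):
        entry = self._entries.get(path)
        if entry is None or entry[0] != stamp:
            # First parse, or the file changed: reparse, replace the entry.
            self.parses += 1
            entry = (stamp, HREF.findall(text))
            self._entries[path] = entry
        return entry[1]

    def inbound(self, target):
        """Computed only when asked, from whatever has been parsed so far."""
        return sorted(path for path, (_, links) in self._entries.items()
                      if target in links)


cache = LinkCache()
first = cache.outbound("/index.html", 1, '<a href="/products.html">P</a>')
again = cache.outbound("/index.html", 1, '<a href="/products.html">P</a>')
parses_after_hit = cache.parses
edited = cache.outbound("/index.html", 2, '<a href="/about.html">A</a>')
```

The second call with the same stamp is served from the cache; the call with a new stamp invalidates the entry and reparses, much like a compiled JSP being rebuilt when its source changes.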

> ---
>
> That is a rough sketch of a Link Management System. Some of the
> details seem well-suited to Alfresco's framework. The event triggering
> mentioned could be implemented via workflow. The Link objects
> themselves would probably reside in the repository as either separate
> objects or as aspects of documents. The transformation and
> all-documents-are-templates components may be more of a hassle to
> implement as a module.

o Tomcat is already set up to allow for custom
parsers to be attached to mime types.

o Dependency data is best managed in a database.
The parser could just talk to MySQL (or whatever).

   o The files themselves should just be kept within
      Alfresco as normal files. This way, you can
      edit them via cifs, manipulate them via the
      normal GUI, use the fact that you've got
      transparency, etc.

> ---
>
> Documents declared to be templates in this way would then need to
> be transformed into their final static form by the WCM prior to
> previewing or publishing.
>
> Actually, I think all templates should be transformed in this way.

I think there's a place in the world
for precompiled and on-the-fly templates.
Alfresco will probably remain neutral
on this issue, and allow the product to
be configured in whatever way best
suits the needs of its users.

> Right now, the WCM uses XSLT to transform data (xml from XForms) into
> documents which sit in the repository alongside the data. This is
> confusing in that the generated document could conceivably be edited by
> the content author, when it's really just an aspect of the source
> document.

If you edit a document that has been autogenerated
via a form, the GUI will associate the edit action
with the form again, and pre-load the form with the
data it used to generate the file.

We don't want to dictate where people put their
generated files, or how they inter-mix generated
vs hand-crafted files in their directory structures.

The fact that most people use a GUI to edit/generate
files, and that the GUI handles this scenario in
a manner that isn't ambiguous means that in most
practical situations confusion won't arise.

If you want to alert those who use command-line tools
to the fact that a file was generated via some tool,
the standard practice is to insert a comment in the
file to this effect via the template itself. This
allows hand-edits to occur when a developer *really*
wants to do one (presumably such a person would
have a valid reason to defy such a warning... such
as rapid-turnaround tweaking in the output as
a first phase of trying to debug the template itself).

> That's like using a build system and the binary sits
> alongside the source in your repository.

This analogy does not quite match the situation you
have with auto-generated html vs hand-crafted html,
because each kind of html file may need to have links
to the other, and there may be site-specific reasons
why the links of one (or both) need to be relative
(or absolute) to a particular location in the site.

> The right behavior (or at least a right behavior) would be that this
> transformation is not something that becomes a physical document until
> preview or publish time.
>
> In fact, just like a reasonable build system, it should be possible for
> destination documents to have many-to-many relationships with source
> documents.
>
> Posted by Joseph A Calzaretta at Nov 28, 2006 15:11
>
> I don't believe that dependencies are a property of a link object. The
> dependencies belong somewhere else, like a general purpose dependency
> map. The relationships should probably go (dependency -> link) rather
> than (link -> dependency)
>
> Posted by Catherine T Iannuzzo at Jan 05, 2007 17:00

Yes; dependency info should be totally out-of-band.

Cheers,
-Jon
